[V4,1/4] scsi: hpsa: fix selection of reply queue

Message ID	20180309033218.23042-2-ming.lei@redhat.com (mailing list archive)
State	Changes Requested
Headers	Return-Path: <linux-scsi-owner@kernel.org>
	From: Ming Lei <ming.lei@redhat.com>
	To: James Bottomley <James.Bottomley@HansenPartnership.com>, Jens Axboe <axboe@fb.com>, "Martin K . Petersen" <martin.petersen@oracle.com>
	Cc: Christoph Hellwig <hch@lst.de>, linux-scsi@vger.kernel.org, linux-block@vger.kernel.org, Meelis Roos <mroos@linux.ee>, Don Brace <don.brace@microsemi.com>, Kashyap Desai <kashyap.desai@broadcom.com>, Laurence Oberman <loberman@redhat.com>, Mike Snitzer <snitzer@redhat.com>, Ming Lei <ming.lei@redhat.com>, Hannes Reinecke <hare@suse.de>, James Bottomley <james.bottomley@hansenpartnership.com>, Artem Bityutskiy <artem.bityutskiy@intel.com>
	Subject: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue
	Date: Fri, 9 Mar 2018 11:32:15 +0800
	Message-Id: <20180309033218.23042-2-ming.lei@redhat.com>
	In-Reply-To: <20180309033218.23042-1-ming.lei@redhat.com>
	References: <20180309033218.23042-1-ming.lei@redhat.com>
	Sender: linux-scsi-owner@vger.kernel.org
	Precedence: bulk

Ming Lei March 9, 2018, 3:32 a.m. UTC

Since 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs),
an MSI-X vector can be created without any online CPU mapped to it, so
the completion of a command may never be notified.

This patch sets up a mapping between CPUs and reply queues according to
the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
mapping table to choose the reply queue when queuing a command.

The chosen reply queue is then always an active one, which fixes the IO
hang caused by using an inactive reply queue that doesn't have any online
CPU mapped.

Cc: Hannes Reinecke <hare@suse.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Don Brace <don.brace@microsemi.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Artem Bityutskiy <artem.bityutskiy@intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Don Brace <don.brace@microsemi.com>
Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/hpsa.c | 73 +++++++++++++++++++++++++++++++++++++++--------------
 drivers/scsi/hpsa.h |  1 +
 2 files changed, 55 insertions(+), 19 deletions(-)
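
For illustration only, the mapping described in the commit message can be
sketched in user space roughly as follows. This is a sketch under stated
assumptions, not the driver code: NR_CPUS, the uint32_t bitmasks (bit n
stands for CPU n), old_reply_queue(), setup_reply_map() and
queue_is_active() are all hypothetical stand-ins, and a zero mask plays the
role of a NULL return from pci_irq_get_affinity().

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8u	/* hypothetical number of possible CPUs */

/* Old selection: pick the queue from the submitting CPU number alone. */
static unsigned int old_reply_queue(unsigned int cpu, unsigned int nr_queues)
{
	assert(nr_queues != 0);
	return cpu % nr_queues;
}

/*
 * New selection: build a cpu -> reply queue table from the per-vector
 * affinity masks. affinity[q] is a bitmask of the CPUs assigned to
 * vector q; a zero mask models "no affinity info available" and sends
 * every CPU to queue 0, mirroring the fallback in the patch.
 */
static void setup_reply_map(unsigned int reply_map[NR_CPUS],
			    const uint32_t *affinity, unsigned int nr_queues)
{
	unsigned int queue, cpu;

	for (queue = 0; queue < nr_queues; queue++) {
		if (!affinity[queue])
			goto fallback;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (affinity[queue] & (UINT32_C(1) << cpu))
				reply_map[cpu] = queue;
	}
	return;

fallback:
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		reply_map[cpu] = 0;
}

/* A queue is "active" if at least one CPU in its vector's mask is online. */
static int queue_is_active(uint32_t affinity_mask, uint32_t online_mask)
{
	return (affinity_mask & online_mask) != 0;
}
```

With 8 possible CPUs of which only CPUs 0-3 are online, and 4 vectors
spread over all possible CPUs as masks 0x03, 0x0c, 0x30 and 0xc0, the old
scheme sends CPU 2 to queue 2, whose vector covers only offline CPUs. The
table instead sends CPU 2 to queue 1, whose mask contains CPU 2 itself.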

Christoph Hellwig March 10, 2018, 10:09 a.m. UTC | #1

> +static void hpsa_setup_reply_map(struct ctlr_info *h)
> +{
> +	const struct cpumask *mask;
> +	unsigned int queue, cpu;
> +
> +	for (queue = 0; queue < h->msix_vectors; queue++) {
> +		mask = pci_irq_get_affinity(h->pdev, queue);
> +		if (!mask)
> +			goto fallback;
> +
> +		for_each_cpu(cpu, mask)
> +			h->reply_map[cpu] = queue;
> +	}
> +	return;
> +
> +fallback:
> +	for_each_possible_cpu(cpu)
> +		h->reply_map[cpu] = 0;
> +}

> +	h->reply_map = kzalloc(sizeof(*h->reply_map) * nr_cpu_ids, GFP_KERNEL);
> +	if (!h->reply_map) {
> +		kfree(h);
> +		return NULL;
> +	}
> +	return h;

I really dislike this being open coded in drivers.  It really should
be a helper shared with the blk-mq map building that drivers just use.

For now just have a low-level blk_pci_map_queues that
blk_mq_pci_map_queues, hpsa and megaraid can share.  In the long run
it might make sense to change the blk-mq callout to that low-level
prototype as well.

Ming Lei March 10, 2018, 3:01 p.m. UTC | #2

On Sat, Mar 10, 2018 at 11:09:59AM +0100, Christoph Hellwig wrote:
> > +static void hpsa_setup_reply_map(struct ctlr_info *h)
> > +{
> > +	const struct cpumask *mask;
> > +	unsigned int queue, cpu;
> > +
> > +	for (queue = 0; queue < h->msix_vectors; queue++) {
> > +		mask = pci_irq_get_affinity(h->pdev, queue);
> > +		if (!mask)
> > +			goto fallback;
> > +
> > +		for_each_cpu(cpu, mask)
> > +			h->reply_map[cpu] = queue;
> > +	}
> > +	return;
> > +
> > +fallback:
> > +	for_each_possible_cpu(cpu)
> > +		h->reply_map[cpu] = 0;
> > +}
> 
> > +	h->reply_map = kzalloc(sizeof(*h->reply_map) * nr_cpu_ids, GFP_KERNEL);
> > +	if (!h->reply_map) {
> > +		kfree(h);
> > +		return NULL;
> > +	}
> > +	return h;
> 
> I really dislike this being open coded in drivers.  It really should
> be a helper shared with the blk-mq map building that drivers just use.
> 
> For now just have a low-level blk_pci_map_queues that
> blk_mq_pci_map_queues, hpsa and megaraid can share.  In the long run
> it might make sense to change the blk-mq callout to that low-level
> prototype as well.

This way of selecting the reply queue is needed for the non-scsi_mq case too.

Thanks,
Ming

Bityutskiy, Artem March 12, 2018, 7:37 a.m. UTC | #3

Linux-Regression-ID: lr#15a115

On Fri, 2018-03-09 at 11:32 +0800, Ming Lei wrote:
> Since 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs),
> an MSI-X vector can be created without any online CPU mapped to it, so
> the completion of a command may never be notified.
>
> This patch sets up a mapping between CPUs and reply queues according to
> the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
> mapping table to choose the reply queue when queuing a command.
>
> The chosen reply queue is then always an active one, which fixes the IO
> hang caused by using an inactive reply queue that doesn't have any online
> CPU mapped.
>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Don Brace <don.brace@microsemi.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Meelis Roos <mroos@linux.ee>
> Cc: Artem Bityutskiy <artem.bityutskiy@intel.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Tested-by: Laurence Oberman <loberman@redhat.com>
> Tested-by: Don Brace <don.brace@microsemi.com>
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>


Tested-by: Artem Bityutskiy <artem.bityutskiy@intel.com>

Link: https://lkml.kernel.org/r/1519311270.2535.53.camel@intel.com

These 2 patches make the Dell R640 regression that I reported go away.
Tested on top of v4.16-rc5, thanks!

-- 
Best Regards,
Artem Bityutskiy

Christoph Hellwig March 12, 2018, 7:52 a.m. UTC | #4

On Sat, Mar 10, 2018 at 11:01:43PM +0800, Ming Lei wrote:
> > I really dislike this being open coded in drivers.  It really should
> > be a helper shared with the blk-mq map building that drivers just use.
> > 
> > For now just have a low-level blk_pci_map_queues that
> > blk_mq_pci_map_queues, hpsa and megaraid can share.  In the long run
> > it might make sense to change the blk-mq callout to that low-level
> > prototype as well.
> 
> This way of selecting the reply queue is needed for the non-scsi_mq case too.

Which still doesn't prevent you from using a common helper.

Ming Lei March 12, 2018, 9:19 a.m. UTC | #5

On Mon, Mar 12, 2018 at 08:52:02AM +0100, Christoph Hellwig wrote:
> On Sat, Mar 10, 2018 at 11:01:43PM +0800, Ming Lei wrote:
> > > I really dislike this being open coded in drivers.  It really should
> > > be a helper shared with the blk-mq map building that drivers just use.
> > > 
> > > For now just have a low-level blk_pci_map_queues that
> > > blk_mq_pci_map_queues, hpsa and megaraid can share.  In the long run
> > > it might make sense to change the blk-mq callout to that low-level
> > > prototype as well.
> > 
> > This way of selecting the reply queue is needed for the non-scsi_mq case too.
> 
> Which still doesn't prevent you from using a common helper.

The only common code is the following part:

+       for (queue = 0; queue < instance->msix_vectors; queue++) {
+               mask = pci_irq_get_affinity(instance->pdev, queue);
+               if (!mask)
+                       goto fallback;
+
+               for_each_cpu(cpu, mask)
+                       instance->reply_map[cpu] = queue;
+       }

For megaraid_sas, the fallback code needs to handle the mapping in the
following way for legacy vectors:

       for_each_possible_cpu(cpu)
               instance->reply_map[cpu] = cpu % instance->msix_vectors;


So I am not sure a common helper is worthwhile, given that there may not
be any other potential users of it.
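
To make the trade-off concrete, here is a rough user-space sketch of what a
shared helper with a per-driver fallback might look like. Nothing here is an
existing kernel interface: map_reply_queues(), fallback_fn,
fallback_queue_zero() and fallback_round_robin() are hypothetical names,
NR_CPUS is an assumed constant, and the uint32_t bitmasks (bit n stands for
CPU n, zero meaning no affinity info) stand in for what
pci_irq_get_affinity() returns.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8u	/* hypothetical number of possible CPUs */

/* Per-driver policy used when a vector has no affinity info. */
typedef unsigned int (*fallback_fn)(unsigned int cpu, unsigned int nr_queues);

/* hpsa-style fallback: every CPU uses reply queue 0. */
static unsigned int fallback_queue_zero(unsigned int cpu,
					unsigned int nr_queues)
{
	(void)cpu;
	(void)nr_queues;
	return 0;
}

/* megaraid_sas-style fallback: spread CPUs over the queues. */
static unsigned int fallback_round_robin(unsigned int cpu,
					 unsigned int nr_queues)
{
	assert(nr_queues != 0);
	return cpu % nr_queues;
}

/* Common part: walk the vectors and record cpu -> queue. */
static void map_reply_queues(unsigned int reply_map[NR_CPUS],
			     const uint32_t *affinity, unsigned int nr_queues,
			     fallback_fn fallback)
{
	unsigned int queue, cpu;

	for (queue = 0; queue < nr_queues; queue++) {
		if (!affinity[queue])
			goto use_fallback;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (affinity[queue] & (UINT32_C(1) << cpu))
				reply_map[cpu] = queue;
	}
	return;

use_fallback:
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		reply_map[cpu] = fallback(cpu, nr_queues);
}
```

The shared part is only the loop over the vectors; the callback carries the
one line that differs between the two drivers, which is the basis for the
doubt about whether the helper pays for itself.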

Thanks,
Ming

Don Brace March 12, 2018, 3:39 p.m. UTC | #6

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Thursday, March 08, 2018 9:32 PM
> To: James Bottomley <James.Bottomley@HansenPartnership.com>; Jens Axboe
> <axboe@fb.com>; Martin K . Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>; linux-scsi@vger.kernel.org;
> linux-block@vger.kernel.org; Meelis Roos <mroos@linux.ee>; Don Brace
> <don.brace@microsemi.com>; Kashyap Desai
> <kashyap.desai@broadcom.com>; Laurence Oberman
> <loberman@redhat.com>; Mike Snitzer <snitzer@redhat.com>; Ming Lei
> <ming.lei@redhat.com>; Hannes Reinecke <hare@suse.de>; James Bottomley
> <james.bottomley@hansenpartnership.com>; Artem Bityutskiy
> <artem.bityutskiy@intel.com>
> Subject: [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue
> 
> Since 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs),
> an MSI-X vector can be created without any online CPU mapped to it, so
> the completion of a command may never be notified.
>
> This patch sets up a mapping between CPUs and reply queues according to
> the irq affinity info retrieved by pci_irq_get_affinity(), and uses this
> mapping table to choose the reply queue when queuing a command.
>
> The chosen reply queue is then always an active one, which fixes the IO
> hang caused by using an inactive reply queue that doesn't have any online
> CPU mapped.
> 
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
> Cc: James Bottomley <james.bottomley@hansenpartnership.com>,
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Don Brace <don.brace@microsemi.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Meelis Roos <mroos@linux.ee>
> Cc: Artem Bityutskiy <artem.bityutskiy@intel.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Tested-by: Laurence Oberman <loberman@redhat.com>
> Tested-by: Don Brace <don.brace@microsemi.com>
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---

Acked-by: Don Brace <don.brace@microsemi.com>
Tested-by: Don Brace <don.brace@microsemi.com>
   * Rebuilt test rig: applied the following patches to Linus's tree 4.16.0-rc4+:
        [PATCH V4 1/4] scsi: hpsa: fix selection of reply queue (Ming Lei, 2018-03-08)
        [PATCH V4 3/4] scsi: introduce force_blk_mq (Ming Lei, 2018-03-08)
   * fio tests on 6 LVs on P441 controller (fw 6.59), 5 days.
   * fio tests on 10 HBA disks on P431 controller (fw 4.54), 3 days (concurrent with the P441 tests).

>  drivers/scsi/hpsa.c | 73 +++++++++++++++++++++++++++++++++++++++--------------
>  drivers/scsi/hpsa.h |  1 +
>  2 files changed, 55 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 5293e6827ce5..3a9eca163db8 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -1045,11 +1045,7 @@ static void set_performant_mode(struct ctlr_info *h, struct CommandList *c,
>                 c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
>                 if (unlikely(!h->msix_vectors))
>                         return;
> -               if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -                       c->Header.ReplyQueue =
> -                               raw_smp_processor_id() % h->nreply_queues;
> -               else
> -                       c->Header.ReplyQueue = reply_queue % h->nreply_queues;
> +               c->Header.ReplyQueue = reply_queue;
>         }
>  }
> 
> @@ -1063,10 +1059,7 @@ static void set_ioaccel1_performant_mode(struct ctlr_info *h,
>          * Tell the controller to post the reply to the queue for this
>          * processor.  This seems to give the best I/O throughput.
>          */
> -       if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -               cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
> -       else
> -               cp->ReplyQueue = reply_queue % h->nreply_queues;
> +       cp->ReplyQueue = reply_queue;
>         /*
>          * Set the bits in the address sent down to include:
>          *  - performant mode bit (bit 0)
> @@ -1087,10 +1080,7 @@ static void set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
>         /* Tell the controller to post the reply to the queue for this
>          * processor.  This seems to give the best I/O throughput.
>          */
> -       if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -               cp->reply_queue = smp_processor_id() % h->nreply_queues;
> -       else
> -               cp->reply_queue = reply_queue % h->nreply_queues;
> +       cp->reply_queue = reply_queue;
>         /* Set the bits in the address sent down to include:
>          *  - performant mode bit not used in ioaccel mode 2
>          *  - pull count (bits 0-3)
> @@ -1109,10 +1099,7 @@ static void set_ioaccel2_performant_mode(struct ctlr_info *h,
>          * Tell the controller to post the reply to the queue for this
>          * processor.  This seems to give the best I/O throughput.
>          */
> -       if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> -               cp->reply_queue = smp_processor_id() % h->nreply_queues;
> -       else
> -               cp->reply_queue = reply_queue % h->nreply_queues;
> +       cp->reply_queue = reply_queue;
>         /*
>          * Set the bits in the address sent down to include:
>          *  - performant mode bit not used in ioaccel mode 2
> @@ -1157,6 +1144,8 @@ static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
>  {
>         dial_down_lockup_detection_during_fw_flash(h, c);
>         atomic_inc(&h->commands_outstanding);
> +
> +       reply_queue = h->reply_map[raw_smp_processor_id()];
>         switch (c->cmd_type) {
>         case CMD_IOACCEL1:
>                 set_ioaccel1_performant_mode(h, c, reply_queue);
> @@ -7376,6 +7365,26 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
>         h->msix_vectors = 0;
>  }
> 
> +static void hpsa_setup_reply_map(struct ctlr_info *h)
> +{
> +       const struct cpumask *mask;
> +       unsigned int queue, cpu;
> +
> +       for (queue = 0; queue < h->msix_vectors; queue++) {
> +               mask = pci_irq_get_affinity(h->pdev, queue);
> +               if (!mask)
> +                       goto fallback;
> +
> +               for_each_cpu(cpu, mask)
> +                       h->reply_map[cpu] = queue;
> +       }
> +       return;
> +
> +fallback:
> +       for_each_possible_cpu(cpu)
> +               h->reply_map[cpu] = 0;
> +}
> +
>  /* If MSI/MSI-X is supported by the kernel we will try to enable it on
>   * controllers that are capable. If not, we use legacy INTx mode.
>   */
> @@ -7771,6 +7780,10 @@ static int hpsa_pci_init(struct ctlr_info *h)
>         err = hpsa_interrupt_mode(h);
>         if (err)
>                 goto clean1;
> +
> +       /* setup mapping between CPU and reply queue */
> +       hpsa_setup_reply_map(h);
> +
>         err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
>         if (err)
>                 goto clean2;    /* intmode+region, pci */
> @@ -8480,6 +8493,28 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
>         return wq;
>  }
> 
> +static void hpda_free_ctlr_info(struct ctlr_info *h)
> +{
> +       kfree(h->reply_map);
> +       kfree(h);
> +}
> +
> +static struct ctlr_info *hpda_alloc_ctlr_info(void)
> +{
> +       struct ctlr_info *h;
> +
> +       h = kzalloc(sizeof(*h), GFP_KERNEL);
> +       if (!h)
> +               return NULL;
> +
> +       h->reply_map = kzalloc(sizeof(*h->reply_map) * nr_cpu_ids, GFP_KERNEL);
> +       if (!h->reply_map) {
> +               kfree(h);
> +               return NULL;
> +       }
> +       return h;
> +}
> +
>  static int hpsa_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  {
>         int dac, rc;
> @@ -8517,7 +8552,7 @@ static int hpsa_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>          * the driver.  See comments in hpsa.h for more info.
>          */
>         BUILD_BUG_ON(sizeof(struct CommandList) % COMMANDLIST_ALIGNMENT);
> -       h = kzalloc(sizeof(*h), GFP_KERNEL);
> +       h = hpda_alloc_ctlr_info();
>         if (!h) {
>                 dev_err(&pdev->dev, "Failed to allocate controller head\n");
>                 return -ENOMEM;
> @@ -8916,7 +8951,7 @@ static void hpsa_remove_one(struct pci_dev *pdev)
>         h->lockup_detected = NULL;                      /* init_one 2 */
>         /* (void) pci_disable_pcie_error_reporting(pdev); */    /* init_one 1 */
> 
> -       kfree(h);                                       /* init_one 1 */
> +       hpda_free_ctlr_info(h);                         /* init_one 1 */
>  }
> 
>  static int hpsa_suspend(__attribute__((unused)) struct pci_dev *pdev,
> diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
> index 018f980a701c..fb9f5e7f8209 100644
> --- a/drivers/scsi/hpsa.h
> +++ b/drivers/scsi/hpsa.h
> @@ -158,6 +158,7 @@ struct bmic_controller_parameters {
>  #pragma pack()
> 
>  struct ctlr_info {
> +       unsigned int *reply_map;
>         int     ctlr;
>         char    devname[8];
>         char    *product_name;
> --
> 2.9.5
