
[2/2] scsi: core: avoid pre-allocating a big chunk for the sg list

Message ID 20190423103240.29864-3-ming.lei@redhat.com (mailing list archive)
State New, archived
Series scsi: core: avoid big pre-allocation for sg list

Commit Message

Ming Lei April 23, 2019, 10:32 a.m. UTC
Currently scsi_mq_setup_tags() pre-allocates a big buffer for the IO
sg list, and the buffer size is scsi_mq_sgl_size(), which is determined
by the smaller of shost->sg_tablesize and SG_CHUNK_SIZE.

Modern HBAs' DMA engines can often handle a very large number of
segments, so scsi_mq_sgl_size() is often big. If the SG_CHUNK_SIZE
maximum is taken, scsi_mq_sgl_size() comes to 4KB.
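(For reference, that figure assumes 32 bytes per struct scatterlist on
a typical 64-bit build: SG_CHUNK_SIZE * 32 bytes = 128 * 32 = 4KB of
inline sg space per pre-allocated request.)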

Then if one HBA has lots of hw queues, and each queue's depth is big,
the whole sg list pre-allocation can consume a huge amount of memory.
Taking lpfc as an example, nr_hw_queues can be 70 and each queue's
depth can be 3781, so the pre-allocation for the data sg list can reach
70*3781*2k = 517MB for a single HBA.

There is also an internal Red Hat report that scsi_debug based tests
can't be run any more since the legacy IO path was removed, because the
pre-allocation is too big.

This patch switches to runtime allocation for the sg list, and
meanwhile pre-allocates 2 inline sg entries. NVMe has used this
approach for a while, so it should be fine for SCSI too.
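(With 2 inline entries the per-command sg area shrinks to 2 * 32 = 64
bytes under the same 32-byte-per-entry assumption; anything larger is
allocated at runtime from the chained sg pools.)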

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

Comments

Bart Van Assche April 23, 2019, 3:37 p.m. UTC | #1
On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
>  #define  SCSI_INLINE_PROT_SG_CNT  1
>  
> +#define  SCSI_INLINE_SG_CNT  2

So this patch inserts one kmalloc() and one kfree() call in the hot path
for every SCSI request with more than two elements in its scatterlist? Isn't
"2" too low? Or in other words, what is the performance impact of this patch
for a real-world workload?

Thanks,

Bart.
Christoph Hellwig April 24, 2019, 5:53 a.m. UTC | #2
On Tue, Apr 23, 2019 at 06:32:40PM +0800, Ming Lei wrote:
> big, the whole pre-allocation for sg list can consume huge memory.
> For example of lpfc, nr_hw_queues can be 70, each queue's depth
> can be 3781, so the pre-allocation for data sg list can be 70*3781*2k
> =517MB for single HBA.

We should probably limit the number of queues to something actually
useful, independent of your patch..

> +static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
> +{
> +	struct scatterlist *sg = (void *)cmd + sizeof(struct scsi_cmnd) +
> +		cmd->device->host->hostt->cmd_size;
> +
> +	return cmd->sdb.table.sgl == sg;
> +}

It might make more sense to have a helper to calculate the inline
sg address and use that for the comparison in scsi_mq_free_sgtables
and any other place that wants the address.
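
Something like the following would do it (just a sketch, and the helper
name is made up here); scsi_use_inline_sg() then becomes a comparison
against it:

static struct scatterlist *scsi_cmd_inline_sg(struct scsi_cmnd *cmd)
{
	/* the inline sg array sits right after the per-command driver data */
	return (void *)cmd + sizeof(struct scsi_cmnd) +
		cmd->device->host->hostt->cmd_size;
}

static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
{
	return cmd->sdb.table.sgl == scsi_cmd_inline_sg(cmd);
}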

> +	if (cmd->sdb.table.nents && !scsi_use_inline_sg(cmd))
> +		sg_free_table_chained(&cmd->sdb.table, false);

This removes the last use of the first_chunk parameter to
sg_free_table_chained; please remove the parameter in an additional
patch.

> +	if (nr_segs <= SCSI_INLINE_SG_CNT)
> +		sdb->table.nents = sdb->table.orig_nents =
> +			SCSI_INLINE_SG_CNT;

Don't we need a sg_init_table here?
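
E.g. roughly like this for the inline branch (just a sketch, not
necessarily what V2 should look like):

	if (nr_segs <= SCSI_INLINE_SG_CNT) {
		sdb->table.nents = sdb->table.orig_nents =
			SCSI_INLINE_SG_CNT;
		sg_init_table(sdb->table.sgl, SCSI_INLINE_SG_CNT);
	}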

> +	else if (unlikely(sg_alloc_table_chained(&sdb->table, nr_segs,
> +					NULL)))
>  		return BLK_STS_RESOURCE;

We should probably also be able to drop the last parameter to
sg_alloc_table_chained now.
Ming Lei April 24, 2019, 7:52 a.m. UTC | #3
On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> >  #define  SCSI_INLINE_PROT_SG_CNT  1
> >  
> > +#define  SCSI_INLINE_SG_CNT  2
> 
> So this patch inserts one kmalloc() and one kfree() call in the hot path
> for every SCSI request with more than two elements in its scatterlist? Isn't

Slab and its variants are designed for the fast path, and NVMe PCI uses
slab for allocating the sg list in its fast path too.

> "2" too low? Or in other words, what is the performance impact of this patch
> for a real-world workload?

NVMe PCI has used 2 for a while, and recently it was even reduced to 1.

I have run a big-BS (256k) fio test on scsi_debug (can_queue: 256, submit queues: 12, LUNs: 4,
delay: 0, io sched: none) and did not see an obvious performance difference from this patch.

Thanks, 
Ming
Ming Lei April 24, 2019, 8:41 a.m. UTC | #4
On Wed, Apr 24, 2019 at 07:53:17AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 23, 2019 at 06:32:40PM +0800, Ming Lei wrote:
> > big, the whole pre-allocation for sg list can consume huge memory.
> > For example of lpfc, nr_hw_queues can be 70, each queue's depth
> > can be 3781, so the pre-allocation for data sg list can be 70*3781*2k
> > =517MB for single HBA.
> 
> We should probably limit the number of queues to something actually
> useful, independent of your patch..
> 
> > +static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
> > +{
> > +	struct scatterlist *sg = (void *)cmd + sizeof(struct scsi_cmnd) +
> > +		cmd->device->host->hostt->cmd_size;
> > +
> > +	return cmd->sdb.table.sgl == sg;
> > +}
> 
> It might make more sense to have a helper to calculate the inline
> sg address and use that for the comparison in scsi_mq_free_sgtables
> and any other place that wants the address.

Good idea, and the helper can be used in scsi_mq_prep_fn() too.

> 
> > +	if (cmd->sdb.table.nents && !scsi_use_inline_sg(cmd))
> > +		sg_free_table_chained(&cmd->sdb.table, false);
> 
> This removes the last use of the first_chunk parameter to
> sg_free_table_chained; please remove the parameter in an additional
> patch.

NVMe FC/RDMA still uses first_chunk.

> 
> > +	if (nr_segs <= SCSI_INLINE_SG_CNT)
> > +		sdb->table.nents = sdb->table.orig_nents =
> > +			SCSI_INLINE_SG_CNT;
> 
> Don't we need a sg_init_table here?

Will do it in V2.

> 
> > +	else if (unlikely(sg_alloc_table_chained(&sdb->table, nr_segs,
> > +					NULL)))
> >  		return BLK_STS_RESOURCE;
> 
> We should probably also be able to drop the last parameter to
> sg_alloc_table_chained now.

NVMe FC/RDMA still uses first_chunk.

Thanks,
Ming
Christoph Hellwig April 24, 2019, 2:38 p.m. UTC | #5
On Wed, Apr 24, 2019 at 04:41:46PM +0800, Ming Lei wrote:
> > This removes the last use of the first_chunk parameter to
> > sg_free_table_chained; please remove the parameter in an additional
> > patch.
> 
> NVMe FC/RDMA still uses first_chunk.

Looks like they need a similar treatment then :)
James Bottomley April 24, 2019, 3:24 p.m. UTC | #6
On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
> On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> > On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > >  #define  SCSI_INLINE_PROT_SG_CNT  1
> > >  
> > > +#define  SCSI_INLINE_SG_CNT  2
> > 
> > So this patch inserts one kmalloc() and one kfree() call in the hot
> > path for every SCSI request with more than two elements in its
> > scatterlist? Isn't
> 
> Slab or its variants are designed for fast path, and NVMe PCI uses
> slab for allocating sg list in fast path too.

Actually, that's not really true: base kmalloc can do all sorts of
things, including kicking off reclaim, so it's not really something we
like using in the fast path.  The only fast and safe kmalloc you can
rely on in the fast path is GFP_ATOMIC, which will fail quickly if no
memory can easily be found.  *However*, the sg_table allocation
functions are all pool backed (lib/sg_pool.c), so they use the
lightweight GFP_ATOMIC mechanism for kmalloc initially, coupled with a
backing pool in case of failure to ensure forward progress.

So I think you're both right: you shouldn't simply use kmalloc, but
this implementation doesn't; it uses the sg_table allocation functions,
which correctly control kmalloc to be lightweight and efficient and
able to make forward progress.
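
Roughly speaking the chunk allocation behaves like this (a simplified
sketch of the idea, not the actual lib/sg_pool.c code):

static struct scatterlist *sg_chunk_alloc_sketch(mempool_t *pool)
{
	/*
	 * mempool_alloc() first tries the underlying slab allocator with
	 * the given mask; GFP_ATOMIC never sleeps.  If that fails it hands
	 * out one of the pre-reserved pool elements, and only when those
	 * are gone too does it return NULL, in which case the request is
	 * simply requeued (BLK_STS_RESOURCE in scsi_init_sgtable()).
	 */
	return mempool_alloc(pool, GFP_ATOMIC);
}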

James
Bart Van Assche April 24, 2019, 3:32 p.m. UTC | #7
On Wed, 2019-04-24 at 08:24 -0700, James Bottomley wrote:
> On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
> > On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> > > On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > > >  #define  SCSI_INLINE_PROT_SG_CNT  1
> > > >  
> > > > +#define  SCSI_INLINE_SG_CNT  2
> > > 
> > > So this patch inserts one kmalloc() and one kfree() call in the hot
> > > path for every SCSI request with more than two elements in its
> > > scatterlist? Isn't
> > 
> > Slab or its variants are designed for fast path, and NVMe PCI uses
> > slab for allocating sg list in fast path too.
> 
> Actually, that's not really true  base kmalloc can do all sorts of
> things including kick off reclaim so it's not really something we like
> using in the fast path.  The only fast and safe kmalloc you can rely on
>  in the fast path is GFP_ATOMIC which will fail quickly if no memory
> can easily be found.  *However* the sg_table allocation functions are
> all pool backed (lib/sg_pool.c), so they use the lightweight GFP_ATOMIC
> mechanism for kmalloc initially coupled with a backing pool in case of
> failure to ensure forward progress.
> 
> So, I think you're both right: you shouldn't simply use kmalloc, but
> this implementation doesn't, it uses the sg_table allocation functions
> which correctly control kmalloc to be lightweight and efficient and
> able to make forward progress.

Another concern is whether this change can cause a livelock. If the system
is running out of memory and the page cache submits a write request with
a scatterlist of more than two elements, and the kmalloc() for the
scatterlist fails, will that prevent the page cache from making any progress
with writeback?

Bart.
Jens Axboe April 24, 2019, 3:37 p.m. UTC | #8
On 4/24/19 9:32 AM, Bart Van Assche wrote:
> On Wed, 2019-04-24 at 08:24 -0700, James Bottomley wrote:
>> On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
>>> On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
>>>> On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
>>>>>  #define  SCSI_INLINE_PROT_SG_CNT  1
>>>>>  
>>>>> +#define  SCSI_INLINE_SG_CNT  2
>>>>
>>>> So this patch inserts one kmalloc() and one kfree() call in the hot
>>>> path for every SCSI request with more than two elements in its
>>>> scatterlist? Isn't
>>>
>>> Slab or its variants are designed for fast path, and NVMe PCI uses
>>> slab for allocating sg list in fast path too.
>>
>> Actually, that's not really true  base kmalloc can do all sorts of
>> things including kick off reclaim so it's not really something we like
>> using in the fast path.  The only fast and safe kmalloc you can rely on
>>  in the fast path is GFP_ATOMIC which will fail quickly if no memory
>> can easily be found.  *However* the sg_table allocation functions are
>> all pool backed (lib/sg_pool.c), so they use the lightweight GFP_ATOMIC
>> mechanism for kmalloc initially coupled with a backing pool in case of
>> failure to ensure forward progress.
>>
>> So, I think you're both right: you shouldn't simply use kmalloc, but
>> this implementation doesn't, it uses the sg_table allocation functions
>> which correctly control kmalloc to be lightweight and efficient and
>> able to make forward progress.
> 
> Another concern is whether this change can cause a livelock. If the system
> is running out of memory and the page cache submits a write request with
> a scatterlist with more than two elements, if the kmalloc() for the
> scatterlist fails, will that prevent the page cache from making any progress
> with writeback?

The mempool prevents that - as long as we have at least one reserved unit
of the given size, we know we'll always make progress (eventually). The
finite lifetimes of the sg pool allocations guarantee that.
James Bottomley April 24, 2019, 3:49 p.m. UTC | #9
On Wed, 2019-04-24 at 08:32 -0700, Bart Van Assche wrote:
> On Wed, 2019-04-24 at 08:24 -0700, James Bottomley wrote:
> > On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
> > > On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> > > > On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > > > >  #define  SCSI_INLINE_PROT_SG_CNT  1
> > > > >  
> > > > > +#define  SCSI_INLINE_SG_CNT  2
> > > > 
> > > > So this patch inserts one kmalloc() and one kfree() call in the
> > > > hot path for every SCSI request with more than two elements in
> > > > its scatterlist? Isn't
> > > 
> > > Slab or its variants are designed for fast path, and NVMe PCI
> > > uses slab for allocating sg list in fast path too.
> > 
> > Actually, that's not really true  base kmalloc can do all sorts of
> > things including kick off reclaim so it's not really something we
> > like using in the fast path.  The only fast and safe kmalloc you
> > can rely on  in the fast path is GFP_ATOMIC which will fail quickly
> > if no memory can easily be found.  *However* the sg_table
> > allocation functions are all pool backed (lib/sg_pool.c), so they
> > use the lightweight GFP_ATOMIC mechanism for kmalloc initially
> > coupled with a backing pool in case of failure to ensure forward
> > progress.
> > 
> > So, I think you're both right: you shouldn't simply use kmalloc,
> > but this implementation doesn't, it uses the sg_table allocation
> > functions which correctly control kmalloc to be lightweight and
> > efficient and able to make forward progress.
> 
> Another concern is whether this change can cause a livelock. If the
> system is running out of memory and the page cache submits a write
> request with a scatterlist with more than two elements, if the
> kmalloc() for the scatterlist fails, will that prevent the page cache
> from making any progress with writeback?

It's pool backed, as I said.  Is the concern there isn't enough depth
in the pools for a large write?

James
Bart Van Assche April 24, 2019, 4:09 p.m. UTC | #10
On Wed, 2019-04-24 at 08:49 -0700, James Bottomley wrote:
> On Wed, 2019-04-24 at 08:32 -0700, Bart Van Assche wrote:
> > Another concern is whether this change can cause a livelock. If the
> > system is running out of memory and the page cache submits a write
> > request with a scatterlist with more than two elements, if the
> > kmalloc() for the scatterlist fails, will that prevent the page cache
> > from making any progress with writeback?
> 
> It's pool backed, as I said.  Is the concern there isn't enough depth
> in the pools for a large write?

That memory pool is used by multiple drivers. Most but not all
sg_alloc_table_chained() calls happen from inside .queue_rq() implementations.
One sg_alloc_table_chained() call occurs in the NFS server code. I'm not sure
whether it is guaranteed that an sg_alloc_table_chained() call will succeed
sooner or later under low-memory conditions. Additionally, new
sg_alloc_table_chained() calls could be added to drivers at any time.

Bart.
James Bottomley April 24, 2019, 4:17 p.m. UTC | #11
On Wed, 2019-04-24 at 09:09 -0700, Bart Van Assche wrote:
> On Wed, 2019-04-24 at 08:49 -0700, James Bottomley wrote:
> > On Wed, 2019-04-24 at 08:32 -0700, Bart Van Assche wrote:
> > > Another concern is whether this change can cause a livelock. If
> > > the system is running out of memory and the page cache submits a
> > > write request with a scatterlist with more than two elements, if
> > > the kmalloc() for the scatterlist fails, will that prevent the
> > > page cache from making any progress with writeback?
> > 
> > It's pool backed, as I said.  Is the concern there isn't enough
> > depth in the pools for a large write?
> 
> That memory pool is used by multiple drivers. Most but not all
> sg_alloc_table_chained() calls happen from inside .queue_rq()
> implementations. One sg_alloc_table_chained() call occurs in the NFS
> server code. I'm not sure whether it is guaranteed that an
> sg_alloc_table_chained() will succeed sooner or later under low
> memory conditions. Additionally, new sg_alloc_table_chained() could
> be added in drivers any time.

The number of users is irrelevant.  All we need is sequential forward
progress to guarantee freedom from memory-allocation-related livelock.
Even if they make write progress one at a time (although the current
pool depth seems to be 2, so they make progress at least two at a
time), memory will be released by the write and reclaim will progress.

The guarantee required is the ability to send, or have outstanding, at
least one write, and that that write will eventually return, releasing
memory back to the pool for another write to proceed.

James
Ming Lei April 25, 2019, 12:45 a.m. UTC | #12
On Wed, Apr 24, 2019 at 04:38:33PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 04:41:46PM +0800, Ming Lei wrote:
> > > This removes the last use of the first_chunk parameter to
> > > sg_free_table_chained; please remove the parameter in an additional
> > > patch.
> > 
> > NVMe FC/RDMA still uses first_chunk.
> 
> Looks like they need a similar treatment then :)

Yeah, I agree.

But the issue could be less serious for NVMe FC/RDMA because the queue
depth is often much smaller.

thanks,
Ming

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index bdcf40851356..4fff95b14c91 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -45,6 +45,8 @@ 
  */
 #define  SCSI_INLINE_PROT_SG_CNT  1
 
+#define  SCSI_INLINE_SG_CNT  2
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -568,10 +570,18 @@  static inline bool scsi_prot_use_inline_sg(struct scsi_cmnd *cmd)
 		(struct scatterlist *)(cmd->prot_sdb + 1);
 }
 
+static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
+{
+	struct scatterlist *sg = (void *)cmd + sizeof(struct scsi_cmnd) +
+		cmd->device->host->hostt->cmd_size;
+
+	return cmd->sdb.table.sgl == sg;
+}
+
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
-	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, true);
+	if (cmd->sdb.table.nents && !scsi_use_inline_sg(cmd))
+		sg_free_table_chained(&cmd->sdb.table, false);
 	if (scsi_prot_sg_count(cmd) && !scsi_prot_use_inline_sg(cmd))
 		sg_free_table_chained(&cmd->prot_sdb->table, false);
 }
@@ -1002,12 +1012,16 @@  static blk_status_t scsi_init_sgtable(struct request *req,
 		struct scsi_data_buffer *sdb)
 {
 	int count;
+	unsigned nr_segs = blk_rq_nr_phys_segments(req);
 
 	/*
 	 * If sg table allocation fails, requeue request later.
 	 */
-	if (unlikely(sg_alloc_table_chained(&sdb->table,
-			blk_rq_nr_phys_segments(req), sdb->table.sgl)))
+	if (nr_segs <= SCSI_INLINE_SG_CNT)
+		sdb->table.nents = sdb->table.orig_nents =
+			SCSI_INLINE_SG_CNT;
+	else if (unlikely(sg_alloc_table_chained(&sdb->table, nr_segs,
+					NULL)))
 		return BLK_STS_RESOURCE;
 
 	/* 
@@ -1574,9 +1588,9 @@  static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 }
 
 /* Size in bytes of the sg-list stored in the scsi-mq command-private data. */
-static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
+static unsigned int scsi_mq_inline_sgl_size(struct Scsi_Host *shost)
 {
-	return min_t(unsigned int, shost->sg_tablesize, SG_CHUNK_SIZE) *
+	return min_t(unsigned int, shost->sg_tablesize, SCSI_INLINE_SG_CNT) *
 		sizeof(struct scatterlist);
 }
 
@@ -1766,7 +1780,7 @@  static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	if (scsi_host_get_prot(shost)) {
 		sg = (void *)cmd + sizeof(struct scsi_cmnd) +
 			shost->hostt->cmd_size;
-		cmd->prot_sdb = (void *)sg + scsi_mq_sgl_size(shost);
+		cmd->prot_sdb = (void *)sg + scsi_mq_inline_sgl_size(shost);
 	}
 
 	return 0;
@@ -1860,7 +1874,7 @@  int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
 
-	sgl_size = scsi_mq_sgl_size(shost);
+	sgl_size = scsi_mq_inline_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
 		cmd_size += sizeof(struct scsi_data_buffer) +