Message ID | 20190423103240.29864-3-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | scsi: core: avoid big pre-allocation for sg list |
On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> #define SCSI_INLINE_PROT_SG_CNT 1
>
> +#define SCSI_INLINE_SG_CNT 2

So this patch inserts one kmalloc() and one kfree() call in the hot path
for every SCSI request with more than two elements in its scatterlist?
Isn't "2" too low? Or in other words, what is the performance impact of
this patch for a real-world workload?

Thanks,

Bart.
On Tue, Apr 23, 2019 at 06:32:40PM +0800, Ming Lei wrote:
> big, the whole pre-allocation for sg list can consume huge memory.
> For example of lpfc, nr_hw_queues can be 70, each queue's depth
> can be 3781, so the pre-allocation for data sg list can be 70*3781*2k
> =517MB for single HBA.

We should probably limit the number of queues to something actually
useful, independent of your patch.

> +static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
> +{
> +	struct scatterlist *sg = (void *)cmd + sizeof(struct scsi_cmnd) +
> +		cmd->device->host->hostt->cmd_size;
> +
> +	return cmd->sdb.table.sgl == sg;
> +}

It might make more sense to have a helper to calculate the inline
sg address and use that for the comparison in scsi_mq_free_sgtables
and any other place that wants the address.

> +	if (cmd->sdb.table.nents && !scsi_use_inline_sg(cmd))
> +		sg_free_table_chained(&cmd->sdb.table, false);

This removes the last use of the first_chunk parameter to
sg_free_table_chained, please remove the parameter in an additional
patch.

> +	if (nr_segs <= SCSI_INLINE_SG_CNT)
> +		sdb->table.nents = sdb->table.orig_nents =
> +			SCSI_INLINE_SG_CNT;

Don't we need a sg_init_table here?

> +	else if (unlikely(sg_alloc_table_chained(&sdb->table, nr_segs,
> +			NULL)))
> 		return BLK_STS_RESOURCE;

We should probably also be able to drop the last parameter to
sg_alloc_table_chained now.
On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > #define SCSI_INLINE_PROT_SG_CNT 1
> >
> > +#define SCSI_INLINE_SG_CNT 2
>
> So this patch inserts one kmalloc() and one kfree() call in the hot path
> for every SCSI request with more than two elements in its scatterlist? Isn't

Slab or its variants are designed for fast path, and NVMe PCI uses
slab for allocating sg list in fast path too.

> "2" too low? Or in other words, what is the performance impact of this patch
> for a real-world workload?

2 has been used by NVMe PCI for a while, and recently it was even
reduced to 1.

I have run a big-BS (256k) fio test on scsi_debug (can_queue: 256,
submit queues: 12, LUNs: 4, delay: 0, io sched: none) and didn't see an
obvious performance difference with this patch.

Thanks,
Ming
On Wed, Apr 24, 2019 at 07:53:17AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 23, 2019 at 06:32:40PM +0800, Ming Lei wrote:
> > big, the whole pre-allocation for sg list can consume huge memory.
> > For example of lpfc, nr_hw_queues can be 70, each queue's depth
> > can be 3781, so the pre-allocation for data sg list can be 70*3781*2k
> > =517MB for single HBA.
>
> We should probably limit the number of queues to something actually
> useful, independent of your patch.
>
> > +static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
> > +{
> > +	struct scatterlist *sg = (void *)cmd + sizeof(struct scsi_cmnd) +
> > +		cmd->device->host->hostt->cmd_size;
> > +
> > +	return cmd->sdb.table.sgl == sg;
> > +}
>
> It might make more sense to have a helper to calculate the inline
> sg address and use that for the comparison in scsi_mq_free_sgtables
> and any other place that wants the address.

Good idea, and the helper can be used in scsi_mq_prep_fn() too.

> > +	if (cmd->sdb.table.nents && !scsi_use_inline_sg(cmd))
> > +		sg_free_table_chained(&cmd->sdb.table, false);
>
> This removes the last use of the first_chunk parameter to
> sg_free_table_chained, please remove the parameter in an additional
> patch.

NVMe FC/RDMA still uses first_chunk.

> > +	if (nr_segs <= SCSI_INLINE_SG_CNT)
> > +		sdb->table.nents = sdb->table.orig_nents =
> > +			SCSI_INLINE_SG_CNT;
>
> Don't we need a sg_init_table here?

Will do it in V2.

> > +	else if (unlikely(sg_alloc_table_chained(&sdb->table, nr_segs,
> > +			NULL)))
> > 		return BLK_STS_RESOURCE;
>
> We should probably also be able to drop the last parameter to
> sg_alloc_table_chained now.

NVMe FC/RDMA still uses first_chunk.

Thanks,
Ming
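A rough sketch of what the helper being discussed might look like, reusing the address calculation already present in the patch. The name scsi_inline_sg() and its exact placement are hypothetical and not taken from the posted series:

static inline struct scatterlist *scsi_inline_sg(struct scsi_cmnd *cmd)
{
	/* the inline sg array sits right after the per-driver command data */
	return (void *)cmd + sizeof(struct scsi_cmnd) +
		cmd->device->host->hostt->cmd_size;
}

static inline bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
{
	return cmd->sdb.table.sgl == scsi_inline_sg(cmd);
}

Wherever the scsi_cmnd pointer is at hand (for example in scsi_mq_prep_fn(), as mentioned above), the inline branch could then set up the table along these lines, which would also address the sg_init_table question; again, this is only a sketch of one possible shape for V2:

	sdb->table.sgl = scsi_inline_sg(cmd);
	sg_init_table(sdb->table.sgl, SCSI_INLINE_SG_CNT);
	sdb->table.nents = sdb->table.orig_nents = SCSI_INLINE_SG_CNT;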
On Wed, Apr 24, 2019 at 04:41:46PM +0800, Ming Lei wrote:
> > This removes the last use of the first_chunk parameter to
> > sg_free_table_chained, please remove the parameter in an additional
> > patch.
>
> NVMe FC/RDMA still uses first_chunk.

Looks like they need a similar treatment then :)
On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
> On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> > On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > > #define SCSI_INLINE_PROT_SG_CNT 1
> > >
> > > +#define SCSI_INLINE_SG_CNT 2
> >
> > So this patch inserts one kmalloc() and one kfree() call in the hot
> > path for every SCSI request with more than two elements in its
> > scatterlist? Isn't
>
> Slab or its variants are designed for fast path, and NVMe PCI uses
> slab for allocating sg list in fast path too.

Actually, that's not really true: base kmalloc can do all sorts of
things including kicking off reclaim so it's not really something we like
using in the fast path. The only fast and safe kmalloc you can rely on
in the fast path is GFP_ATOMIC which will fail quickly if no memory
can easily be found. *However* the sg_table allocation functions are
all pool backed (lib/sg_pool.c), so they use the lightweight GFP_ATOMIC
mechanism for kmalloc initially coupled with a backing pool in case of
failure to ensure forward progress.

So, I think you're both right: you shouldn't simply use kmalloc, but
this implementation doesn't, it uses the sg_table allocation functions
which correctly control kmalloc to be lightweight and efficient and
able to make forward progress.

James
On Wed, 2019-04-24 at 08:24 -0700, James Bottomley wrote:
> On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
> > On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> > > On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > > > #define SCSI_INLINE_PROT_SG_CNT 1
> > > >
> > > > +#define SCSI_INLINE_SG_CNT 2
> > >
> > > So this patch inserts one kmalloc() and one kfree() call in the hot
> > > path for every SCSI request with more than two elements in its
> > > scatterlist? Isn't
> >
> > Slab or its variants are designed for fast path, and NVMe PCI uses
> > slab for allocating sg list in fast path too.
>
> Actually, that's not really true: base kmalloc can do all sorts of
> things including kicking off reclaim so it's not really something we like
> using in the fast path. The only fast and safe kmalloc you can rely on
> in the fast path is GFP_ATOMIC which will fail quickly if no memory
> can easily be found. *However* the sg_table allocation functions are
> all pool backed (lib/sg_pool.c), so they use the lightweight GFP_ATOMIC
> mechanism for kmalloc initially coupled with a backing pool in case of
> failure to ensure forward progress.
>
> So, I think you're both right: you shouldn't simply use kmalloc, but
> this implementation doesn't, it uses the sg_table allocation functions
> which correctly control kmalloc to be lightweight and efficient and
> able to make forward progress.

Another concern is whether this change can cause a livelock. If the system
is running out of memory and the page cache submits a write request with
a scatterlist with more than two elements, if the kmalloc() for the
scatterlist fails, will that prevent the page cache from making any progress
with writeback?

Bart.
On 4/24/19 9:32 AM, Bart Van Assche wrote:
> On Wed, 2019-04-24 at 08:24 -0700, James Bottomley wrote:
>> On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
>>> On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
>>>> On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
>>>>> #define SCSI_INLINE_PROT_SG_CNT 1
>>>>>
>>>>> +#define SCSI_INLINE_SG_CNT 2
>>>>
>>>> So this patch inserts one kmalloc() and one kfree() call in the hot
>>>> path for every SCSI request with more than two elements in its
>>>> scatterlist? Isn't
>>>
>>> Slab or its variants are designed for fast path, and NVMe PCI uses
>>> slab for allocating sg list in fast path too.
>>
>> Actually, that's not really true: base kmalloc can do all sorts of
>> things including kicking off reclaim so it's not really something we like
>> using in the fast path. The only fast and safe kmalloc you can rely on
>> in the fast path is GFP_ATOMIC which will fail quickly if no memory
>> can easily be found. *However* the sg_table allocation functions are
>> all pool backed (lib/sg_pool.c), so they use the lightweight GFP_ATOMIC
>> mechanism for kmalloc initially coupled with a backing pool in case of
>> failure to ensure forward progress.
>>
>> So, I think you're both right: you shouldn't simply use kmalloc, but
>> this implementation doesn't, it uses the sg_table allocation functions
>> which correctly control kmalloc to be lightweight and efficient and
>> able to make forward progress.
>
> Another concern is whether this change can cause a livelock. If the system
> is running out of memory and the page cache submits a write request with
> a scatterlist with more than two elements, if the kmalloc() for the
> scatterlist fails, will that prevent the page cache from making any progress
> with writeback?

The mempool prevents that - as long as we have at least one reserved
unit of the given size, we know we'll always make progress (eventually).
The finite life times of the sg pool allocations guarantee that.
On Wed, 2019-04-24 at 08:32 -0700, Bart Van Assche wrote:
> On Wed, 2019-04-24 at 08:24 -0700, James Bottomley wrote:
> > On Wed, 2019-04-24 at 15:52 +0800, Ming Lei wrote:
> > > On Tue, Apr 23, 2019 at 08:37:15AM -0700, Bart Van Assche wrote:
> > > > On Tue, 2019-04-23 at 18:32 +0800, Ming Lei wrote:
> > > > > #define SCSI_INLINE_PROT_SG_CNT 1
> > > > >
> > > > > +#define SCSI_INLINE_SG_CNT 2
> > > >
> > > > So this patch inserts one kmalloc() and one kfree() call in the
> > > > hot path for every SCSI request with more than two elements in
> > > > its scatterlist? Isn't
> > >
> > > Slab or its variants are designed for fast path, and NVMe PCI
> > > uses slab for allocating sg list in fast path too.
> >
> > Actually, that's not really true: base kmalloc can do all sorts of
> > things including kicking off reclaim so it's not really something we
> > like using in the fast path. The only fast and safe kmalloc you
> > can rely on in the fast path is GFP_ATOMIC which will fail quickly
> > if no memory can easily be found. *However* the sg_table
> > allocation functions are all pool backed (lib/sg_pool.c), so they
> > use the lightweight GFP_ATOMIC mechanism for kmalloc initially
> > coupled with a backing pool in case of failure to ensure forward
> > progress.
> >
> > So, I think you're both right: you shouldn't simply use kmalloc,
> > but this implementation doesn't, it uses the sg_table allocation
> > functions which correctly control kmalloc to be lightweight and
> > efficient and able to make forward progress.
>
> Another concern is whether this change can cause a livelock. If the
> system is running out of memory and the page cache submits a write
> request with a scatterlist with more than two elements, if the
> kmalloc() for the scatterlist fails, will that prevent the page cache
> from making any progress with writeback?

It's pool backed, as I said. Is the concern there isn't enough depth
in the pools for a large write?

James
On Wed, 2019-04-24 at 08:49 -0700, James Bottomley wrote:
> On Wed, 2019-04-24 at 08:32 -0700, Bart Van Assche wrote:
> > Another concern is whether this change can cause a livelock. If the
> > system is running out of memory and the page cache submits a write
> > request with a scatterlist with more than two elements, if the
> > kmalloc() for the scatterlist fails, will that prevent the page cache
> > from making any progress with writeback?
>
> It's pool backed, as I said. Is the concern there isn't enough depth
> in the pools for a large write?

That memory pool is used by multiple drivers. Most but not all
sg_alloc_table_chained() calls happen from inside .queue_rq()
implementations. One sg_alloc_table_chained() call occurs in the NFS
server code. I'm not sure whether it is guaranteed that an
sg_alloc_table_chained() will succeed sooner or later under low
memory conditions. Additionally, new sg_alloc_table_chained() calls could
be added in drivers any time.

Bart.
On Wed, 2019-04-24 at 09:09 -0700, Bart Van Assche wrote:
> On Wed, 2019-04-24 at 08:49 -0700, James Bottomley wrote:
> > On Wed, 2019-04-24 at 08:32 -0700, Bart Van Assche wrote:
> > > Another concern is whether this change can cause a livelock. If
> > > the system is running out of memory and the page cache submits a
> > > write request with a scatterlist with more than two elements, if
> > > the kmalloc() for the scatterlist fails, will that prevent the
> > > page cache from making any progress with writeback?
> >
> > It's pool backed, as I said. Is the concern there isn't enough
> > depth in the pools for a large write?
>
> That memory pool is used by multiple drivers. Most but not all
> sg_alloc_table_chained() calls happen from inside .queue_rq()
> implementations. One sg_alloc_table_chained() call occurs in the NFS
> server code. I'm not sure whether it is guaranteed that an
> sg_alloc_table_chained() will succeed sooner or later under low
> memory conditions. Additionally, new sg_alloc_table_chained() calls
> could be added in drivers any time.

The number of users is irrelevant. All we need is sequential forward
progress to guarantee freedom from memory allocation related livelock.
Even if they make write progress one at a time (although the current
pool depth seems to be 2, so they make progress at least two at a
time), memory will be released by the write and reclaim will progress.
The guarantee required is the ability to send or have outstanding at
least one write, and also that that write will return eventually,
releasing memory back to the pool for another write to proceed.

James
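For readers unfamiliar with the mempool idea behind lib/sg_pool.c, the forward-progress scheme described above (try a fast, failable allocation first, then fall back to a small reserved pool) can be sketched with a self-contained userspace toy. This is only an illustration of the concept, not the kernel implementation; all names and sizes here are made up:

/* Toy model of pool-backed allocation: the fast attempt may fail, but a
 * small reserved pool guarantees that at least one request can always get
 * memory, so the system keeps making forward progress. */
#include <stdio.h>
#include <stdlib.h>

#define POOL_RESERVE 2		/* reserved elements, like a mempool min_nr */
#define CHUNK_SIZE   4096

static void *reserved[POOL_RESERVE];
static int reserved_free;

static void *chunk_alloc(void)
{
	void *p = malloc(CHUNK_SIZE);	/* "GFP_ATOMIC-like" attempt: may fail */

	if (p)
		return p;
	if (reserved_free > 0)		/* fall back to the reserved pool */
		return reserved[--reserved_free];
	return NULL;			/* caller requeues and retries later */
}

static void chunk_free(void *p)
{
	if (!p)
		return;
	if (reserved_free < POOL_RESERVE)
		reserved[reserved_free++] = p;	/* refill the reserve first */
	else
		free(p);
}

int main(void)
{
	for (reserved_free = 0; reserved_free < POOL_RESERVE; )
		reserved[reserved_free++] = malloc(CHUNK_SIZE);

	void *c = chunk_alloc();
	printf("allocation %s\n", c ? "succeeded" : "failed, requeue");
	chunk_free(c);
	return 0;
}

Because freed chunks refill the reserve before going back to the general allocator, an in-flight write that completes always replenishes the pool, which is the property the discussion above relies on.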
On Wed, Apr 24, 2019 at 04:38:33PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 04:41:46PM +0800, Ming Lei wrote:
> > > This removes the last use of the first_chunk parameter to
> > > sg_free_table_chained, please remove the parameter in an additional
> > > patch.
> >
> > NVMe FC/RDMA still uses first_chunk.
>
> Looks like they need a similar treatment then :)

Yeah, I agree. But the issue could be less serious for NVMe FC/RDMA
because their queue depth is often much smaller.

thanks,
Ming
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index bdcf40851356..4fff95b14c91 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -45,6 +45,8 @@
  */
 #define SCSI_INLINE_PROT_SG_CNT 1
 
+#define SCSI_INLINE_SG_CNT 2
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -568,10 +570,18 @@ static inline bool scsi_prot_use_inline_sg(struct scsi_cmnd *cmd)
 		(struct scatterlist *)(cmd->prot_sdb + 1);
 }
 
+static bool scsi_use_inline_sg(struct scsi_cmnd *cmd)
+{
+	struct scatterlist *sg = (void *)cmd + sizeof(struct scsi_cmnd) +
+		cmd->device->host->hostt->cmd_size;
+
+	return cmd->sdb.table.sgl == sg;
+}
+
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
-	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, true);
+	if (cmd->sdb.table.nents && !scsi_use_inline_sg(cmd))
+		sg_free_table_chained(&cmd->sdb.table, false);
 	if (scsi_prot_sg_count(cmd) && !scsi_prot_use_inline_sg(cmd))
 		sg_free_table_chained(&cmd->prot_sdb->table, false);
 }
@@ -1002,12 +1012,16 @@ static blk_status_t scsi_init_sgtable(struct request *req,
 		struct scsi_data_buffer *sdb)
 {
 	int count;
+	unsigned nr_segs = blk_rq_nr_phys_segments(req);
 
 	/*
 	 * If sg table allocation fails, requeue request later.
 	 */
-	if (unlikely(sg_alloc_table_chained(&sdb->table,
-			blk_rq_nr_phys_segments(req), sdb->table.sgl)))
+	if (nr_segs <= SCSI_INLINE_SG_CNT)
+		sdb->table.nents = sdb->table.orig_nents =
+			SCSI_INLINE_SG_CNT;
+	else if (unlikely(sg_alloc_table_chained(&sdb->table, nr_segs,
+			NULL)))
 		return BLK_STS_RESOURCE;
 
 	/*
@@ -1574,9 +1588,9 @@ static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 }
 
 /* Size in bytes of the sg-list stored in the scsi-mq command-private data. */
-static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
+static unsigned int scsi_mq_inline_sgl_size(struct Scsi_Host *shost)
 {
-	return min_t(unsigned int, shost->sg_tablesize, SG_CHUNK_SIZE) *
+	return min_t(unsigned int, shost->sg_tablesize, SCSI_INLINE_SG_CNT) *
 		sizeof(struct scatterlist);
 }
 
@@ -1766,7 +1780,7 @@ static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	if (scsi_host_get_prot(shost)) {
 		sg = (void *)cmd + sizeof(struct scsi_cmnd) +
 			shost->hostt->cmd_size;
-		cmd->prot_sdb = (void *)sg + scsi_mq_sgl_size(shost);
+		cmd->prot_sdb = (void *)sg + scsi_mq_inline_sgl_size(shost);
 	}
 
 	return 0;
@@ -1860,7 +1874,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
 
-	sgl_size = scsi_mq_sgl_size(shost);
+	sgl_size = scsi_mq_inline_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
 		cmd_size += sizeof(struct scsi_data_buffer) +
Currently scsi_mq_setup_tags() pre-allocates a big buffer for the IO sg list, and the buffer size is scsi_mq_sgl_size(), which depends on the smaller of shost->sg_tablesize and SG_CHUNK_SIZE. Modern HBAs can often handle a very large number of DMA segments, so scsi_mq_sgl_size() is often big. If the maximum of SG_CHUNK_SIZE segments is taken, scsi_mq_sgl_size() becomes 4KB per command.

If one HBA has lots of queues and each hw queue's depth is big, the whole pre-allocation for the sg list can consume a huge amount of memory. For example, for lpfc, nr_hw_queues can be 70 and each queue's depth can be 3781, so the pre-allocation for the data sg list can be 70*3781*2k = 517MB for a single HBA.

There is also a Red Hat internal report that scsi_debug based tests can no longer be run because the pre-allocation became too big once the legacy io path was removed.

This patch switches to runtime allocation for the sg list and pre-allocates only 2 inline sg entries. This approach has been used by NVMe for a while, so it should be fine for SCSI too.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)
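As a back-of-the-envelope check of the numbers above (assuming 32 bytes per struct scatterlist entry, the usual size on 64-bit builds without CONFIG_DEBUG_SG, and the 2k per-command sg buffer quoted for lpfc):

  pre-allocated sg lists: 70 queues * 3781 tags * 2k       ~= 517 MB per HBA
  2 inline entries only:  70 queues * 3781 tags * 2 * 32 B ~=  16 MB per HBA

plus whatever the sg pools allocate at runtime for commands actually in flight.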