
[V4,3/3] scsi: core: avoid pre-allocating big chunk for sg list

Message ID 20190428073932.9898-4-ming.lei@redhat.com (mailing list archive)
State New, archived
Series scsi: core: avoid big pre-allocation for sg list

Commit Message

Ming Lei April 28, 2019, 7:39 a.m. UTC
Currently scsi_mq_setup_tags() pre-allocates a big buffer for the IO sg
list, and the buffer size is scsi_mq_sgl_size(), which is determined by
the smaller of shost->sg_tablesize and SG_CHUNK_SIZE.

Modern HBA DMA engines are often capable of handling very large segment
counts, so scsi_mq_sgl_size() is often big. If the maximum of
SG_CHUNK_SIZE entries is used, scsi_mq_sgl_size() is 4KB.

If one HBA has lots of queues and each hw queue's depth is high, the sg
list pre-allocation can consume a huge amount of memory. For lpfc, for
example, nr_hw_queues can be 70 and each queue's depth can be 3781, so
the pre-allocation for the data sg list is 70*3781*2k = 517MB for a
single HBA.

There is a Red Hat internal report that scsi_debug based tests can no
longer be run because the pre-allocation is too big, now that the
legacy io path has been removed.

So switch to runtime allocation for the sg list, while pre-allocating 2
inline sg entries. This approach has been used by NVMe PCI for a while,
so it should be fine for SCSI too. Runtime sg entry allocation was also
always used, and thereby verified, in the original legacy io path.

No performance effect was observed in my big BS test on scsi_debug.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
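
As a quick sanity check of the figures in the commit message, here is a
stand-alone user-space sketch (not kernel code). It assumes 32 bytes per
struct scatterlist on 64-bit and that the lpfc 2k-per-command figure
corresponds to sg_tablesize = 64; both are assumptions, not taken from
the patch itself.

#include <stdio.h>

#define SG_ENTRY_SIZE	32u	/* assumed sizeof(struct scatterlist) */
#define SG_CHUNK_SIZE	128u	/* kernel default pool chunk size */

int main(void)
{
	unsigned int sg_tablesize = 64;		/* assumed lpfc value */
	unsigned int nr_hw_queues = 70;		/* from the commit message */
	unsigned int queue_depth  = 3781;	/* from the commit message */

	/* old scheme: min(sg_tablesize, SG_CHUNK_SIZE) entries per command */
	unsigned int per_cmd_old = (sg_tablesize < SG_CHUNK_SIZE ?
				    sg_tablesize : SG_CHUNK_SIZE) * SG_ENTRY_SIZE;
	/* new scheme: only SCSI_INLINE_SG_CNT (2) entries per command */
	unsigned int per_cmd_new = 2 * SG_ENTRY_SIZE;

	unsigned long long old_total =
		(unsigned long long)nr_hw_queues * queue_depth * per_cmd_old;
	unsigned long long new_total =
		(unsigned long long)nr_hw_queues * queue_depth * per_cmd_new;

	printf("per-command SGL: %u -> %u bytes\n", per_cmd_old, per_cmd_new);
	printf("per-HBA pre-allocation: %llu MB -> %llu MB\n",
	       old_total >> 20, new_total >> 20);
	return 0;
}

Under those assumptions this prints 2048 -> 64 bytes per command and
roughly 516 MB -> 16 MB per HBA, matching the 517MB estimate above.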

Comments

Bart Van Assche April 29, 2019, 6:16 p.m. UTC | #1
On Sun, 2019-04-28 at 15:39 +0800, Ming Lei wrote:
> [ ... ]

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Guenter Roeck June 3, 2019, 8:44 p.m. UTC | #2
On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> [ ... ]
> 

This patch causes a variety of boot failures in -next. Typical failure
pattern is scsi hangs or failure to find a root file system. For example,
on alpha, trying to boot from usb:

scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
sd 0:0:0:0: Power-on or device reset occurred
sd 0:0:0:0: [sda] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] Attached SCSI disk
usb 1-1: reset full-speed USB device number 2 using ohci-pci
sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 00 00 58 00 00 42 00
print_req_error: I/O error, dev sda, sector 88 flags 80000
EXT4-fs error (device sda): __ext4_get_inode_loc:4703: inode #2: block 44: comm
swapper: unable to read itable block
EXT4-fs (sda): get root inode failed
EXT4-fs (sda): mount failed
VFS: Cannot open root device "sda" or unknown-block(8,0): error -5

Similar problems are seen with other architectures.
Reverting the patch fixes the problem. Bisect log attached.

Guenter

---
# bad: [ae3cad8f39ccf8d31775d9737488bccf0e44d370] Add linux-next specific files for 20190603
# good: [f2c7c76c5d0a443053e94adb9f0918fa2fb85c3a] Linux 5.2-rc3
git bisect start 'HEAD' 'v5.2-rc3'
# good: [8ff6f4c6e067a9d3f3bbacf02c4ea5eb81fe2c6a] Merge remote-tracking branch 'crypto/master'
git bisect good 8ff6f4c6e067a9d3f3bbacf02c4ea5eb81fe2c6a
# good: [6c93755861ce6a6dd904df9cdae9f08671132dbe] Merge remote-tracking branch 'iommu/next'
git bisect good 6c93755861ce6a6dd904df9cdae9f08671132dbe
# good: [1a567956cb3be5754d94ce9380a2151e57e204a7] Merge remote-tracking branch 'cgroup/for-next'
git bisect good 1a567956cb3be5754d94ce9380a2151e57e204a7
# bad: [a6878ca73cf30b83efbdfb1ecc443d7cfb2d8193] Merge remote-tracking branch 'rtc/rtc-next'
git bisect bad a6878ca73cf30b83efbdfb1ecc443d7cfb2d8193
# bad: [25e7f57cf9f86076704b543628a0f02d3d733726] Merge remote-tracking branch 'gpio/for-next'
git bisect bad 25e7f57cf9f86076704b543628a0f02d3d733726
# bad: [28e85db94534c92fa00d787264e3ea1843d1dc42] scsi: lpfc: Fix nvmet handling of received ABTS for unmapped frames
git bisect bad 28e85db94534c92fa00d787264e3ea1843d1dc42
# bad: [cf9eddf616bb03387e597629142b0be34111e8d0] scsi: hpsa: check for tag collision
git bisect bad cf9eddf616bb03387e597629142b0be34111e8d0
# good: [4e74166c52a836f05d4bd8270835703908b34d3e] scsi: libsas: switch sas_ata.[ch] to SPDX tags
git bisect good 4e74166c52a836f05d4bd8270835703908b34d3e
# good: [f186090846c29fd9760917fb3d01f095c39262e0] scsi: lib/sg_pool.c: improve APIs for allocating sg pool
git bisect good f186090846c29fd9760917fb3d01f095c39262e0
# bad: [12b6b55806928690359bb21de3a19199412289fd] scsi: sd: Inline sd_probe_part2()
git bisect bad 12b6b55806928690359bb21de3a19199412289fd
# bad: [c3288dd8c2328a74cb2157dff18d13f6a75cedd2] scsi: core: avoid pre-allocating big SGL for data
git bisect bad c3288dd8c2328a74cb2157dff18d13f6a75cedd2
# good: [0f0e744eae6c8af361d227d3a2286973351e5a2a] scsi: core: avoid pre-allocating big SGL for protection information
git bisect good 0f0e744eae6c8af361d227d3a2286973351e5a2a
# first bad commit: [c3288dd8c2328a74cb2157dff18d13f6a75cedd2] scsi: core: avoid pre-allocating big SGL for data
Ming Lei June 4, 2019, 1 a.m. UTC | #3
On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > [ ... ]
> > 
> 
> This patch causes a variety of boot failures in -next. Typical failure
> pattern is scsi hangs or failure to find a root file system. For example,
> on alpha, trying to boot from usb:

I guess it is because alpha doesn't support sg chaining, and
CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
can only be arm, alpha and parisc.

Please test the following patch and see if it makes a difference:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6e81258471fa..9ef632963740 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -44,9 +44,13 @@
  * Size of integrity metadata is usually small, 1 inline sg should
  * cover normal cases.
  */
+#ifndef CONFIG_ARCH_NO_SG_CHAIN
 #define  SCSI_INLINE_PROT_SG_CNT  1
-
 #define  SCSI_INLINE_SG_CNT  2
+#else
+#define  SCSI_INLINE_PROT_SG_CNT  0
+#define  SCSI_INLINE_SG_CNT  0
+#endif
 
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
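
The reasoning for forcing both inline counts to 0 here: with sg chaining
available, a small inline chunk can be extended by chaining pool chunks
onto it, but on CONFIG_ARCH_NO_SG_CHAIN the table has to fit in a single
allocation, so a 2-entry inline chunk would cap every request at 2
segments. A toy user-space model of that constraint (just a sketch of my
understanding, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

#define SG_CHUNK_SIZE	128u	/* kernel default pool chunk size */

/* rough model: can a request with 'nents' segments get an sg table? */
static bool can_map(unsigned int nents, unsigned int inline_cnt,
		    bool arch_has_sg_chain)
{
	if (inline_cnt && nents <= inline_cnt)
		return true;		/* fits in the inline entries */
	if (arch_has_sg_chain)
		return true;		/* chain pool chunks onto the inline ones */
	/*
	 * No chaining: the inline chunk cannot be extended, so only a
	 * single flat pool allocation works, i.e. inline_cnt must be 0.
	 */
	return inline_cnt == 0 && nents <= SG_CHUNK_SIZE;
}

int main(void)
{
	printf("chaining,    2 inline, 33 segments: %d\n", can_map(33, 2, true));
	printf("no chaining, 2 inline, 33 segments: %d\n", can_map(33, 2, false));
	printf("no chaining, 0 inline, 33 segments: %d\n", can_map(33, 0, false));
	return 0;
}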


Thanks,
Ming
Guenter Roeck June 4, 2019, 3:49 a.m. UTC | #4
On 6/3/19 6:00 PM, Ming Lei wrote:
> On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
>> On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
>>> [ ... ]
>>>
>>
>> This patch causes a variety of boot failures in -next. Typical failure
>> pattern is scsi hangs or failure to find a root file system. For example,
>> on alpha, trying to boot from usb:
> 
> I guess it is because alpha doesn't support sg chaining, and
> CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> can only be arm, alpha and parisc.
> 

I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
sparc, ppc, and m68k as well, and possibly others (I didn't check all because
-next is in terrible shape right now). Error log is always a bit different
but similar.

On sparc:

scsi host0: Data transfer overflow.
scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
scsi host0: DMA length is zero!
scsi host0: cur adr[f000f000] len[00000000]
scsi host0: Data transfer overflow.
scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
scsi host0: DMA length is zero!

On ppc:

scsi host0: DMA length is zero!
scsi host0: cur adr[0fd21000] len[00000000]
scsi host0: Aborting command [(ptrval):28]
scsi host0: Current command [(ptrval):28]
scsi host0:  Active command [(ptrval):28]

On x86, x86_64 (after reverting a different crash-causing patch):

[   20.226809] scsi host0: DMA length is zero!
[   20.227459] scsi host0: cur adr[00000000] len[00000000]
[   50.588814] scsi host0: Aborting command [(____ptrval____):28]
[   50.589210] scsi host0: Current command [(____ptrval____):28]
[   50.589447] scsi host0:  Active command [(____ptrval____):28]
[   50.589674] scsi host0: Dumping command log

On m68k there is a crash.

Unable to handle kernel NULL pointer dereference at virtual address (ptrval)
Oops: 00000000
Modules linked in:
PC: [<00203a9e>] esp_maybe_execute_command+0x31e/0x46c
SR: 2704  SP: (ptrval)  a2: 07c1ea20
d0: 00000002    d1: 00000400    d2: 00000000    d3: 00002000
d4: 00000000    d5: 00000000    a0: 07db753c    a1: 00000000
Process kworker/0:0H (pid: 4, task=(ptrval))
Frame format=7 eff addr=00000020 ssw=0505 faddr=00000020
wb 1 stat/addr/data: 0000 00000000 00000000
wb 2 stat/addr/data: 0000 00000000 00000000
wb 3 stat/addr/data: 0000 00000020 00000000
push data: 00000000 00000000 00000000 00000000
Stack from 07c2be08:
...
Call Trace: [<00002000>] _start+0x0/0x8
  [<001f65ca>] scsi_mq_done+0x0/0x2c
  [<001887fe>] blk_mq_get_driver_tag+0x0/0xba
  [<00188800>] blk_mq_get_driver_tag+0x2/0xba
  [<00025aec>] create_worker+0x0/0x14e
  [<00203f64>] esp_queuecommand+0x9c/0xa2
  [<00203f3a>] esp_queuecommand+0x72/0xa2
  [<001f7fd4>] scsi_queue_rq+0x54e/0x5ba
  [<00188b4a>] blk_mq_dispatch_rq_list+0x292/0x38a
  [<00188656>] blk_mq_dequeue_from_ctx+0x0/0x1a8
  [<001888b8>] blk_mq_dispatch_rq_list+0x0/0x38a
  [<00025aec>] create_worker+0x0/0x14e
...

All those problems are fixed by reverting this patch.

Guenter
Ming Lei June 4, 2019, 4:10 a.m. UTC | #5
On Mon, Jun 03, 2019 at 08:49:10PM -0700, Guenter Roeck wrote:
> On 6/3/19 6:00 PM, Ming Lei wrote:
> > On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> > > On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > > > [ ... ]
> > > > 
> > > 
> > > This patch causes a variety of boot failures in -next. Typical failure
> > > pattern is scsi hangs or failure to find a root file system. For example,
> > > on alpha, trying to boot from usb:
> > 
> > I guess it is because alpha doesn't support sg chaining, and
> > CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> > can only be arm, alpha and parisc.
> > 
> 
> I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
> sparc, ppc, and m68k as well, and possibly others (I didn't check all because
> -next is in terrible shape right now). Error log is always a bit different
> but similar.
> 
> On sparc:
> 
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> scsi host0: cur adr[f000f000] len[00000000]
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> 
> On ppc:
> 
> scsi host0: DMA length is zero!
> scsi host0: cur adr[0fd21000] len[00000000]
> scsi host0: Aborting command [(ptrval):28]
> scsi host0: Current command [(ptrval):28]
> scsi host0:  Active command [(ptrval):28]
> 
> On x86, x86_64 (after reverting a different crash-causing patch):
> 
> [   20.226809] scsi host0: DMA length is zero!
> [   20.227459] scsi host0: cur adr[00000000] len[00000000]
> [   50.588814] scsi host0: Aborting command [(____ptrval____):28]
> [   50.589210] scsi host0: Current command [(____ptrval____):28]
> [   50.589447] scsi host0:  Active command [(____ptrval____):28]
> [   50.589674] scsi host0: Dumping command log

OK, I did see one boot crash issue on x86_64 with -next, so could
you share the patch that needs to be reverted? In the meantime, please
provide your steps for reproducing this issue (rootfs image, kernel
config, qemu command).

BTW, the patch has been tested in the RH QE lab; no such reports have
been seen so far.

Thanks,
Ming
Ming Lei June 4, 2019, 6:55 a.m. UTC | #6
On Mon, Jun 03, 2019 at 08:49:10PM -0700, Guenter Roeck wrote:
> On 6/3/19 6:00 PM, Ming Lei wrote:
> > On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> > > On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > > > [ ... ]
> > > > 
> > > 
> > > This patch causes a variety of boot failures in -next. Typical failure
> > > pattern is scsi hangs or failure to find a root file system. For example,
> > > on alpha, trying to boot from usb:
> > 
> > I guess it is because alpha doesn't support sg chaining, and
> > CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> > can only be arm, alpha and parisc.
> > 
> 
> I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
> sparc, ppc, and m68k as well, and possibly others (I didn't check all because
> -next is in terrible shape right now). Error log is always a bit different
> but similar.
> 
> On sparc:
> 
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> scsi host0: cur adr[f000f000] len[00000000]
> scsi host0: Data transfer overflow.
> scsi host0: cur_residue[0] tot_residue[-181604017] len[8192]
> scsi host0: DMA length is zero!
> 
> On ppc:
> 
> scsi host0: DMA length is zero!
> scsi host0: cur adr[0fd21000] len[00000000]
> scsi host0: Aborting command [(ptrval):28]
> scsi host0: Current command [(ptrval):28]
> scsi host0:  Active command [(ptrval):28]
> 
> On x86, x86_64 (after reverting a different crash-causing patch):
> 
> [   20.226809] scsi host0: DMA length is zero!
> [   20.227459] scsi host0: cur adr[00000000] len[00000000]
> [   50.588814] scsi host0: Aborting command [(____ptrval____):28]
> [   50.589210] scsi host0: Current command [(____ptrval____):28]
> [   50.589447] scsi host0:  Active command [(____ptrval____):28]
> [   50.589674] scsi host0: Dumping command log
> 
> On m68k there is a crash.
> 
> Unable to handle kernel NULL pointer dereference at virtual address (ptrval)
> Oops: 00000000
> Modules linked in:
> PC: [<00203a9e>] esp_maybe_execute_command+0x31e/0x46c
> SR: 2704  SP: (ptrval)  a2: 07c1ea20
> d0: 00000002    d1: 00000400    d2: 00000000    d3: 00002000
> d4: 00000000    d5: 00000000    a0: 07db753c    a1: 00000000
> Process kworker/0:0H (pid: 4, task=(ptrval))
> Frame format=7 eff addr=00000020 ssw=0505 faddr=00000020
> wb 1 stat/addr/data: 0000 00000000 00000000
> wb 2 stat/addr/data: 0000 00000000 00000000
> wb 3 stat/addr/data: 0000 00000020 00000000
> push data: 00000000 00000000 00000000 00000000
> Stack from 07c2be08:
> ...
> Call Trace: [<00002000>] _start+0x0/0x8
>  [<001f65ca>] scsi_mq_done+0x0/0x2c
>  [<001887fe>] blk_mq_get_driver_tag+0x0/0xba
>  [<00188800>] blk_mq_get_driver_tag+0x2/0xba
>  [<00025aec>] create_worker+0x0/0x14e
>  [<00203f64>] esp_queuecommand+0x9c/0xa2
>  [<00203f3a>] esp_queuecommand+0x72/0xa2
>  [<001f7fd4>] scsi_queue_rq+0x54e/0x5ba
>  [<00188b4a>] blk_mq_dispatch_rq_list+0x292/0x38a
>  [<00188656>] blk_mq_dequeue_from_ctx+0x0/0x1a8
>  [<001888b8>] blk_mq_dispatch_rq_list+0x0/0x38a
>  [<00025aec>] create_worker+0x0/0x14e
> ...
> 
> All those problems are fixed by reverting this patch.

All of the above problems are esp_scsi specific, and esp_scsi does not
support sg chaining.

I will try to figure out one patch to cover them all.

esp_map_dma():
        if (esp->flags & ESP_FLAG_NO_DMA_MAP) {
                /*
                 * For pseudo DMA and PIO we need the virtual address instead of
                 * a dma address, so perform an identity mapping.
                 */
                spriv->num_sg = scsi_sg_count(cmd);
                for (i = 0; i < spriv->num_sg; i++) {
                        sg[i].dma_address = (uintptr_t)sg_virt(&sg[i]);
                        total += sg_dma_len(&sg[i]);
                }
        } else {
                spriv->num_sg = scsi_dma_map(cmd);
                for (i = 0; i < spriv->num_sg; i++)
                        total += sg_dma_len(&sg[i]);
        }

esp_advance_dma():
	p->cur_sg++;

esp_msgin_process():
	spriv->cur_sg--;
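
To illustrate the mismatch, a toy user-space sketch (not the kernel or
esp code): esp walks the scatterlist with plain array/pointer arithmetic
(sg[i], cur_sg++), which only works while all entries sit in one
contiguous array. A chained table spreads the entries over several
chunks joined by link entries, which have to be followed sg_next()-style;
walking a chained table by index lands on the link entry, whose length
is 0, which would explain the "DMA length is zero!" messages above.

#include <stdio.h>

struct toy_sg {
	struct toy_sg *chain;	/* points to the next chunk on a link entry */
	unsigned int len;	/* 0 on a link entry */
	int is_last;		/* models the kernel's end-of-list marker */
};

static struct toy_sg *toy_sg_next(struct toy_sg *sg)
{
	if (sg->is_last)
		return NULL;
	sg++;
	return sg->chain ? sg->chain : sg;	/* hop over the link entry */
}

int main(void)
{
	struct toy_sg chunk2[2] = { { .len = 512 }, { .len = 1024, .is_last = 1 } };
	struct toy_sg chunk1[3] = { { .len = 4096 }, { .len = 4096 },
				    { .chain = chunk2 } };	/* link entry */
	struct toy_sg *sg;
	unsigned int i;

	/* esp-style walk: indexing hits the zero-length link entry */
	for (i = 0; i < 3; i++)
		printf("array walk   entry %u: len %u\n", i, chunk1[i].len);

	/* chain-aware walk reaches all four data entries */
	for (sg = chunk1, i = 0; sg; sg = toy_sg_next(sg), i++)
		printf("chained walk entry %u: len %u\n", i, sg->len);
	return 0;
}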

Thanks,
Ming
Guenter Roeck June 4, 2019, 2:51 p.m. UTC | #7
On Tue, Jun 04, 2019 at 12:10:00PM +0800, Ming Lei wrote:
> On Mon, Jun 03, 2019 at 08:49:10PM -0700, Guenter Roeck wrote:
> > On 6/3/19 6:00 PM, Ming Lei wrote:
> > > On Mon, Jun 03, 2019 at 01:44:22PM -0700, Guenter Roeck wrote:
> > > > On Sun, Apr 28, 2019 at 03:39:32PM +0800, Ming Lei wrote:
> > > > > [ ... ]
> > > > > 
> > > > 
> > > > This patch causes a variety of boot failures in -next. Typical failure
> > > > pattern is scsi hangs or failure to find a root file system. For example,
> > > > on alpha, trying to boot from usb:
> > > 
> > > I guess it is because alpha doesn't support sg chaining, and
> > > CONFIG_ARCH_NO_SG_CHAIN is enabled. ARCHs not supporting sg chaining
> > > can only be arm, alpha and parisc.
> > > 
> > 
> > I don't think it is that simple. I do see the problem on x86 (32 and 64 bit)
> > sparc, ppc, and m68k as well, and possibly others (I didn't check all because
> > -next is in terrible shape right now). Error log is always a bit different
> > but similar.
> > 
> > [ ... ]
> > 
> > On x86, x86_64 (after reverting a different crash-causing patch):
> > 
> > [   20.226809] scsi host0: DMA length is zero!
> > [   20.227459] scsi host0: cur adr[00000000] len[00000000]
> > [   50.588814] scsi host0: Aborting command [(____ptrval____):28]
> > [   50.589210] scsi host0: Current command [(____ptrval____):28]
> > [   50.589447] scsi host0:  Active command [(____ptrval____):28]
> > [   50.589674] scsi host0: Dumping command log
> 
> OK, I did see one boot crash issue on x86_64 with -next, so could
> you share the patch that needs to be reverted? In the meantime, please
> provide your steps for reproducing this issue (rootfs image, kernel
> config, qemu command).
> 
The patch to be reverted is this one. I'll prepare the rest of the
information later today.

> BTW, the patch has been tested in the RH QE lab; no such reports have
> been seen so far.
> 
FWIW, I don't think the RH QE lab tests any of the affected configurations.

Guenter

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2eaba41655de..472d848f1778 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -45,6 +45,8 @@ 
  */
 #define  SCSI_INLINE_PROT_SG_CNT  1
 
+#define  SCSI_INLINE_SG_CNT  2
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -547,7 +549,8 @@  static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->sdb.table,
+				SCSI_INLINE_SG_CNT);
 	if (scsi_prot_sg_count(cmd))
 		sg_free_table_chained(&cmd->prot_sdb->table,
 				SCSI_INLINE_PROT_SG_CNT);
@@ -984,7 +987,7 @@  static blk_status_t scsi_init_sgtable(struct request *req,
 	 */
 	if (unlikely(sg_alloc_table_chained(&sdb->table,
 			blk_rq_nr_phys_segments(req), sdb->table.sgl,
-			SG_CHUNK_SIZE)))
+			SCSI_INLINE_SG_CNT)))
 		return BLK_STS_RESOURCE;
 
 	/* 
@@ -1550,9 +1553,9 @@  static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 }
 
 /* Size in bytes of the sg-list stored in the scsi-mq command-private data. */
-static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
+static unsigned int scsi_mq_inline_sgl_size(struct Scsi_Host *shost)
 {
-	return min_t(unsigned int, shost->sg_tablesize, SG_CHUNK_SIZE) *
+	return min_t(unsigned int, shost->sg_tablesize, SCSI_INLINE_SG_CNT) *
 		sizeof(struct scatterlist);
 }
 
@@ -1730,7 +1733,7 @@  static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	if (scsi_host_get_prot(shost)) {
 		sg = (void *)cmd + sizeof(struct scsi_cmnd) +
 			shost->hostt->cmd_size;
-		cmd->prot_sdb = (void *)sg + scsi_mq_sgl_size(shost);
+		cmd->prot_sdb = (void *)sg + scsi_mq_inline_sgl_size(shost);
 	}
 
 	return 0;
@@ -1824,7 +1827,7 @@  int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
 
-	sgl_size = scsi_mq_sgl_size(shost);
+	sgl_size = scsi_mq_inline_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
 		cmd_size += sizeof(struct scsi_data_buffer) +