
scsi: SSDs can timeout during FS init because of too many unmaps

Message ID 1505826137-35001-1-git-send-email-William.Kuzeja@stratus.com (mailing list archive)
State Rejected, archived

Commit Message

Bill Kuzeja Sept. 19, 2017, 1:02 p.m. UTC
I encountered this issue putting XFS on several brands of SSDs on my
system. During initialization, I would see a bunch of timeouts on
WRITE_SAME_16 commands, which would get aborted, reissued, and complete.
The logs look like this:

kernel: sd 2:0:1:0: attempting task abort! scmd(ffff88086ca0c8c0)
kernel: sd 1:0:1:0: attempting task abort! scmd(ffff88086c7f2bc0)
kernel: sd 1:0:1:0: [sds] CDB: Write same(16) 93 08 00 00 00 00 24 04
07 b8 00 7f ff ff 00 00
kernel: sd 2:0:1:0: [sdt] CDB: Write same(16) 93 08 00 00 00 00 24 04
07 b8 00 7f ff ff 00 00

And so on (there are many, many more of these)...

The interesting thing to note is that these are WS16 commands with the
unmap bit set (this drive's version of UNMAP) with a length of 0x7fffff.
That is over 8.3 million blocks to be unmapped at once. Since there are
many concurrent "unmaps", the drive can get overwhelmed and time out.
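For reference, here is how one of those CDBs breaks down (my
annotation, per the SBC WRITE SAME(16) format):

   93                        opcode: WRITE SAME(16)
   08                        flags: UNMAP bit (bit 3) set
   00 00 00 00 24 04 07 b8   logical block address: 0x240407b8
   00 7f ff ff               number of blocks: 0x7fffff (8,388,607)
   00                        group number
   00                        control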

----

Why does this happen? Initializing the device with a filesystem (in my
experience XFS) creates one huge discard for the SSD. This gets
broken apart into smaller unmap segments, which get sent down to the
drive. For the SSDs that I've been working with (lbpws is always set),
UNMAPs always get translated to WS16 with the unmap bit set.

The size of these segments is determined when the drive is set up
initially. Early on, in the routine sd_read_block_limits, we read the
Block Limits VPD page (page 0xb0). Among other things, this page gives
us the drive's MAX UNMAP LBA count as well as the MAX WRITE SAME length.
In my experience, every SSD returns zero for MAX WRITE SAME length but
does have a real value for MAX UNMAP LBA count.
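Roughly, sd_read_block_limits picks those fields up like this (a
paraphrase of drivers/scsi/sd.c, not a verbatim excerpt; field
offsets per the SBC Block Limits page layout):

	/* buffer holds the Block Limits VPD page (0xb0) */
	lba_count  = get_unaligned_be32(&buffer[20]); /* MAX UNMAP LBA COUNT */
	desc_count = get_unaligned_be32(&buffer[24]); /* MAX UNMAP BLOCK DESC COUNT */
	if (lba_count && desc_count)
		sdkp->max_unmap_blocks = lba_count;

	/* MAX WRITE SAME LENGTH - zero on the SSDs described above */
	sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);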

The way the code is structured, because we read in zero for
MAX WRITE SAME, we assume there is no limit for write same commands.
This *may* be true, but unmap/discard commands translate into write
same commands with the unmap bit set. Technically, this makes them
no longer write same commands.

Currently, the max discard size is actually based on max_ws_blocks.
When configuring the max discard size later, we default to
SD_MAX_WS16_BLOCKS (0x7fffff) because max_ws_blocks is currently
always zero:

             max_blocks = min_not_zero(sdkp->max_ws_blocks,
                                          (u32)SD_MAX_WS16_BLOCKS);
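(min_not_zero(), from include/linux/kernel.h, returns the smaller of
its two arguments while ignoring any that are zero, so with
max_ws_blocks == 0 the expression above simply evaluates to
SD_MAX_WS16_BLOCKS. That result then becomes the block layer's maximum
discard size, roughly like this, again paraphrasing sd_config_discard:

	/* q is the disk's request queue */
	blk_queue_max_discard_sectors(q,
			max_blocks * (logical_block_size >> 9));

So whatever lands in max_blocks here directly bounds how large each
discard segment sent to the drive can be.)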

A reasonable fix for this would be to use the MAX UNMAP LBA count
(stored as max_unmap_blocks) instead of max_ws_blocks in the case where
we're defaulting to WS16 for unmaps.

After discussing this issue with an SSD vendor's firmware team, they
confirmed that this was a good way to proceed. That is, it made sense to
use the max_unmap_blocks count instead of the default SD_MAX_WS16_BLOCKS
value because 1) max_ws_blocks is always zero, 2) these are really
unmap commands we're issuing, and 3) the SSD can handle a few unmaps
the size of SD_MAX_WS16_BLOCKS but not necessarily a barrage of them.

The largest max unmap size I've seen returned from a drive (in the
Block Limits VPD page) is 0x270000, or about 30% of SD_MAX_WS16_BLOCKS.
Other sizes are much smaller, typically 0x80000, or about 6% of that
maximum.
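In decimal, that works out to:

	0x7fffff = 8,388,607 blocks (SD_MAX_WS16_BLOCKS)
	0x270000 = 2,555,904 blocks (~30% of SD_MAX_WS16_BLOCKS)
	0x080000 =   524,288 blocks (~6% of SD_MAX_WS16_BLOCKS)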

I've also done performance testing for this change. The only impact I've
seen on SSDs is during filesystem initialization, which is expected
since that's most likely the only time we'd be doing really large
unmaps. Even so, the impact on FS init is fairly minimal: roughly a 10%
increase for some models of SSDs, and no noticeable difference at all
for others.

----

Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
---
 drivers/scsi/sd.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Ewan Milne Sept. 19, 2017, 4:13 p.m. UTC | #1
On Tue, 2017-09-19 at 09:02 -0400, Bill Kuzeja wrote:
> I encountered this issue putting XFS on several brands of SSDs on my
> system. During initialization, I would see a bunch of timeouts on
> WRITE_SAME_16 commands, which would get aborted, reissued, and complete.
> The logs look like this:
> 
> kernel: sd 2:0:1:0: attempting task abort! scmd(ffff88086ca0c8c0)
> kernel: sd 1:0:1:0: attempting task abort! scmd(ffff88086c7f2bc0)
> kernel: sd 1:0:1:0: [sds] CDB: Write same(16) 93 08 00 00 00 00 24 04
> 07 b8 00 7f ff ff 00 00
> kernel: sd 2:0:1:0: [sdt] CDB: Write same(16) 93 08 00 00 00 00 24 04
> 07 b8 00 7f ff ff 00 00
> 
> And so on (there are many, many more of these)...
> 
> The interesting thing to note is that these are WS16 commands with the
> unmap bit set (this drive's version of UNMAP) with a length of 0x7fffff.
> That is over 8.3 million blocks to be unmapped at once. Since there are
> many concurrent "unmaps", the drive can get overwhelmed and time out.

There is another problem as well.  There are some enterprise storage
arrays that are rejecting large WRITE SAME(16) commands w/UNMAP set
with ILLEGAL REQUEST / INVALID FIELD IN CDB.  As far as I can tell,
T10 SBC says that the MAXIMUM WRITE SAME LENGTH field in the block
limits VPD page should describe the limit for these commands, but
the arrays appear to reject anything larger than MAXIMUM UNMAP LBA COUNT.
i.e. they are treating WRITE SAME(16) w/UNMAP the same as an UNMAP CDB.

I had come up with something similar, see my comment on your patch below.

> 
> Why does this happen? Initializing the device with a filesystem (in my
> experience XFS) creates one huge discard for the SSD. This gets
> broken apart into smaller unmap segments, which get sent down to the
> drive. For the SSDs that I've been working with (lbpws is always set),
> UNMAPs always get translated to WS16 with the unmap bit set.
> 
> The size of these segments is determined when the drive is set up
> initially. Early on, in the routine sd_read_block_limits, we read the
> Block Limits VPD page (page 0xb0). Among other things, this page gives
> us the drive's MAX UNMAP LBA count as well as the MAX WRITE SAME length.
> In my experience, every SSD returns zero for MAX WRITE SAME length but
> does have a real value for MAX UNMAP LBA count.
> 
> The way the code is structured, because we read in zero for
> MAX WRITE SAME, we assume there is no limit for write same commands.
> This *may* be true, but unmap/discard commands translate into write
> same commands with the unmap bit set. Technically, this makes them
> no longer write same commands.
> 
> Currently, the max discard size is actually based on max_ws_blocks.
> When configuring the max discard size later, we default to
> SD_MAX_WS16_BLOCKS (0x7fffff) because max_ws_blocks is currently
> always zero:
> 
>              max_blocks = min_not_zero(sdkp->max_ws_blocks,
>                                           (u32)SD_MAX_WS16_BLOCKS);
> 
> A reasonable fix for this would be to use the MAX UNMAP LBA count
> (stored as max_unmap_blocks) instead of max_ws_blocks in the case where
> we're defaulting to WS16 for unmaps.
> 
> After discussing this issue with an SSD vendor's firmware team, they
> confirmed that this was a good way to proceed. That is, it made sense to
> use the max_unmap_blocks count instead of the default SD_MAX_WS16_BLOCKS
> value because 1) max_ws_blocks is always zero, 2) these are really
> unmap commands we're issuing, and 3) the SSD can handle a few unmaps
> the size of SD_MAX_WS16_BLOCKS but not necessarily a barrage of them.
> 
> The largest max unmap size I've seen returned from a drive (in the
> Block Limits VPD page) is 0x270000, or about 30% of SD_MAX_WS16_BLOCKS.
> Other sizes are much smaller, typically 0x80000, or about 6% of that
> maximum.
> 
> I've also done performance testing for this change. The only impact I've
> seen on SSDs is during filesystem initialization, which is expected
> since that's most likely the only time we'd be doing really large
> unmaps. Even so, the impact on FS init is fairly minimal: roughly a 10%
> increase for some models of SSDs, and no noticeable difference at all
> for others.
> 
> ----
> 
> Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
> ---
>  drivers/scsi/sd.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 11c1738..f24c4f2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -715,8 +715,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
>  		break;
>  
>  	case SD_LBP_WS16:
> -		max_blocks = min_not_zero(sdkp->max_ws_blocks,
> -					  (u32)SD_MAX_WS16_BLOCKS);
> +		/* If no value given, use unmap limit - WS16 default too large */
> +		if (!sdkp->max_ws_blocks)
> +			max_blocks = min_not_zero(sdkp->max_unmap_blocks,
> +						  (u32)SD_MAX_WS16_BLOCKS);
> +		else
> +			max_blocks = min_not_zero(sdkp->max_ws_blocks,
> +						  (u32)SD_MAX_WS16_BLOCKS);
>  		break;
>  
>  	case SD_LBP_WS10:

I've been testing a similar patch. Your patch does not change the
WRITE SAME(10) case, and it does not do anything if max_ws_blocks is
nonzero but larger than max_unmap_blocks. It might address your SSD
issue, but it may not handle the other cases I've had reported to me.
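For illustration, one more general shape would be to clamp the WS16
discard limit by both reported values (a sketch only, under the
assumption that the smaller of the two reported limits is always safe;
the actual patch may differ):

	case SD_LBP_WS16:
		/* Honor the smaller of the two limits the drive reported */
		max_blocks = min_not_zero(sdkp->max_ws_blocks,
					  sdkp->max_unmap_blocks);
		max_blocks = min_not_zero(max_blocks,
					  (u32)SD_MAX_WS16_BLOCKS);
		break;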

I'll post my version for comparison and review.

-Ewan

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 11c1738..f24c4f2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -715,8 +715,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 		break;
 
 	case SD_LBP_WS16:
-		max_blocks = min_not_zero(sdkp->max_ws_blocks,
-					  (u32)SD_MAX_WS16_BLOCKS);
+		/* If no value given, use unmap limit - WS16 default too large */
+		if (!sdkp->max_ws_blocks)
+			max_blocks = min_not_zero(sdkp->max_unmap_blocks,
+						  (u32)SD_MAX_WS16_BLOCKS);
+		else
+			max_blocks = min_not_zero(sdkp->max_ws_blocks,
+						  (u32)SD_MAX_WS16_BLOCKS);
 		break;
 
 	case SD_LBP_WS10: