diff mbox series

sd_zbc: Fix report zones buffer allocation

Message ID 20190620034812.3254-1-damien.lemoal@wdc.com (mailing list archive)
State Changes Requested
Headers show
Series sd_zbc: Fix report zones buffer allocation | expand

Commit Message

Damien Le Moal June 20, 2019, 3:48 a.m. UTC
During disk revalidation done with sd_revalidate(), the zones of a
zoned disk zones are checked using the helper function
blk_revalidate_disk_zones() if a zone configuration change is detected
(change in the number of zones or zone size). The function
blk_revalidate_disk_zones() issues report_zones calls that are very
large, that is, to obtain zone information for all zones of the disk
with a single command. The size of the report zones command buffer
necessary for such large request generally is lower than the disk
max_hw_sectors and KMALLOC_MAX_SIZE (4MB) but still very large (e.g.
aboiut 3.5MB for a 15TB disk with 256MB zones). This large report zones
reply buffer allocation with kmalloc succeeds on boot, but frequently
fails at run time, especially for a system under memory pressure. This
causes the disk revalidation to fail and the disk capacity to be
changed to 0.

This problem can be avoided with a more intelligent report zones buffer
allocation. This patch introduces the arbitrary SD_ZBC_REPORT_SIZE
allocation limit of 1MB allowing to fit 16383 zone descriptor for every
report zone command execution, thus allowing a full zone report with 4
or 5 commands for most ZBC/ZAC disks today. This limit may be lowered to
satisfy the disk max_hw_sectors limit. Furthermore, further reduce the
likelyhood of a buffer allocation failure while guaranteeing progress
in the zone report by retrying the buffer allocation with a smaller
size in case kmalloc() fails.

Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
Fixes: e76239a3748c ("block: add a report_zones method")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd_zbc.c | 54 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 45 insertions(+), 9 deletions(-)

Comments

Christoph Hellwig June 20, 2019, 6:28 a.m. UTC | #1
This looks like what we discussed last week, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche June 20, 2019, 3:24 p.m. UTC | #2
On 6/19/19 8:48 PM, Damien Le Moal wrote:
> +	/*
> +	 * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE
> +	 * size (1MB), allowing up to 16383 zone descriptors being reported with
> +	 * a single command. And make sure that this size does not exceed the
> +	 * hardware capabilities. To avoid disk revalidation failures due to
> +	 * memory allocation errors, retry the allocation with a smaller buffer
> +	 * size if the allocation fails.
> +	 */
> +	bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE);
> +	bufsize = min_t(size_t, bufsize,
> +			queue_max_hw_sectors(disk->queue) << 9);
> +	for (order = get_order(bufsize); order >= 0; order--) {
> +		page = alloc_pages(gfp_mask, order);
> +		if (page) {
> +			*buflen = PAGE_SIZE << order;
> +			return page_address(page);
> +		}
> +	}

Hi Damien,

As you know Linux memory fragmentation tends to increase over time. The 
above code has the very unfortunate property that the more memory is 
fragmented the smaller the allocated buffer will become. I don't think 
that's how kernel code should work. Have you considered to use vmalloc() 
+ blk_rq_map_sg() instead? See also efa_vmalloc_buf_to_sg() for an 
example of how to build a scatterlist for memory allocated by vmalloc().

Thanks,

Bart.
Damien Le Moal June 21, 2019, 12:58 a.m. UTC | #3
Bart,

On 2019/06/21 0:25, Bart Van Assche wrote:
> On 6/19/19 8:48 PM, Damien Le Moal wrote:
>> +	/*
>> +	 * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE
>> +	 * size (1MB), allowing up to 16383 zone descriptors being reported with
>> +	 * a single command. And make sure that this size does not exceed the
>> +	 * hardware capabilities. To avoid disk revalidation failures due to
>> +	 * memory allocation errors, retry the allocation with a smaller buffer
>> +	 * size if the allocation fails.
>> +	 */
>> +	bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE);
>> +	bufsize = min_t(size_t, bufsize,
>> +			queue_max_hw_sectors(disk->queue) << 9);
>> +	for (order = get_order(bufsize); order >= 0; order--) {
>> +		page = alloc_pages(gfp_mask, order);
>> +		if (page) {
>> +			*buflen = PAGE_SIZE << order;
>> +			return page_address(page);
>> +		}
>> +	}
> 
> Hi Damien,
> 
> As you know Linux memory fragmentation tends to increase over time. The 
> above code has the very unfortunate property that the more memory is 
> fragmented the smaller the allocated buffer will become. I don't think 
> that's how kernel code should work. Have you considered to use vmalloc() 
> + blk_rq_map_sg() instead? See also efa_vmalloc_buf_to_sg() for an 
> example of how to build a scatterlist for memory allocated by vmalloc().

The REPORT_ZONES command is executed using scsi_execute_req(). Can we pass a
vmalloc-ed buffer to that function ? It does look like it since scsi_execute_rq
calls bio_rq_map_kern() which then calls bio_map_kern() which goes through the
list of pages of the buffer. Just would like to confirm I understand this correctly.

My concern with using vmalloc is that the worst case scenario will result in all
pages of the buffer being non contiguous. In this case, since the report zones
command cannot be split, we would need to limit the allocation to max_segments *
page size, and that can be pretty small for some HBAs.

Another reason I did not pursue the vmalloc route is that the processing of the
report zones reply to transform zone information into struct blkzone is really
painful to do with a vmalloced buffer as every page of the buffer needs to be
kmap()/kunmap(). The code was like that when REPORT ZONES was processed as a
BIO/request, but it was a lot of code for not much to be done. Or is there a
more elegant solution for in-kernel mapping of a vmalloc buffer ?
Bart Van Assche June 21, 2019, 3:46 a.m. UTC | #4
On 6/20/19 5:58 PM, Damien Le Moal wrote:
> The REPORT_ZONES command is executed using scsi_execute_req(). Can we pass a
> vmalloc-ed buffer to that function ? It does look like it since scsi_execute_rq
> calls bio_rq_map_kern() which then calls bio_map_kern() which goes through the
> list of pages of the buffer. Just would like to confirm I understand this correctly.
> 
> My concern with using vmalloc is that the worst case scenario will result in all
> pages of the buffer being non contiguous. In this case, since the report zones
> command cannot be split, we would need to limit the allocation to max_segments *
> page size, and that can be pretty small for some HBAs.
> 
> Another reason I did not pursue the vmalloc route is that the processing of the
> report zones reply to transform zone information into struct blkzone is really
> painful to do with a vmalloced buffer as every page of the buffer needs to be
> kmap()/kunmap(). The code was like that when REPORT ZONES was processed as a
> BIO/request, but it was a lot of code for not much to be done. Or is there a
> more elegant solution for in-kernel mapping of a vmalloc buffer ?

Hi Damien,

I don't think that bio_rq_map_kern() works with vmalloc-ed buffers. How 
about using is_vmalloc_addr() inside scsi_execute_req() to determine 
whether or not the buffer passed to that function has been allocated 
with vmalloc()? There may be other scsi_execute_req() callers that can 
benefit from passing a vmalloc-ed buffer to that function.

Regarding the maximum segment size: is mpt3sas still the most popular 
HBA? Is the maximum segment size 128 for that driver? Is 128 * 4 KB = 
512 KB big enough for the report zones buffer?

Regarding the loop that calls sd_zbc_parse_report(): are you sure that 
that kmap()/kunmap() calls would be necessary in that loop?

Thanks,

Bart.
Damien Le Moal June 21, 2019, 8:03 a.m. UTC | #5
Bart,

On 2019/06/21 12:46, Bart Van Assche wrote:
> On 6/20/19 5:58 PM, Damien Le Moal wrote:
>> The REPORT_ZONES command is executed using scsi_execute_req(). Can we pass a
>> vmalloc-ed buffer to that function ? It does look like it since scsi_execute_rq
>> calls bio_rq_map_kern() which then calls bio_map_kern() which goes through the
>> list of pages of the buffer. Just would like to confirm I understand this correctly.
>>
>> My concern with using vmalloc is that the worst case scenario will result in all
>> pages of the buffer being non contiguous. In this case, since the report zones
>> command cannot be split, we would need to limit the allocation to max_segments *
>> page size, and that can be pretty small for some HBAs.
>>
>> Another reason I did not pursue the vmalloc route is that the processing of the
>> report zones reply to transform zone information into struct blkzone is really
>> painful to do with a vmalloced buffer as every page of the buffer needs to be
>> kmap()/kunmap(). The code was like that when REPORT ZONES was processed as a
>> BIO/request, but it was a lot of code for not much to be done. Or is there a
>> more elegant solution for in-kernel mapping of a vmalloc buffer ?
> 
> Hi Damien,
> 
> I don't think that bio_rq_map_kern() works with vmalloc-ed buffers. How 
> about using is_vmalloc_addr() inside scsi_execute_req() to determine 
> whether or not the buffer passed to that function has been allocated 
> with vmalloc()? There may be other scsi_execute_req() callers that can 
> benefit from passing a vmalloc-ed buffer to that function.

Sure, we could do that. But since most (if not all) users of scsi_execute_req()
need only a very small buffer, I am not sure if this would be very useful.

> Regarding the maximum segment size: is mpt3sas still the most popular 
> HBA? Is the maximum segment size 128 for that driver? Is 128 * 4 KB = 
> 512 KB big enough for the report zones buffer?

Sure, it works, but compared to the target 1MB allocation that my patch has,
that doubles the number of requests needed for reporting the zones of an entire
drive. Since this is however not used a lot, I guess it is okay. But still, I do
not like very much slowing down revalidate() nor regular blkdev_report_zones()
calls...

> Regarding the loop that calls sd_zbc_parse_report(): are you sure that 
> that kmap()/kunmap() calls would be necessary in that loop?

No I am not sure. I will check again.

Best regards.
Damien Le Moal June 22, 2019, 7:44 a.m. UTC | #6
On 2019/06/21 0:25, Bart Van Assche wrote:
> On 6/19/19 8:48 PM, Damien Le Moal wrote:
>> +	/*
>> +	 * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE
>> +	 * size (1MB), allowing up to 16383 zone descriptors being reported with
>> +	 * a single command. And make sure that this size does not exceed the
>> +	 * hardware capabilities. To avoid disk revalidation failures due to
>> +	 * memory allocation errors, retry the allocation with a smaller buffer
>> +	 * size if the allocation fails.
>> +	 */
>> +	bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE);
>> +	bufsize = min_t(size_t, bufsize,
>> +			queue_max_hw_sectors(disk->queue) << 9);
>> +	for (order = get_order(bufsize); order >= 0; order--) {
>> +		page = alloc_pages(gfp_mask, order);
>> +		if (page) {
>> +			*buflen = PAGE_SIZE << order;
>> +			return page_address(page);
>> +		}
>> +	}
> 
> Hi Damien,
> 
> As you know Linux memory fragmentation tends to increase over time. The 
> above code has the very unfortunate property that the more memory is 
> fragmented the smaller the allocated buffer will become. I don't think 
> that's how kernel code should work. Have you considered to use vmalloc() 
> + blk_rq_map_sg() instead? See also efa_vmalloc_buf_to_sg() for an 
> example of how to build a scatterlist for memory allocated by vmalloc().

Bart,

I have a fix along the lines you suggested, but since it modifies
bio_rq_map_kern(), I would rather not push that as a bug fix this late in RC.
Would you agree to accept the fix patch as is for now and I will send the more
complete fix for 5.3 ? Note that this more complete fix also reworks the similar
memory allocation for the struct blk_zone array used for zone revalidation. Put
all together, the use of report zones uses only vmalloc-ed buffers and data
structures, reduces pressure on the memory system and reducing chances of failures.

Best regards.
Bart Van Assche June 24, 2019, 3:08 p.m. UTC | #7
On 6/22/19 12:44 AM, Damien Le Moal wrote:
> I have a fix along the lines you suggested, but since it modifies
> bio_rq_map_kern(), I would rather not push that as a bug fix this late in RC.
> Would you agree to accept the fix patch as is for now and I will send the more
> complete fix for 5.3 ? Note that this more complete fix also reworks the similar
> memory allocation for the struct blk_zone array used for zone revalidation. Put
> all together, the use of report zones uses only vmalloc-ed buffers and data
> structures, reduces pressure on the memory system and reducing chances of failures.

Hi Damien,

I think it's up to Martin to decide how to proceed.

Bart.
diff mbox series

Patch

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 7334024b64f1..37469d77264e 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -103,6 +103,44 @@  static int sd_zbc_do_report_zones(struct scsi_disk *sdkp, unsigned char *buf,
 	return 0;
 }
 
+/**
+ * Arbitrary maximum report zones buffer size of 1MB, fitting 16383 x 64B zone
+ * descriptors plus the 64B report header.
+ */
+#define SD_ZBC_REPORT_SIZE (16384U * 64U)
+
+/**
+ * Allocate a buffer for report zones.
+ */
+static void *sd_zbc_alloc_report_buffer(struct gendisk *disk, size_t *buflen,
+					gfp_t gfp_mask)
+{
+	struct page *page;
+	size_t bufsize;
+	int order;
+
+	/*
+	 * Limit the command buffer size to the arbitrary SD_ZBC_REPORT_SIZE
+	 * size (1MB), allowing up to 16383 zone descriptors being reported with
+	 * a single command. And make sure that this size does not exceed the
+	 * hardware capabilities. To avoid disk revalidation failures due to
+	 * memory allocation errors, retry the allocation with a smaller buffer
+	 * size if the allocation fails.
+	 */
+	bufsize = min_t(size_t, *buflen, SD_ZBC_REPORT_SIZE);
+	bufsize = min_t(size_t, bufsize,
+			queue_max_hw_sectors(disk->queue) << 9);
+	for (order = get_order(bufsize); order >= 0; order--) {
+		page = alloc_pages(gfp_mask, order);
+		if (page) {
+			*buflen = PAGE_SIZE << order;
+			return page_address(page);
+		}
+	}
+
+	return NULL;
+}
+
 /**
  * sd_zbc_report_zones - Disk report zones operation.
  * @disk: The target disk
@@ -118,9 +156,9 @@  int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 			gfp_t gfp_mask)
 {
 	struct scsi_disk *sdkp = scsi_disk(disk);
-	unsigned int i, buflen, nrz = *nr_zones;
+	unsigned int i, nrz = *nr_zones;
 	unsigned char *buf;
-	size_t offset = 0;
+	size_t buflen, offset = 0;
 	int ret = 0;
 
 	if (!sd_is_zoned(sdkp))
@@ -128,13 +166,11 @@  int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		return -EOPNOTSUPP;
 
 	/*
-	 * Get a reply buffer for the number of requested zones plus a header,
-	 * without exceeding the device maximum command size. For ATA disks,
-	 * buffers must be aligned to 512B.
+	 * Try to get a buffer that can fits the requested number of zones plus
+	 * the command reply header, all 64B in size.
 	 */
-	buflen = min(queue_max_hw_sectors(disk->queue) << 9,
-		     roundup((nrz + 1) * 64, 512));
-	buf = kmalloc(buflen, gfp_mask);
+	buflen = (nrz + 1) * 64;
+	buf = sd_zbc_alloc_report_buffer(disk, &buflen, gfp_mask);
 	if (!buf)
 		return -ENOMEM;
 
@@ -153,7 +189,7 @@  int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 	*nr_zones = nrz;
 
 out_free_buf:
-	kfree(buf);
+	free_pages((unsigned long)buf, get_order(buflen));
 
 	return ret;
 }