Message ID: 20250320111328.2841690-3-mcgrof@kernel.org (mailing list archive)
State: New
Series: nvme-pci: breaking the 512 KiB max IO boundary
On 3/20/25 4:13 AM, Luis Chamberlain wrote:
> -/*
> - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> - * however we constrain this to what we can validate and test.
> - */
> -#define BLK_MAX_BLOCK_SIZE SZ_64K
> +#define BLK_MAX_BLOCK_SIZE 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>
> /* blk_validate_limits() validates bsize, so drivers don't usually need to */
> static inline int blk_validate_block_size(unsigned long bsize)

All logical block sizes above 4 KiB trigger write amplification if there
are applications that write 4 KiB at a time, isn't it? Isn't that issue
even worse for logical block sizes above 64 KiB?

Thanks,

Bart.
On Thu, Mar 20, 2025 at 09:01:46AM -0700, Bart Van Assche wrote:
> On 3/20/25 4:13 AM, Luis Chamberlain wrote:
> > -/*
> > - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> > - * however we constrain this to what we can validate and test.
> > - */
> > -#define BLK_MAX_BLOCK_SIZE SZ_64K
> > +#define BLK_MAX_BLOCK_SIZE 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> > /* blk_validate_limits() validates bsize, so drivers don't usually need to */
> > static inline int blk_validate_block_size(unsigned long bsize)
>
> All logical block sizes above 4 KiB trigger write amplification if there
> are applications that write 4 KiB at a time, isn't it? Isn't that issue
> even worse for logical block sizes above 64 KiB?

I think everybody knows this Bart.  You've raised it before, we've
talked about it, and you're not bringing anything new to the discussion
this time.  Why bring it up again?
On 3/20/25 9:06 AM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:01:46AM -0700, Bart Van Assche wrote:
>> On 3/20/25 4:13 AM, Luis Chamberlain wrote:
>>> -/*
>>> - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>>> - * however we constrain this to what we can validate and test.
>>> - */
>>> -#define BLK_MAX_BLOCK_SIZE SZ_64K
>>> +#define BLK_MAX_BLOCK_SIZE 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>>> /* blk_validate_limits() validates bsize, so drivers don't usually need to */
>>> static inline int blk_validate_block_size(unsigned long bsize)
>>
>> All logical block sizes above 4 KiB trigger write amplification if there
>> are applications that write 4 KiB at a time, isn't it? Isn't that issue
>> even worse for logical block sizes above 64 KiB?
>
> I think everybody knows this Bart.  You've raised it before, we've
> talked about it, and you're not bringing anything new to the discussion
> this time.  Why bring it up again?

The patch description mentions what has been changed but does not
mention why. Shouldn't the description of this patch explain why this
change has been made? Shouldn't the description of this patch explain
for which applications this change is useful?

Thanks,

Bart.
On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
> The patch description mentions what has been changed but does not
> mention why. Shouldn't the description of this patch explain why this
> change has been made? Shouldn't the description of this patch explain
> for which applications this change is useful?

The manufacturer chooses the block size.  If they've made a bad
decision, their device will presumably not sell well.  We don't need to
justify their decision in the commit message.
On 3/20/25 9:27 AM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
>> The patch description mentions what has been changed but does not
>> mention why. Shouldn't the description of this patch explain why this
>> change has been made? Shouldn't the description of this patch explain
>> for which applications this change is useful?
>
> The manufacturer chooses the block size. If they've made a bad decision,
> their device will presumably not sell well. We don't need to justify
> their decision in the commit message.

The fact that this change is proposed because there are device
manufacturers that want to produce devices with block sizes larger than
64 KiB would be useful information for the commit message.

Thanks,

Bart.
On Thu, Mar 20, 2025 at 09:34:50AM -0700, Bart Van Assche wrote:
> The fact that this change is proposed because there are device
> manufacturers that want to produce devices with block sizes larger than
> 64 KiB would be useful information for the commit message.

I think it's just the proposal that is very confused.  Keith pointed out
that number of segments mostly matters when you have 4k non-contiguous
pages and that's where the quoted limit comes from.  The LBA size is a
lower bound to the folio size in the page cache, so the limit
automatically increases with that, although in practice the file system
will usually allocate larger folios even with a smaller block size, and
I assume most systems (or at least the ones that care about performance)
actually use transparent huge pages/folios for anonymous memory as well.
On 3/20/25 12:27 PM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
>> The patch description mentions what has been changed but does not
>> mention why. Shouldn't the description of this patch explain why this
>> change has been made? Shouldn't the description of this patch explain
>> for which applications this change is useful?
>
> The manufacturer chooses the block size. If they've made a bad decision,
> their device will presumably not sell well. We don't need to justify
> their decision in the commit message.

From a 2023 presentation by Luis
(https://lpc.events/event/17/contributions/1508/attachments/1298/2608/LBS_LPC2023.pdf):
- SSD manufacturers want to increase the indirection unit (IU) size.
- Increasing the IU size reduces SSD DRAM costs.
- LBS is not suitable for all workloads because smaller IOs with LBS can
  cause write amplification (WAF) due to read modify writes.
- Some database software benefits of a 16 KiB logical block size.

If the goal is to reduce DRAM costs then I recommend SSD manufacturers
to implement zoned storage (ZNS) instead of only increasing the logical
block size. A big advantage of zoned storage is that the DRAM cost is
reduced significantly even if the block size is not increased.

Are there any applications that benefit from a block size larger than
64 KiB? If not, why to increase BLK_MAX_BLOCK_SIZE further? Do you agree
that this question should be answered in the patch description?

Thanks,

Bart.
On Mon, Mar 24, 2025 at 06:58:26AM -0400, Bart Van Assche wrote:
> If the goal is to reduce DRAM costs then I recommend SSD manufacturers
> to implement zoned storage (ZNS) instead of only increasing the logical
> block size. A big advantage of zoned storage is that the DRAM cost is
> reduced significantly even if the block size is not increased.
>
> Are there any applications that benefit from a block size larger than
> 64 KiB? If not, why to increase BLK_MAX_BLOCK_SIZE further? Do you agree
> that this question should be answered in the patch description?

Do I agree that we should use the commit message to enter into a
philosophical debate about whether ZNS or large block sizes are better?
No, I do not.  I don't even think we should have this discussion any
more on this mailing list; I think everyone is aware that both
alternatives exist.  You don't like it, and that's your prerogative.
But at some point you have to stop being an awkward cuss about it.

I think CXL is an abomination; I've made this point often enough that
everybody is aware of it.  I don't make it any more.  All I do is NACK
the inclusion of patches that are only for the benefit of CXL until CXL
has actually demonstrated its utility.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1c0cf6af392c..9e1b3e7526d9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -26,6 +26,7 @@
 #include <linux/xarray.h>
 #include <linux/file.h>
 #include <linux/lockdep.h>
+#include <linux/pagemap.h>
 
 struct module;
 struct request_queue;
@@ -268,11 +269,7 @@ static inline dev_t disk_devt(struct gendisk *disk)
 	return MKDEV(disk->major, disk->first_minor);
 }
 
-/*
- * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
- * however we constrain this to what we can validate and test.
- */
-#define BLK_MAX_BLOCK_SIZE	SZ_64K
+#define BLK_MAX_BLOCK_SIZE	1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
 
 /* blk_validate_limits() validates bsize, so drivers don't usually need to */
 static inline int blk_validate_block_size(unsigned long bsize)
It's a brave new world. This is now part of the validation to get to at
least the page cache limit.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/blkdev.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)