[RFC,2/4] blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit

Message ID: 20250320111328.2841690-3-mcgrof@kernel.org
State: New
Series: nvme-pci: breaking the 512 KiB max IO boundary

Commit Message

Luis Chamberlain March 20, 2025, 11:13 a.m. UTC
It's a brave new world. Lift BLK_MAX_BLOCK_SIZE to the page cache limit,
so that block size validation accepts anything up to that limit.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/blkdev.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comments

Bart Van Assche March 20, 2025, 4:01 p.m. UTC | #1
On 3/20/25 4:13 AM, Luis Chamberlain wrote:
> -/*
> - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> - * however we constrain this to what we can validate and test.
> - */
> -#define BLK_MAX_BLOCK_SIZE      SZ_64K
> +#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>   
>   /* blk_validate_limits() validates bsize, so drivers don't usually need to */
>   static inline int blk_validate_block_size(unsigned long bsize)

All logical block sizes above 4 KiB trigger write amplification if there
are applications that write 4 KiB at a time, don't they? Isn't that issue
even worse for logical block sizes above 64 KiB?

Thanks,

Bart.
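
The read-modify-write effect Bart refers to can be shown with a minimal
user-space sketch; the 4 KiB application write size and the candidate
block sizes below are illustrative assumptions, not numbers from the
patch:

/*
 * Minimal illustration (user-space C, not kernel code): a device whose
 * logical block size exceeds the application's write size must
 * read-modify-write a whole block for every small write, so the bytes
 * physically written grow with the block size.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long app_write = 4096;	/* app writes 4 KiB */
	const unsigned long lbs[] = { 4096, 16384, 65536, 262144 };

	for (int i = 0; i < 4; i++) {
		/* Bytes the device must write per 4 KiB application write. */
		unsigned long dev_write = lbs[i] > app_write ? lbs[i] : app_write;
		printf("lbs %7lu: write amplification %2lux\n",
		       lbs[i], dev_write / app_write);
	}
	return 0;
}

With these assumptions a 64 KiB block size amplifies 4 KiB writes 16x,
and 256 KiB amplifies them 64x, which is the scaling behind the question
above.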
Matthew Wilcox March 20, 2025, 4:06 p.m. UTC | #2
On Thu, Mar 20, 2025 at 09:01:46AM -0700, Bart Van Assche wrote:
> On 3/20/25 4:13 AM, Luis Chamberlain wrote:
> > -/*
> > - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> > - * however we constrain this to what we can validate and test.
> > - */
> > -#define BLK_MAX_BLOCK_SIZE      SZ_64K
> > +#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
> >   /* blk_validate_limits() validates bsize, so drivers don't usually need to */
> >   static inline int blk_validate_block_size(unsigned long bsize)
> 
> All logical block sizes above 4 KiB trigger write amplification if there
> are applications that write 4 KiB at a time, don't they? Isn't that issue
> even worse for logical block sizes above 64 KiB?

I think everybody knows this, Bart.  You've raised it before, we've
talked about it, and you're not bringing anything new to the discussion
this time.  Why bring it up again?
Bart Van Assche March 20, 2025, 4:15 p.m. UTC | #3
On 3/20/25 9:06 AM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:01:46AM -0700, Bart Van Assche wrote:
>> On 3/20/25 4:13 AM, Luis Chamberlain wrote:
>>> -/*
>>> - * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>>> - * however we constrain this to what we can validate and test.
>>> - */
>>> -#define BLK_MAX_BLOCK_SIZE      SZ_64K
>>> +#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
>>>    /* blk_validate_limits() validates bsize, so drivers don't usually need to */
>>>    static inline int blk_validate_block_size(unsigned long bsize)
>>
>> All logical block sizes above 4 KiB trigger write amplification if there
>> are applications that write 4 KiB at a time, don't they? Isn't that issue
>> even worse for logical block sizes above 64 KiB?
> 
> I think everybody knows this, Bart.  You've raised it before, we've
> talked about it, and you're not bringing anything new to the discussion
> this time.  Why bring it up again?

The patch description mentions what has been changed but does not
mention why. Shouldn't the description of this patch explain why this
change has been made? Shouldn't the description of this patch explain
for which applications this change is useful?

Thanks,

Bart.
Matthew Wilcox March 20, 2025, 4:27 p.m. UTC | #4
On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
> The patch description mentions what has been changed but does not
> mention why. Shouldn't the description of this patch explain why this
> change has been made? Shouldn't the description of this patch explain
> for which applications this change is useful?

The manufacturer chooses the block size.  If they've made a bad decision,
their device will presumably not sell well.  We don't need to justify
their decision in the commit message.
Bart Van Assche March 20, 2025, 4:34 p.m. UTC | #5
On 3/20/25 9:27 AM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
>> The patch description mentions what has been changed but does not
>> mention why. Shouldn't the description of this patch explain why this
>> change has been made? Shouldn't the description of this patch explain
>> for which applications this change is useful?
> 
> The manufacturer chooses the block size.  If they've made a bad decision,
> their device will presumably not sell well.  We don't need to justify
> their decision in the commit message.

The fact that this change is proposed because there are device
manufacturers that want to produce devices with block sizes larger than
64 KiB would be useful information for the commit message.

Thanks,

Bart.
Christoph Hellwig March 20, 2025, 4:44 p.m. UTC | #6
On Thu, Mar 20, 2025 at 09:34:50AM -0700, Bart Van Assche wrote:
> The fact that this change is proposed because there are device
> manufacturers that want to produce devices with block sizes larger than
> 64 KiB would be useful information for the commit message.

I think it's just the proposal that is very confused.  Keith pointed
out that the number of segments mostly matters when you have 4k
non-contiguous pages, and that's where the quoted limit comes from.

The LBA size is a lower bound on the folio size in the page cache,
so the limit automatically increases with it, although in practice
the file system will usually allocate larger folios even with a smaller
block size, and I assume most systems (or at least the ones that care
about performance) actually use transparent huge pages/folios for
anonymous memory as well.
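
A rough sketch of the lower-bound relationship Christoph describes,
assuming 4 KiB pages. lbs_min_folio_order() is a hypothetical helper
for illustration; mapping_set_folio_min_order() is the pagemap helper
the LBS page cache work added, shown here only as a usage hint:

/*
 * Sketch (kernel context: get_order() from <linux/mm.h>): a logical
 * block size above PAGE_SIZE forces a minimum folio order in the page
 * cache, so the largest supportable block size is bounded by
 * MAX_PAGECACHE_ORDER.
 */
static inline unsigned int lbs_min_folio_order(unsigned long lbs)
{
	if (lbs <= PAGE_SIZE)
		return 0;		/* order-0 folios are enough */
	return get_order(lbs);		/* e.g. 16 KiB -> order 2 with 4 KiB pages */
}

/*
 * A filesystem mounting with a large block size would then pin the
 * minimum folio order for the inode's mapping, roughly:
 *
 *	mapping_set_folio_min_order(inode->i_mapping,
 *				    lbs_min_folio_order(lbs));
 */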
Bart Van Assche March 24, 2025, 10:58 a.m. UTC | #7
On 3/20/25 12:27 PM, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 09:15:23AM -0700, Bart Van Assche wrote:
>> The patch description mentions what has been changed but does not
>> mention why. Shouldn't the description of this patch explain why this
>> change has been made? Shouldn't the description of this patch explain
>> for which applications this change is useful?
> 
> The manufacturer chooses the block size.  If they've made a bad decision,
> their device will presumably not sell well.  We don't need to justify
> their decision in the commit message.

From a 2023 presentation by Luis
(https://lpc.events/event/17/contributions/1508/attachments/1298/2608/LBS_LPC2023.pdf):
- SSD manufacturers want to increase the indirection unit (IU) size.
- Increasing the IU size reduces SSD DRAM costs (see the worked example
   after this list).
- LBS is not suitable for all workloads because smaller IOs with LBS can
   cause write amplification (WAF) due to read-modify-writes.
- Some database software benefits from a 16 KiB logical block size.
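
As a worked example of the IU/DRAM cost point above: a conventional FTL
keeps one mapping entry per IU, so the mapping table shrinks linearly as
the IU grows. Assuming a flat map with 4-byte entries on a hypothetical
8 TiB drive (illustrative numbers, not figures from the slides):

/* Rough FTL mapping-table sizing: one 4-byte entry per indirection
 * unit over the whole drive (illustrative assumption). */
#include <stdio.h>

int main(void)
{
	const unsigned long long capacity = 8ULL << 40;	/* 8 TiB of NAND */
	const unsigned long iu[] = { 4096, 16384, 65536 };

	for (int i = 0; i < 3; i++) {
		unsigned long long entries = capacity / iu[i];
		printf("IU %6lu: ~%5llu MiB of mapping-table DRAM\n",
		       iu[i], entries * 4 >> 20);
	}
	return 0;
}

Going from a 4 KiB to a 16 KiB IU cuts the map from ~8 GiB to ~2 GiB on
this hypothetical drive, which is the kind of saving the slides allude to.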

If the goal is to reduce DRAM costs then I recommend that SSD
manufacturers implement zoned storage (ZNS) instead of only increasing
the logical block size. A big advantage of zoned storage is that the
DRAM cost is reduced significantly even if the block size is not
increased.

Are there any applications that benefit from a block size larger than
64 KiB? If not, why increase BLK_MAX_BLOCK_SIZE further? Do you agree
that this question should be answered in the patch description?

Thanks,

Bart.
Matthew Wilcox March 24, 2025, 3:02 p.m. UTC | #8
On Mon, Mar 24, 2025 at 06:58:26AM -0400, Bart Van Assche wrote:
> If the goal is to reduce DRAM costs then I recommend that SSD
> manufacturers implement zoned storage (ZNS) instead of only increasing
> the logical block size. A big advantage of zoned storage is that the
> DRAM cost is reduced significantly even if the block size is not
> increased.
> 
> Are there any applications that benefit from a block size larger than
> 64 KiB? If not, why increase BLK_MAX_BLOCK_SIZE further? Do you agree
> that this question should be answered in the patch description?

Do I agree that we should use the commit message to enter into a
philosophical debate about whether ZNS or large block sizes are better?
No, I do not.  I don't even think we should have this discussion
any more on this mailing list; I think everyone is aware that both
alternatives exist.  You don't like it, and that's your prerogative.
But at some point you have to stop being an awkward cuss about it.

I think CXL is an abomination; I've made this point often enough that
everybody is aware of it.  I don't make it any more.  All I do is NACK
the inclusion of patches that are only for the benefit of CXL until
CXL has actually demonstrated its utility.

Patch

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1c0cf6af392c..9e1b3e7526d9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -26,6 +26,7 @@ 
 #include <linux/xarray.h>
 #include <linux/file.h>
 #include <linux/lockdep.h>
+#include <linux/pagemap.h>
 
 struct module;
 struct request_queue;
@@ -268,11 +269,7 @@  static inline dev_t disk_devt(struct gendisk *disk)
 	return MKDEV(disk->major, disk->first_minor);
 }
 
-/*
- * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
- * however we constrain this to what we can validate and test.
- */
-#define BLK_MAX_BLOCK_SIZE      SZ_64K
+#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
 
 /* blk_validate_limits() validates bsize, so drivers don't usually need to */
 static inline int blk_validate_block_size(unsigned long bsize)
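
One nit on the patch as posted: the new macro body is unparenthesized.
It happens to parse as intended inside blk_validate_block_size(), because
<< binds tighter than > and ||, but it is fragile at other use sites.
A minimal sketch of the hazard; the blk_validate_block_size() body is
quoted from upstream blkdev.h for context, the rest is illustrative:

/* As posted, the macro body is unparenthesized: */
#define BLK_MAX_BLOCK_SIZE      1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)

/*
 * Inside the existing helper this still binds as intended, since the
 * shift is evaluated before the comparison and the logical OR:
 */
static inline int blk_validate_block_size(unsigned long bsize)
{
	if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
		return -EINVAL;
	return 0;
}

/*
 * But an expression such as
 *
 *	bsize > BLK_MAX_BLOCK_SIZE / 2
 *
 * expands to "bsize > 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER) / 2" and,
 * because / binds tighter than <<, halves the shift count rather than
 * the size. The conventional fix is to parenthesize the whole body:
 */
#define BLK_MAX_BLOCK_SIZE      (1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER))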