diff mbox series

[3/5] brd: make sector size configurable

Message ID 20230306120127.21375-4-hare@suse.de (mailing list archive)
State New, archived
Headers show
Series brd: Allow to change block sizes | expand

Commit Message

Hannes Reinecke March 6, 2023, 12:01 p.m. UTC
Add a module option 'rd_blksize' to allow the user to change
the sector size of the RAM disks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

Comments

Luis Chamberlain March 9, 2023, 3:12 a.m. UTC | #1
On Mon, Mar 06, 2023 at 01:01:25PM +0100, Hannes Reinecke wrote:
> Add a module option 'rd_blksize' to allow the user to change
> the sector size of the RAM disks.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/brd.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index fc41ea641dfb..11bac3c3f1b6 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -46,6 +46,8 @@ struct brd_device {
>  	spinlock_t		brd_lock;
>  	struct radix_tree_root	brd_folios;
>  	u64			brd_nr_folios;
> +	unsigned int		brd_sector_shift;
> +	unsigned int		brd_sector_size;
>  };

Why not just do this first and initialize this to the defaults set
without the module parameter. Then you don't need to declare over
and over unsigned int rd_sectors, etc in tons of routines and just
pass the brd which most routines get.

Then most of this and the prior patch can be squeezed into one.

The functional changes would be the addition of the module parameter.

> @@ -410,7 +418,7 @@ static int brd_alloc(int i)
>  	 *  otherwise fdisk will align on 1M. Regardless this call
>  	 *  is harmless)
>  	 */
> -	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
> +	blk_queue_physical_block_size(disk->queue, rd_blksize);

And this was added for DAX support, but DAX support was long removed
from brd see commit 7a862fbbdec6 ("brd: remove dax support"). This
nugget was added by Boaz via commit c8fa31730fc72a8 ("brd: Request from
fdisk 4k alignment"), but if we don't support DAX, can't we just kill
that line?

Current doc for blk_queue_physical_block_size() says:

*   This should be set to the lowest possible sector size that the             
*   hardware can operate on without reverting to read-modify-write             
*   operations.  

But since we're working directly with RAM, do we care?

The comment above that line referring to direct_access should be killed
at the very least.

While you're at it, can you then also update Documentation/filesystems/dax.rst
to remove the brd as an example driver with DAX support.

That leaves us with only a few block drivers with DAX:

- dcssblk: s390 dcss block device driver                                        
- pmem: NVDIMM persistent memory driver  
- some dm targets
- fuse virtio_fs

Wonder which is the right example these days for DAX, now that
persistant memory has End of Life'd with Optane dead, curious also of
the value of the above and DAX in general other than support for
whatever made it out.

Should we EOL DAX too?

  Luis
Luis Chamberlain March 20, 2023, 10:52 p.m. UTC | #2
On Wed, Mar 08, 2023 at 07:12:40PM -0800, Luis Chamberlain wrote:
> On Mon, Mar 06, 2023 at 01:01:25PM +0100, Hannes Reinecke wrote:
> > Add a module option 'rd_blksize' to allow the user to change
> > the sector size of the RAM disks.
> > 
> > Signed-off-by: Hannes Reinecke <hare@suse.de>
> > ---
> >  drivers/block/brd.c | 30 +++++++++++++++++++-----------
> >  1 file changed, 19 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> > index fc41ea641dfb..11bac3c3f1b6 100644
> > --- a/drivers/block/brd.c
> > +++ b/drivers/block/brd.c
> > @@ -46,6 +46,8 @@ struct brd_device {
> >  	spinlock_t		brd_lock;
> >  	struct radix_tree_root	brd_folios;
> >  	u64			brd_nr_folios;
> > +	unsigned int		brd_sector_shift;
> > +	unsigned int		brd_sector_size;
> >  };
> 
> Why not just do this first and initialize this to the defaults set
> without the module parameter. Then you don't need to declare over
> and over unsigned int rd_sectors, etc in tons of routines and just
> pass the brd which most routines get.
> 
> Then most of this and the prior patch can be squeezed into one.
> 
> The functional changes would be the addition of the module parameter.
> 
> > @@ -410,7 +418,7 @@ static int brd_alloc(int i)
> >  	 *  otherwise fdisk will align on 1M. Regardless this call
> >  	 *  is harmless)
> >  	 */
> > -	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
> > +	blk_queue_physical_block_size(disk->queue, rd_blksize);
> 
> And this was added for DAX support, but DAX support was long removed
> from brd see commit 7a862fbbdec6 ("brd: remove dax support"). This
> nugget was added by Boaz via commit c8fa31730fc72a8 ("brd: Request from
> fdisk 4k alignment"), but if we don't support DAX, can't we just kill
> that line?
> 
> Current doc for blk_queue_physical_block_size() says:
> 
> *   This should be set to the lowest possible sector size that the             
> *   hardware can operate on without reverting to read-modify-write             
> *   operations.  
> 
> But since we're working directly with RAM, do we care?
> 
> The comment above that line referring to direct_access should be killed
> at the very least.

I was curious whether or not Martin's comments about "mkfs already
considers the reported queue limits (for the filesystems most people use,
anyway)" applies to say XFS [0] and whether or not the above helps guide
mkfs that way, so I took a look now.

[0] https://lore.kernel.org/all/yq1ttytbox9.fsf@ca-mkp.ca.oracle.com/

There's a default struct mkfs_default_params which sets the blocksize
to 1 << XFS_DFL_BLOCKSIZE_LOG and so at least validate_blocksize() will
use that if no parameter was passed on the cli. We also set the default
sector size to XFS_MIN_SECTORSIZE.

There are two aspects to what we use for the final mkfs, first we use
validate_blocksize(), then validate_sectorsize(). The first will set
the default block size to  1 << XFS_DFL_BLOCKSIZE_LOG (4096 bytes) if
the user did not specify anything. Then validate_sectorsize() uses
get_topology() to get the physical / logical block sizes and both:

  * blkid_topology_get_minimum_io_size()
  * blkid_topology_get_optimal_io_size()

These come from utilil-linux:

https://github.com/util-linux/util-linux.git

https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-block

So that's just

/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size

Then:

/sys/block/<disk>/queue/minimum_io_size

The documentation suggests " For disk drives this is often the physical block size."

/sys/block/<disk>/queue/optimal_io_size

The documentation suggests this is "This is rarely reported for disk drives."

From my review of xfs's mkfs is we essentially use the physical block
size as a default sector size if set, otherwise we use the device's logical block
size if set otherwise xfsprog's default and so 4096.

And so the above comment referring to DAX could be killed and replaced
with allowing userspace mkfs utils to use a more appropriate block size.
Or just kill the comment all together and we update the docs for
blk_queue_physical_block_size() instead.

> While you're at it, can you then also update Documentation/filesystems/dax.rst
> to remove the brd as an example driver with DAX support.
> 
> That leaves us with only a few block drivers with DAX:
> 
> - dcssblk: s390 dcss block device driver                                        
> - pmem: NVDIMM persistent memory driver  
> - some dm targets
> - fuse virtio_fs
> 
> Wonder which is the right example these days for DAX, now that
> persistant memory has End of Life'd with Optane dead, curious also of
> the value of the above and DAX in general other than support for
> whatever made it out.

Looking at this:

https://lore.kernel.org/all/62ef05515b085_1b3c29434@dwillia2-xfh.jf.intel.com.notmuch/

> Should we EOL DAX too?

I think the answer is no as a generic kernel thing, however, it probably
means we should really update the DAX documentation to reflect what our
real expecations are for its current and future uses. And that is to
highlight the non-PMEM use cases as *those* probably are EOL'd now.

  Luis
Martin K. Petersen March 20, 2023, 11:40 p.m. UTC | #3
Luis,

> /sys/block/<disk>/queue/minimum_io_size
>
> The documentation suggests " For disk drives this is often the
> physical block size."
>
> /sys/block/<disk>/queue/optimal_io_size
>
> The documentation suggests this is "This is rarely reported for disk
> drives."

min_io and opt_io are used to key mkfs.xfs' sunit/swidth. So if you're
using a hardware RAID, MD, or DM, we'll attempt to align allocations on
stripe boundaries.

Back when that "rarely reported" blurb was written (2009), we did not
have any individual disk drives which reported min_io/opt_io. Reporting
those parameters was mostly a storage array thing. These days it's
fairly common for both disk drives and SSDs to fill out these fields.

> From my review of xfs's mkfs is we essentially use the physical block
> size as a default sector size if set, otherwise we use the device's
> logical block size if set otherwise xfsprog's default and so 4096.

Yep.
Luis Chamberlain March 21, 2023, 12:14 a.m. UTC | #4
On Mon, Mar 20, 2023 at 07:40:44PM -0400, Martin K. Petersen wrote:
> 
> Luis,
> 
> > /sys/block/<disk>/queue/minimum_io_size
> >
> > The documentation suggests " For disk drives this is often the
> > physical block size."
> >
> > /sys/block/<disk>/queue/optimal_io_size
> >
> > The documentation suggests this is "This is rarely reported for disk
> > drives."
> 
> min_io and opt_io are used to key mkfs.xfs' sunit/swidth. So if you're
> using a hardware RAID, MD, or DM, we'll attempt to align allocations on
> stripe boundaries.
> 
> Back when that "rarely reported" blurb was written (2009), we did not
> have any individual disk drives which reported min_io/opt_io. Reporting
> those parameters was mostly a storage array thing. These days it's
> fairly common for both disk drives and SSDs to fill out these fields.

Glad you mentioned this, I followed up in my review of these and I see
even the names, swidth, sunit, are all "stripe" indicative. Based on
what you are saying, it seems we may need to update docs to reflect
actual / new uses.

> > From my review of xfs's mkfs is we essentially use the physical block
> > size as a default sector size if set, otherwise we use the device's
> > logical block size if set otherwise xfsprog's default and so 4096.
> 
> Yep.

The use case for swidth / sunit on mkfs.xfs seemed pretty tied to
striping, and it was not obvious or clear / if using it for hints could be used
today as perhaps intended. At least all the naming and validation stuff
seems to make it very "stripy" still.

  Luis
diff mbox series

Patch

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index fc41ea641dfb..11bac3c3f1b6 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -30,7 +30,7 @@ 
 /*
  * Each block ramdisk device has a radix_tree brd_folios of folios that stores
  * the folios containing the block device's contents. A brd folio's ->index is
- * its offset in PAGE_SIZE units. This is similar to, but in no way connected
+ * its offset in brd_blksize units. This is similar to, but in no way connected
  * with, the kernel's pagecache or buffer cache (which sit above our block
  * device).
  */
@@ -46,6 +46,8 @@  struct brd_device {
 	spinlock_t		brd_lock;
 	struct radix_tree_root	brd_folios;
 	u64			brd_nr_folios;
+	unsigned int		brd_sector_shift;
+	unsigned int		brd_sector_size;
 };
 
 /*
@@ -55,7 +57,7 @@  static struct folio *brd_lookup_folio(struct brd_device *brd, sector_t sector)
 {
 	pgoff_t idx;
 	struct folio *folio;
-	unsigned int rd_sector_shift = PAGE_SHIFT - SECTOR_SHIFT;
+	unsigned int rd_sector_shift = brd->brd_sector_shift - SECTOR_SHIFT;
 
 	/*
 	 * The folio lifetime is protected by the fact that we have opened the
@@ -85,8 +87,8 @@  static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
 	struct folio *folio;
-	unsigned int rd_sector_shift = PAGE_SHIFT - SECTOR_SHIFT;
-	unsigned int rd_sector_order = get_order(PAGE_SIZE);
+	unsigned int rd_sector_shift = brd->brd_sector_shift - SECTOR_SHIFT;
+	unsigned int rd_sector_order = get_order(brd->brd_sector_size);
 	int ret = 0;
 
 	folio = brd_lookup_folio(brd, sector);
@@ -170,9 +172,9 @@  static void brd_free_folios(struct brd_device *brd)
 static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 			     gfp_t gfp)
 {
-	unsigned int rd_sectors_shift = PAGE_SHIFT - SECTOR_SHIFT;
+	unsigned int rd_sectors_shift = brd->brd_sector_shift - SECTOR_SHIFT;
 	unsigned int rd_sectors = 1 << rd_sectors_shift;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = (sector & (rd_sectors - 1)) << SECTOR_SHIFT;
 	size_t copy;
 	int ret;
@@ -196,9 +198,9 @@  static void copy_to_brd(struct brd_device *brd, const void *src,
 {
 	struct folio *folio;
 	void *dst;
-	unsigned int rd_sectors_shift = PAGE_SHIFT - SECTOR_SHIFT;
+	unsigned int rd_sectors_shift = brd->brd_sector_shift - SECTOR_SHIFT;
 	unsigned int rd_sectors = 1 << rd_sectors_shift;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = (sector & (rd_sectors - 1)) << SECTOR_SHIFT;
 	size_t copy;
 
@@ -231,9 +233,9 @@  static void copy_from_brd(void *dst, struct brd_device *brd,
 {
 	struct folio *folio;
 	void *src;
-	unsigned int rd_sectors_shift = PAGE_SHIFT - SECTOR_SHIFT;
+	unsigned int rd_sectors_shift = brd->brd_sector_shift - SECTOR_SHIFT;
 	unsigned int rd_sectors = 1 << rd_sectors_shift;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = (sector & (rd_sectors - 1)) << SECTOR_SHIFT;
 	size_t copy;
 
@@ -346,6 +348,10 @@  static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
+static unsigned int rd_blksize = PAGE_SIZE;
+module_param(rd_blksize, uint, 0444);
+MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -382,6 +388,8 @@  static int brd_alloc(int i)
 		return -ENOMEM;
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
+	brd->brd_sector_shift = ilog2(rd_blksize);
+	brd->brd_sector_size = rd_blksize;
 
 	spin_lock_init(&brd->brd_lock);
 	INIT_RADIX_TREE(&brd->brd_folios, GFP_ATOMIC);
@@ -410,7 +418,7 @@  static int brd_alloc(int i)
 	 *  otherwise fdisk will align on 1M. Regardless this call
 	 *  is harmless)
 	 */
-	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
+	blk_queue_physical_block_size(disk->queue, rd_blksize);
 
 	/* Tell the block layer that this is not a rotational device */
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);