Message ID: 20200329174714.32416-1-chaitanya.kulkarni@wdc.com (mailing list archive)
Series: block: Add support for REQ_OP_ASSIGN_RANGE
Chaitanya,

> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
> to be used for forwarding user's fallocate(0) requests into
> block device internals.

s/assign_range/allocate/g
On 3/31/20 7:31 PM, Martin K. Petersen wrote:
> Chaitanya,
>
>> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going
>> to be used for forwarding user's fallocate(0) requests into
>> block device internals.
>
> s/assign_range/allocate/g

Okay, will send out the V2.
On 29/03/2020 20.47, Chaitanya Kulkarni wrote:
> Hi,
>
> This patch-series is based on the original RFC patch series:
> https://www.spinics.net/lists/linux-block/msg47933.html.
>
> I've designed a rough testcase based on the information present in the mailing list archive for the original RFC; it may need some corrections from the author.
>
> If anyone is interested, test results are at the end of this patch.
>
> Following is the original cover-letter :-
>
> Information about contiguous extent placement may be useful for some block devices. Say, distributed network filesystems, which provide a block device interface, may use this information for better block placement over the nodes in their cluster, and for better performance. Block devices which map a file on another filesystem (loop) may request the same length extent on the underlying filesystem for less fragmentation and for batching allocation requests. Also, hypervisors like QEMU may use this information for optimization of cluster allocations.
>
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going to be used for forwarding user's fallocate(0) requests into block device internals. It is rather similar to the existing REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding exported primitive is called blkdev_issue_assign_range().

What are the exact semantics of that? May/must it preserve the present data, may/must it discard it, or may it fill the range with random garbage?

Obviously I prefer the weakest option: it may discard data, may return garbage, or may do nothing. I.e. the lower layer could reuse blocks without zeroing; for encrypted storage this is even safe. So this would work as a third type of discard, in addition to REQ_OP_DISCARD and REQ_OP_SECURE_ERASE.

> See [1/3] for the details.
>
> Patch [2/3] teaches the loop driver to handle REQ_OP_ASSIGN_RANGE requests by calling fallocate(0).
>
> Patch [3/3] makes ext4 notify the block device about fallocate(0).
>
> Here is a simple test I did:
> https://gist.github.com/tkhai/5b788651cdb74c1dbff3500745878856
>
> I attached a file on ext4 to loop. Then I created an ext4 partition on the loop device and started the test in the partition. Direct-io is enabled on loop.
>
> The test fallocates a 4G file and writes from some offset with a given step, then it chooses another offset and repeats. After the test all the blocks in the file have been written.
>
> The results show that batching extent-assigning requests improves the performance:
>
> Before patchset: real ~ 1min 27sec
> After patchset:  real ~ 1min 16sec (18% better)
>
> An ordinary fallocate() before writes improves the performance by batching the requests. These results just show that the same holds when forwarding extent information to the underlying filesystem.
>
> Regards,
> Chaitanya
>
> Changes from RFC:-
>
> 1. Add missing plumbing for REQ_OP_ASSIGN_RANGE similar to write-zeroes.
> 2. Add a prep patch to create a helper to submit payloadless bios.
> 3. Design testcases around the description present in the cover-letter.
>
> Chaitanya Kulkarni (1):
>   block: create payloadless issue bio helper
>
> Kirill Tkhai (3):
>   block: Add support for REQ_OP_ASSIGN_RANGE
>   loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
>   ext4: Notify block device about alloc-assigned blk
>
>  block/blk-core.c          |   5 ++
>  block/blk-lib.c           | 115 +++++++++++++++++++++++++++++++-------
>  block/blk-merge.c         |  21 +++++++
>  block/blk-settings.c      |  19 +++++++
>  block/blk-zoned.c         |   1 +
>  block/bounce.c            |   1 +
>  drivers/block/loop.c      |   5 ++
>  fs/ext4/ext4.h            |   2 +
>  fs/ext4/extents.c         |  12 +++-
>  include/linux/bio.h       |   9 ++-
>  include/linux/blk_types.h |   2 +
>  include/linux/blkdev.h    |  34 +++++++++++
>  12 files changed, 201 insertions(+), 25 deletions(-)
>
> 1. Setup :-
> -----------
> # git log --oneline -5
> c64a4c781915 (HEAD -> req-op-assign-range) ext4: Notify block device about alloc-assigned blk
> 000cbc6720a4 loop: Forward REQ_OP_ASSIGN_RANGE into fallocate(0)
> 89ceed8cac80 block: Add support for REQ_OP_ASSIGN_RANGE
> a798743e87e7 block: create payloadless issue bio helper
> b53df2e7442c (tag: block-5.6-2020-03-13) block: Fix partition support for host aware zoned block devices
>
> # cat /proc/kallsyms | grep -i blkdev_issue_assign_range
> ffffffffa3264a80 T blkdev_issue_assign_range
> ffffffffa4027184 r __ksymtab_blkdev_issue_assign_range
> ffffffffa40524be r __kstrtabns_blkdev_issue_assign_range
> ffffffffa405a8eb r __kstrtab_blkdev_issue_assign_range
>
> 2. Test program, will be moved to blktest once code is upstream :-
> ------------------------------------------------------------------
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <errno.h>
>
> #define BLOCK_SIZE 4096
> #define STEP (BLOCK_SIZE * 16)
> #define SIZE (1024 * 1024 * 1024ULL)
>
> int main(int argc, char *argv[])
> {
>         int fd, step, ret = 0;
>         unsigned long i;
>         void *buf;
>
>         if (posix_memalign(&buf, BLOCK_SIZE, SIZE)) {
>                 perror("alloc");
>                 exit(1);
>         }
>
>         fd = open("/mnt/loop0/file.img", O_RDWR | O_CREAT | O_DIRECT);
>         if (fd < 0) {
>                 perror("open");
>                 exit(1);
>         }
>
>         if (ftruncate(fd, SIZE)) {
>                 perror("ftruncate");
>                 exit(1);
>         }
>
>         ret = fallocate(fd, 0, 0, SIZE);
>         if (ret) {
>                 perror("fallocate");
>                 exit(1);
>         }
>
>         for (step = STEP - BLOCK_SIZE; step >= 0; step -= BLOCK_SIZE) {
>                 printf("step=%u\n", step);
>                 for (i = step; i < SIZE; i += STEP) {
>                         errno = 0;
>                         if (pwrite(fd, buf, BLOCK_SIZE, i) != BLOCK_SIZE) {
>                                 perror("pwrite");
>                                 exit(1);
>                         }
>                 }
>
>                 if (fsync(fd)) {
>                         perror("fsync");
>                         exit(1);
>                 }
>         }
>         return 0;
> }
>
> 3. Test script, will be moved to blktests once code is upstream :-
> ------------------------------------------------------------------
> # cat req_op_assign_test.sh
> #!/bin/bash -x
>
> NULLB_FILE="/mnt/backend/data"
> NULLB_MNT="/mnt/backend"
> LOOP_MNT="/mnt/loop0"
>
> delete_loop()
> {
>         umount ${LOOP_MNT}
>         losetup -D
>         sleep 3
> }
>
> delete_nullb()
> {
>         umount ${NULLB_MNT}
>         echo 1 > config/nullb/nullb0/power
>         rmdir config/nullb/nullb0
>         sleep 3
> }
>
> unload_modules()
> {
>         rmmod drivers/block/loop.ko
>         rmmod fs/ext4/ext4.ko
>         rmmod drivers/block/null_blk.ko
>         lsmod | grep -e ext4 -e loop -e null_blk
> }
>
> unload()
> {
>         delete_loop
>         delete_nullb
>         unload_modules
> }
>
> load_ext4()
> {
>         make -j $(nproc) M=fs/ext4 modules
>         local src=fs/ext4/
>         local dest=/lib/modules/`uname -r`/kernel/fs/ext4
>         \cp ${src}/ext4.ko ${dest}/
>
>         modprobe mbcache
>         modprobe jbd2
>         sleep 1
>         insmod fs/ext4/ext4.ko
>         sleep 1
> }
>
> load_nullb()
> {
>         local src=drivers/block/
>         local dest=/lib/modules/`uname -r`/kernel/drivers/block
>         \cp ${src}/null_blk.ko ${dest}/
>
>         modprobe null_blk nr_devices=0
>         sleep 1
>
>         mkdir config/nullb/nullb0
>         tree config/nullb/nullb0
>
>         echo 1 > config/nullb/nullb0/memory_backed
>         echo 512 > config/nullb/nullb0/blocksize
>
>         # 20 GB
>         echo 20480 > config/nullb/nullb0/size
>         echo 1 > config/nullb/nullb0/power
>         sleep 2
>         IDX=`cat config/nullb/nullb0/index`
>         lsblk | grep null${IDX}
>         sleep 1
>
>         mkfs.ext4 /dev/nullb0
>         mount /dev/nullb0 ${NULLB_MNT}
>         sleep 1
>         mount | grep nullb
>
>         # 10 GB
>         dd if=/dev/zero of=${NULLB_FILE} count=2621440 bs=4096
> }
>
> load_loop()
> {
>         local src=drivers/block/
>         local dest=/lib/modules/`uname -r`/kernel/drivers/block
>         \cp ${src}/loop.ko ${dest}/
>
>         insmod drivers/block/loop.ko max_loop=1
>         sleep 3
>         /root/util-linux/losetup --direct-io=off /dev/loop0 ${NULLB_FILE}
>         sleep 3
>         /root/util-linux/losetup
>         ls -l /dev/loop*
>         dmesg -c
>         mkfs.ext4 /dev/loop0
>         mount /dev/loop0 ${LOOP_MNT}
>         mount | grep loop0
> }
>
> load()
> {
>         make -j $(nproc) M=drivers/block modules
>
>         load_ext4
>         load_nullb
>         load_loop
>         sleep 1
>         sync
>         sync
>         sync
> }
>
> unload
> load
> time ./test
>
> 4.
Test Results :- > ------------------ > > # ./req_op_assign_test.sh > + NULLB_FILE=/mnt/backend/data > + NULLB_MNT=/mnt/backend > + LOOP_MNT=/mnt/loop0 > + unload > + delete_loop > + umount /mnt/loop0 > + losetup -D > + sleep 3 > + delete_nullb > + umount /mnt/backend > + echo 1 > + rmdir config/nullb/nullb0 > + sleep 3 > + unload_modules > + rmmod drivers/block/loop.ko > + rmmod fs/ext4/ext4.ko > + rmmod drivers/block/null_blk.ko > + lsmod > + grep -e ext4 -e loop -e null_blk > + load > ++ nproc > + make -j 32 M=drivers/block modules > CC [M] drivers/block/loop.o > MODPOST 11 modules > CC [M] drivers/block/loop.mod.o > LD [M] drivers/block/loop.ko > + load_ext4 > ++ nproc > + make -j 32 M=fs/ext4 modules > CC [M] fs/ext4/balloc.o > CC [M] fs/ext4/bitmap.o > CC [M] fs/ext4/block_validity.o > CC [M] fs/ext4/dir.o > CC [M] fs/ext4/ext4_jbd2.o > CC [M] fs/ext4/extents.o > CC [M] fs/ext4/extents_status.o > CC [M] fs/ext4/file.o > CC [M] fs/ext4/fsmap.o > CC [M] fs/ext4/fsync.o > CC [M] fs/ext4/hash.o > CC [M] fs/ext4/ialloc.o > CC [M] fs/ext4/indirect.o > CC [M] fs/ext4/inline.o > CC [M] fs/ext4/inode.o > CC [M] fs/ext4/ioctl.o > CC [M] fs/ext4/mballoc.o > CC [M] fs/ext4/migrate.o > CC [M] fs/ext4/mmp.o > CC [M] fs/ext4/move_extent.o > CC [M] fs/ext4/namei.o > CC [M] fs/ext4/page-io.o > CC [M] fs/ext4/readpage.o > CC [M] fs/ext4/resize.o > CC [M] fs/ext4/super.o > CC [M] fs/ext4/symlink.o > CC [M] fs/ext4/sysfs.o > CC [M] fs/ext4/xattr.o > CC [M] fs/ext4/xattr_trusted.o > CC [M] fs/ext4/xattr_user.o > CC [M] fs/ext4/acl.o > CC [M] fs/ext4/xattr_security.o > LD [M] fs/ext4/ext4.o > MODPOST 1 modules > LD [M] fs/ext4/ext4.ko > + local src=fs/ext4/ > ++ uname -r > + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4 > + cp fs/ext4//ext4.ko /lib/modules/5.6.0-rc3lbk+/kernel/fs/ext4/ > + modprobe mbcache > + modprobe jbd2 > + sleep 1 > + insmod fs/ext4/ext4.ko > + sleep 1 > + load_nullb > + local src=drivers/block/ > ++ uname -r > + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block > + cp drivers/block//null_blk.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/ > + modprobe null_blk nr_devices=0 > + sleep 1 > + mkdir config/nullb/nullb0 > + tree config/nullb/nullb0 > config/nullb/nullb0 > ├── badblocks > ├── blocking > ├── blocksize > ├── cache_size > ├── completion_nsec > ├── discard > ├── home_node > ├── hw_queue_depth > ├── index > ├── irqmode > ├── mbps > ├── memory_backed > ├── power > ├── queue_mode > ├── size > ├── submit_queues > ├── use_per_node_hctx > ├── zoned > ├── zone_nr_conv > └── zone_size > > 0 directories, 20 files > + echo 1 > + echo 512 > + echo 20480 > + echo 1 > + sleep 2 > ++ cat config/nullb/nullb0/index > + IDX=0 > + lsblk > + grep null0 > + sleep 1 > + mkfs.ext4 /dev/nullb0 > mke2fs 1.42.9 (28-Dec-2013) > Filesystem label= > OS type: Linux > Block size=4096 (log=2) > Fragment size=4096 (log=2) > Stride=0 blocks, Stripe width=0 blocks > 1310720 inodes, 5242880 blocks > 262144 blocks (5.00%) reserved for the super user > First data block=0 > Maximum filesystem blocks=2153775104 > 160 block groups > 32768 blocks per group, 32768 fragments per group > 8192 inodes per group > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000 > > Allocating group tables: done > Writing inode tables: done > Creating journal (32768 blocks): done > Writing superblocks and filesystem accounting information: done > > + mount /dev/nullb0 /mnt/backend > + sleep 1 > + mount > + grep nullb > /dev/nullb0 on /mnt/backend 
type ext4 (rw,relatime,seclabel) > + dd if=/dev/zero of=/mnt/backend/data count=2621440 bs=4096 > 2621440+0 records in > 2621440+0 records out > 10737418240 bytes (11 GB) copied, 27.4579 s, 391 MB/s > + load_loop > + local src=drivers/block/ > ++ uname -r > + local dest=/lib/modules/5.6.0-rc3lbk+/kernel/drivers/block > + cp drivers/block//loop.ko /lib/modules/5.6.0-rc3lbk+/kernel/drivers/block/ > + insmod drivers/block/loop.ko max_loop=1 > + sleep 3 > + /root/util-linux/losetup --direct-io=off /dev/loop0 /mnt/backend/data > + sleep 3 > + /root/util-linux/losetup > NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC > /dev/loop0 0 0 0 0 /mnt/backend/data 0 512 > + ls -l /dev/loop0 /dev/loop-control > brw-rw----. 1 root disk 7, 0 Mar 29 10:28 /dev/loop0 > crw-rw----. 1 root disk 10, 237 Mar 29 10:28 /dev/loop-control > + dmesg -c > [42963.967060] null_blk: module loaded > [42968.419481] EXT4-fs (nullb0): mounted filesystem with ordered data mode. Opts: (null) > [42996.928141] loop: module loaded > + mkfs.ext4 /dev/loop0 > mke2fs 1.42.9 (28-Dec-2013) > Discarding device blocks: done > Filesystem label= > OS type: Linux > Block size=4096 (log=2) > Fragment size=4096 (log=2) > Stride=0 blocks, Stripe width=0 blocks > 655360 inodes, 2621440 blocks > 131072 blocks (5.00%) reserved for the super user > First data block=0 > Maximum filesystem blocks=2151677952 > 80 block groups > 32768 blocks per group, 32768 fragments per group > 8192 inodes per group > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632 > > Allocating group tables: done > Writing inode tables: done > Creating journal (32768 blocks): done > Writing superblocks and filesystem accounting information: done > > + mount /dev/loop0 /mnt/loop0 > + mount > + grep loop0 > /dev/loop0 on /mnt/loop0 type ext4 (rw,relatime,seclabel) > + sleep 1 > + sync > + sync > + sync > + ./test > step=61440 > step=57344 > step=53248 > step=49152 > step=45056 > step=40960 > step=36864 > step=32768 > step=28672 > step=24576 > step=20480 > step=16384 > step=12288 > step=8192 > step=4096 > step=0 > > real 9m34.472s > user 0m0.062s > sys 0m5.783s >
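Patch [2/3] is only summarized above: the loop driver forwards the new operation to fallocate(0) on the backing file. As a rough illustration of that idea (not the actual patch; the handler name lo_req_allocate and its placement are assumptions), the hook could look much like loop's existing discard path, which already calls the backing file's ->fallocate():

/*
 * Illustrative sketch only -- mirrors how loop implements discard via
 * the backing file's ->fallocate(), but with mode 0 so that blocks are
 * allocated rather than punched out.  Names are hypothetical.
 */
static int lo_req_allocate(struct loop_device *lo, struct request *rq,
                           loff_t pos)
{
        struct file *file = lo->lo_backing_file;
        int ret;

        if (!file->f_op->fallocate)
                return -EOPNOTSUPP;

        /* mode 0: allocate the range, keep any existing data */
        ret = file->f_op->fallocate(file, 0, pos, blk_rq_bytes(rq));
        if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
                ret = -EIO;
        return ret;
}

The error mapping follows the convention loop already uses for its discard and write-zeroes fallocate calls.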
Konstantin,

>> The corresponding exported primitive is called blkdev_issue_assign_range().
>
> What are the exact semantics of that?

REQ_OP_ALLOCATE will be used to compel a device to allocate a block range. What a given block contains after successful allocation is undefined (depends on the device implementation).

For block allocation with deterministic zeroing, one must keep using REQ_OP_WRITE_ZEROES with the NOUNMAP flag set.
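In other words, callers would have two primitives with different guarantees. A minimal sketch of how a caller might choose between them, assuming the exported helper ends up named blkdev_issue_allocate() after the rename (blkdev_issue_zeroout() and BLKDEV_ZERO_NOUNMAP are existing block layer interfaces; the allocate helper is the proposed, not-yet-merged one):

/*
 * Illustrative only: reserve LBAs when contents may be undefined,
 * fall back to deterministic zeroing when the caller needs zeroes.
 */
static int reserve_range(struct block_device *bdev, sector_t sector,
                         sector_t nr_sects, bool need_zeroes)
{
        if (need_zeroes)
                /* Zeroed, space-allocating write: existing primitive. */
                return blkdev_issue_zeroout(bdev, sector, nr_sects,
                                            GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);

        /* Proposed primitive: contents after allocation are undefined. */
        return blkdev_issue_allocate(bdev, sector, nr_sects, GFP_KERNEL);
}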
On 02/04/2020 05.29, Martin K. Petersen wrote:
> Konstantin,
>
>>> The corresponding exported primitive is called blkdev_issue_assign_range().
>>
>> What are the exact semantics of that?
>
> REQ_OP_ALLOCATE will be used to compel a device to allocate a block range. What a given block contains after successful allocation is undefined (depends on the device implementation).

Ok. Then REQ_OP_ALLOCATE should be accounted as a discard rather than a write. That's decided by the helper op_is_discard(), which is used only by statistics. It seems REQ_OP_SECURE_ERASE should also be accounted in this way.

> For block allocation with deterministic zeroing, one must keep using REQ_OP_WRITE_ZEROES with the NOUNMAP flag set.
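For reference, op_is_discard() in include/linux/blk_types.h currently matches only REQ_OP_DISCARD; the accounting change being suggested might look roughly like this (REQ_OP_ALLOCATE stands in for the proposed opcode and does not exist upstream yet):

/*
 * Sketch of the suggested change: count the new op (and secure erase)
 * as discard for the purpose of I/O statistics.
 */
static inline bool op_is_discard(unsigned int op)
{
        switch (op & REQ_OP_MASK) {
        case REQ_OP_DISCARD:
        case REQ_OP_SECURE_ERASE:
        case REQ_OP_ALLOCATE:           /* proposed opcode */
                return true;
        default:
                return false;
        }
}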
On Sun, Mar 29, 2020 at 10:47:10AM -0700, Chaitanya Kulkarni wrote:
> Hi,
>
> This patch-series is based on the original RFC patch series:
> https://www.spinics.net/lists/linux-block/msg47933.html.
>
> I've designed a rough testcase based on the information present in the mailing list archive for the original RFC; it may need some corrections from the author.
>
> If anyone is interested, test results are at the end of this patch.
>
> Following is the original cover-letter :-
>
> Information about contiguous extent placement may be useful for some block devices. Say, distributed network filesystems, which provide a block device interface, may use this information for better block placement over the nodes in their cluster, and for better performance. Block devices which map a file on another filesystem (loop) may request the same length extent on the underlying filesystem for less fragmentation and for batching allocation requests. Also, hypervisors like QEMU may use this information for optimization of cluster allocations.
>
> This patchset introduces REQ_OP_ASSIGN_RANGE, which is going to be used for forwarding user's fallocate(0) requests into block device internals. It is rather similar to the existing REQ_OP_DISCARD, REQ_OP_WRITE_ZEROES, etc. The corresponding exported primitive is called blkdev_issue_assign_range(). See [1/3] for the details.
>
> Patch [2/3] teaches the loop driver to handle REQ_OP_ASSIGN_RANGE requests by calling fallocate(0).
>
> Patch [3/3] makes ext4 notify the block device about fallocate(0).

Ok, so ext4 has a very limited max allocation size for an extent, so I expect this won't cause huge latency problems. However, what happens when we use XFS, have a 64kB block size, and fallocate() is allocating disk space in contiguous 100GB extents and passing those down to the block device?

How does this get split by dm devices? Are raid stripes going to dice this into separate stripe-unit-sized bios, so that instead of single large requests we end up with hundreds or thousands of tiny allocation requests being issued?

I know that for the loop device, it is going to serialise all IO to the backing file while fallocate is run on it. Hence if you have concurrent IO running, any REQ_OP_ASSIGN_RANGE is going to cause a significant, measurable latency hit to all those IOs in flight.

How are we expecting hardware to behave here? Is this a queued command in the scsi/nvme/sata protocols? Or is this, for the moment, just a special snowflake that we can't actually use in production because the hardware just can't handle what we throw at it?

IOWs, what sort of latency issues is this operation going to cause on real hardware? Is this going to be like discard? i.e. where we end up not using it at all because so few devices actually handle the massive stream of operations the filesystem will end up sending the device(s) in the course of normal operations?

Cheers,

Dave.
Hi Dave!

> Ok, so ext4 has a very limited max allocation size for an extent, so I expect this won't cause huge latency problems. However, what happens when we use XFS, have a 64kB block size, and fallocate() is allocating disk space in contiguous 100GB extents and passing those down to the block device?

Depends on the device.

> How does this get split by dm devices? Are raid stripes going to dice this into separate stripe-unit-sized bios, so that instead of single large requests we end up with hundreds or thousands of tiny allocation requests being issued?

There is nothing special about this operation. It needs to be handled the same way as all other splits. I.e. ideally coalesced at the bottom of the stack so we can issue larger, contiguous commands to the hardware.

> How are we expecting hardware to behave here? Is this a queued command in the scsi/nvme/sata protocols? Or is this, for the moment, just a special snowflake that we can't actually use in production because the hardware just can't handle what we throw at it?

For now it's SCSI and queued. Only found in high-end thinly provisioned storage arrays and not in your average SSD.

The performance expectation for REQ_OP_ALLOCATE is that it is faster than a write to the same block range since the device potentially needs to do less work. I.e. the device simply needs to decrement the free space and mark the LBAs reserved in a map. It doesn't need to write all the blocks to zero them. If you want zeroed blocks, use REQ_OP_WRITE_ZEROES.

> IOWs, what sort of latency issues is this operation going to cause on real hardware? Is this going to be like discard? i.e. where we end up not using it at all because so few devices actually handle the massive stream of operations the filesystem will end up sending the device(s) in the course of normal operations?

The intended use case, from a SCSI perspective, is that on a thinly provisioned device you can use this operation to preallocate blocks so that future writes to the LBAs in question will not fail due to the device being out of space. I.e. you would use this to pin down block ranges where you cannot tolerate write failures. The advantage over writing the blocks individually is that dedup won't apply and that the device doesn't actually have to go and write all the individual blocks.
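For context, payloadless operations such as write-zeroes are already split in blk-merge.c purely against a per-queue sector limit, with coalescing happening when the resulting requests are merged lower in the stack. A sketch of what the analogous split for the new op might look like, assuming a queue limit field named max_allocate_sectors modeled on the existing max_write_zeroes_sectors (both the function and the field are assumptions, patterned on blk_bio_write_zeroes_split()):

/*
 * Sketch modeled on blk_bio_write_zeroes_split(): a payloadless bio is
 * split purely by the sector-count limit advertised by the queue.
 */
static struct bio *blk_bio_allocate_split(struct request_queue *q,
                                          struct bio *bio, unsigned *nsegs)
{
        *nsegs = 0;

        if (!q->limits.max_allocate_sectors)    /* assumed limit field */
                return NULL;

        if (bio_sectors(bio) <= q->limits.max_allocate_sectors)
                return NULL;

        return bio_split(bio, q->limits.max_allocate_sectors,
                         GFP_NOIO, &q->bio_split);
}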
On Thu, Apr 02, 2020 at 09:34:43PM -0400, Martin K. Petersen wrote:
> Hi Dave!
>
> > Ok, so ext4 has a very limited max allocation size for an extent, so I expect this won't cause huge latency problems. However, what happens when we use XFS, have a 64kB block size, and fallocate() is allocating disk space in contiguous 100GB extents and passing those down to the block device?
>
> Depends on the device.

Great. :(

> > How does this get split by dm devices? Are raid stripes going to dice this into separate stripe-unit-sized bios, so that instead of single large requests we end up with hundreds or thousands of tiny allocation requests being issued?
>
> There is nothing special about this operation. It needs to be handled the same way as all other splits. I.e. ideally coalesced at the bottom of the stack so we can issue larger, contiguous commands to the hardware.
>
> > How are we expecting hardware to behave here? Is this a queued command in the scsi/nvme/sata protocols? Or is this, for the moment, just a special snowflake that we can't actually use in production because the hardware just can't handle what we throw at it?
>
> For now it's SCSI and queued. Only found in high-end thinly provisioned storage arrays and not in your average SSD.

So it's a special snowflake :)

> The performance expectation for REQ_OP_ALLOCATE is that it is faster than a write to the same block range since the device potentially needs to do less work. I.e. the device simply needs to decrement the free space and mark the LBAs reserved in a map. It doesn't need to write all the blocks to zero them. If you want zeroed blocks, use REQ_OP_WRITE_ZEROES.

I suspect that the implications of wiring filesystems directly up to this haven't been thought through entirely....

> > IOWs, what sort of latency issues is this operation going to cause on real hardware? Is this going to be like discard? i.e. where we end up not using it at all because so few devices actually handle the massive stream of operations the filesystem will end up sending the device(s) in the course of normal operations?
>
> The intended use case, from a SCSI perspective, is that on a thinly provisioned device you can use this operation to preallocate blocks so that future writes to the LBAs in question will not fail due to the device being out of space. I.e. you would use this to pin down block ranges where you cannot tolerate write failures. The advantage over writing the blocks individually is that dedup won't apply and that the device doesn't actually have to go and write all the individual blocks.

.... because when backed by thinp storage, plumbing user level fallocate() straight through from the filesystem introduces a trivial, user level storage DOS vector....

i.e. a user can just fallocate a bunch of files and, because the filesystem can do that instantly, can also run the back end array out of space almost instantly. Storage admins are going to love this!

Cheers,

Dave.
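The scenario Dave describes is easy to picture from an unprivileged process; something like the sketch below (purely illustrative, with arbitrary file names and sizes) would complete almost instantly at the filesystem level while pinning large amounts of backing store if fallocate(0) is forwarded to a thinly provisioned device:

/* Illustration of the DOS vector being described: preallocate many
 * large files; each fallocate() is near-instant for the filesystem
 * but would pin real space on a thinly provisioned backend.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        long long size = 100LL << 30;           /* 100 GB per file */
        char name[64];
        int i, fd;

        for (i = 0; i < 1024; i++) {
                snprintf(name, sizeof(name), "pin-%d", i);
                fd = open(name, O_RDWR | O_CREAT, 0600);
                if (fd < 0) {
                        perror("open");
                        exit(1);
                }
                /* Near-instant for the fs; pins space on a thin backend. */
                if (fallocate(fd, 0, 0, size)) {
                        perror("fallocate");
                        exit(1);
                }
                close(fd);
        }
        return 0;
}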
Dave,

> .... because when backed by thinp storage, plumbing user level fallocate() straight through from the filesystem introduces a trivial, user level storage DOS vector....
>
> i.e. a user can just fallocate a bunch of files and, because the filesystem can do that instantly, can also run the back end array out of space almost instantly. Storage admins are going to love this!

In the standards space, the allocation concept was mainly aimed at protecting filesystem internals against out-of-space conditions on devices that dedup identical blocks and where simply zeroing the blocks therefore is ineffective.

So far we have mainly been talking about fallocate on block devices. How XFS decides to enforce space allocation policy and potentially leverage this plumbing is entirely up to you.
On Thu, Apr 02, 2020 at 08:45:30PM -0700, Martin K. Petersen wrote:
> Dave,
>
> > .... because when backed by thinp storage, plumbing user level fallocate() straight through from the filesystem introduces a trivial, user level storage DOS vector....
> >
> > i.e. a user can just fallocate a bunch of files and, because the filesystem can do that instantly, can also run the back end array out of space almost instantly. Storage admins are going to love this!
>
> In the standards space, the allocation concept was mainly aimed at protecting filesystem internals against out-of-space conditions on devices that dedup identical blocks and where simply zeroing the blocks therefore is ineffective.

Um, so we're supposed to use space allocation before overwriting existing metadata in the filesystem? So that the underlying storage can reserve space for it before we write it? Which would mean we have to issue a space allocation before we dirty the metadata, which means before we dirty any metadata in a transaction. Which means we'll basically have to redesign the filesystems from the ground up, yes?

> So far we have mainly been talking about fallocate on block devices.

You might be talking about filesystem metadata and block devices, but this patchset ends up connecting ext4's user data fallocate() to the block device, thereby allowing users to reserve space directly in the underlying block device and directly exposing this issue to userspace.

I can only go on what is presented to me in patches - this patchset has nothing to do with filesystem metadata nor with preventing ENOSPC issues with internal filesystem updates. XFS is no different to ext4 or btrfs here - the filesystem doesn't matter, because all of them can fallocate() terabytes of space in a second or two these days....

> How XFS decides to enforce space allocation policy and potentially leverage this plumbing is entirely up to you.

Do I understand this correctly? i.e. that it is the filesystem's responsibility to prevent users from preallocating more space than exists in an underlying storage pool that has been intentionally hidden from the filesystem so it can be underprovisioned?

IOWs, I'm struggling to understand exactly how the "standards space" thinks filesystems are supposed to be using this feature whilst also preventing unprivileged exhaustion of an underprovisioned storage pool they know nothing about.

Cheers,

Dave.
Hi Dave!

>> In the standards space, the allocation concept was mainly aimed at protecting filesystem internals against out-of-space conditions on devices that dedup identical blocks and where simply zeroing the blocks therefore is ineffective.
>
> Um, so we're supposed to use space allocation before overwriting existing metadata in the filesystem?

Not before overwriting, no. Once you have allocated an LBA it remains allocated until you discard it.

> So that the underlying storage can reserve space for it before we write it? Which would mean we have to issue a space allocation before we dirty the metadata, which means before we dirty any metadata in a transaction. Which means we'll basically have to redesign the filesystems from the ground up, yes?

My understanding is that this facility was aimed at filesystems that do not dynamically allocate metadata. The intent was that mkfs would preallocate the metadata LBA ranges, not the filesystem. For filesystems that allocate metadata dynamically, then yes, an additional step is required if you want to pin the LBAs.

> You might be talking about filesystem metadata and block devices, but this patchset ends up connecting ext4's user data fallocate() to the block device, thereby allowing users to reserve space directly in the underlying block device and directly exposing this issue to userspace.

I missed that Chaitanya's repost of this series included the ext4 patch. Sorry!

>> How XFS decides to enforce space allocation policy and potentially leverage this plumbing is entirely up to you.
>
> Do I understand this correctly? i.e. that it is the filesystem's responsibility to prevent users from preallocating more space than exists in an underlying storage pool that has been intentionally hidden from the filesystem so it can be underprovisioned?

No. But as an administrative policy it is useful to prevent runaway applications from writing a petabyte of random garbage to media. My point was that it is up to you and the other filesystem developers to decide how you want to leverage the low-level allocation capability and how you want to provide it to processes. And whether CAP_SYS_ADMIN, ulimit, or something else is the appropriate policy interface for this.

In terms of thin provisioning and space management there are various thresholds that may be reported by the device. In past discussions there hasn't been much interest in getting these exposed. It is also unclear to me whether it is actually beneficial to send low-space warnings to hundreds or thousands of hosts attached to an array. In many cases the individual server admins are not even the right audience. The most common notification mechanism is a message to the storage array admin saying "click here to buy more disk".

If you feel there is merit in having the kernel emit the threshold warnings you could use as a feedback mechanism, I can absolutely look into that.
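From userspace, the mkfs-style preallocation described here would amount to mode-0 fallocate(2) calls against the block device for each fixed metadata region. A hypothetical sketch, assuming mode-0 fallocate were wired up for block devices (which this series, as posted, does not do) and with purely illustrative range values:

/* Hypothetical mkfs-side sketch: pin the LBA ranges that will hold
 * static metadata by preallocating them on the block device.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_range(int fd, off_t start, off_t len)
{
        if (fallocate(fd, 0, start, len)) {     /* mode 0: allocate, keep data */
                perror("fallocate");
                exit(1);
        }
}

int main(int argc, char *argv[])
{
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
                exit(1);
        }
        fd = open(argv[1], O_RDWR);             /* e.g. /dev/sdX */
        if (fd < 0) {
                perror("open");
                exit(1);
        }
        pin_range(fd, 0, 1 << 20);              /* e.g. superblock area */
        /* ... one call per fixed metadata region ... */
        return 0;
}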
On Wed, Apr 08, 2020 at 12:10:12AM -0400, Martin K. Petersen wrote:
> Hi Dave!
>
> >> In the standards space, the allocation concept was mainly aimed at protecting filesystem internals against out-of-space conditions on devices that dedup identical blocks and where simply zeroing the blocks therefore is ineffective.
> >
> > Um, so we're supposed to use space allocation before overwriting existing metadata in the filesystem?
>
> Not before overwriting, no. Once you have allocated an LBA it remains allocated until you discard it.

That is not a consistent argument. If the data has been deduped and we overwrite, the storage array has to allocate new physical space for an overwrite to an existing LBA. i.e. deduped data has multiple LBAs pointing to the same physical storage. Any overwrite of an LBA that maps to multiply referenced physical storage requires the storage array to allocate new physical space for that overwrite.

i.e. allocation is not determined by whether the LBA has been written to, "pinned" or not - it's whether the act of writing to that LBA requires the storage to allocate new space to allow the write to proceed.

That's my point here - one particular shared-data overwrite case is being special-cased by preallocation (avoiding dedupe of zero-filled data) to prevent ENOSPC, ignoring all the other cases where we overwrite shared non-zero data and will also require new physical space for the new data. In all those cases, the storage has to take the same action - allocation on overwrite - and so all of them are susceptible to ENOSPC.

> > So that the underlying storage can reserve space for it before we write it? Which would mean we have to issue a space allocation before we dirty the metadata, which means before we dirty any metadata in a transaction. Which means we'll basically have to redesign the filesystems from the ground up, yes?
>
> My understanding is that this facility was aimed at filesystems that do not dynamically allocate metadata. The intent was that mkfs would preallocate the metadata LBA ranges, not the filesystem. For filesystems that allocate metadata dynamically, then yes, an additional step is required if you want to pin the LBAs.

Ok, so you are confirming what I thought: it's almost completely useless to us.

i.e. this requires issuing IO to "reserve" space whilst preserving data before every metadata object goes from clean to dirty in memory. But the problem with that is we don't know how much metadata we are going to dirty in any specific operation. Worse is that we don't know exactly *what* metadata we will modify until we walk structures and do lookups, which often happen after we've dirtied other structures. An ENOSPC from a space reservation at that point is fatal to the filesystem anyway, so there's no point in even trying to do this. Like I said, functionality like this cannot be retrofitted to existing filesystems.

IOWs, this is pretty much useless functionality for the filesystem layer, and if the only use is for some mythical filesystem with completely static metadata then the standards space really jumped the shark on this one....

> > You might be talking about filesystem metadata and block devices, but this patchset ends up connecting ext4's user data fallocate() to the block device, thereby allowing users to reserve space directly in the underlying block device and directly exposing this issue to userspace.
>
> I missed that Chaitanya's repost of this series included the ext4 patch. Sorry!
>
> >> How XFS decides to enforce space allocation policy and potentially leverage this plumbing is entirely up to you.
> >
> > Do I understand this correctly? i.e. that it is the filesystem's responsibility to prevent users from preallocating more space than exists in an underlying storage pool that has been intentionally hidden from the filesystem so it can be underprovisioned?
>
> No. But as an administrative policy it is useful to prevent runaway applications from writing a petabyte of random garbage to media. My point was that it is up to you and the other filesystem developers to decide how you want to leverage the low-level allocation capability and how you want to provide it to processes. And whether CAP_SYS_ADMIN, ulimit, or something else is the appropriate policy interface for this.

My cynical translation: the storage standards space hasn't given any thought to how this can be used and/or administered in the real world. Pass the buck - let the filesystem people work that out.

What I'm hearing is that this wasn't designed for typical filesystem use, it wasn't designed for typical user application use, and how to prevent abuse wasn't thought about at all. That sounds like a big fat NACK to me....

> In terms of thin provisioning and space management there are various thresholds that may be reported by the device. In past discussions there hasn't been much interest in getting these exposed. It is also unclear to me whether it is actually beneficial to send low-space warnings to hundreds or thousands of hosts attached to an array. In many cases the individual server admins are not even the right audience. The most common notification mechanism is a message to the storage array admin saying "click here to buy more disk".

Notifications are not relevant to preallocation functionality at all.

-Dave.
Dave,

>> Not before overwriting, no. Once you have allocated an LBA it remains allocated until you discard it.
>
> Ok, so you are confirming what I thought: it's almost completely useless to us.
>
> i.e. this requires issuing IO to "reserve" space whilst preserving data before every metadata object goes from clean to dirty in memory.

You can only reserve the space prior to writing a block for the first time. Once an LBA has been written ("Mapped" in the SCSI state machine), it remains allocated until it is explicitly deallocated (via a discard/Unmap operation).

This part of the SCSI spec was written eons ago under the assumption that once the physical resource backing a given LBA had been established, you could write the block over and over without having to allocate new space. This used to be true, but obviously the introduction of de-duplication blew a major hole in that.

I have been perusing the spec over and over trying to understand how block provisioning state transitions are defined when dedup is in the picture. However, much is left unexplained. As a result, I reached out to various folks, including the people who worked on this feature in the standards way back. And the response I got from them is that the allocation operation became irreparably broken when support for de-duplication was added to the spec. Nobody attempted to fix the state transitions since most vendors only cared about deallocation. Consequently, specifying the exact behavior of the allocation operation in the context of dedup fell by the wayside.

The recommendation I got was that we should not rely on this feature despite it being advertised as supported by the storage. I looked at whether it was feasible to support it on non-dedup devices only, but it does not look like it's worthwhile to pursue. And as a result there is no need for a block layer allocation operation to have parity with SCSI, although we may want to keep NVMe in mind when defining the semantics.