Message ID | 7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com (mailing list archive) |
---|---|
State | New |
Series | the dm-loop target |
On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > This is the dm-loop target - a replacement for the regular loop driver > with better performance. The dm-loop target builds a map of the file in > the constructor and it just remaps bios according to this map. Using ->bmap is broken and a no-go for new code. If you have any real performance issues with the loop driver, document and report them so that they can be fixed instead of working around them by duplicating the code (and in this case making the new code completely broken).
On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote: > dm-loop avoids the file system abstraction, thus gains the resulting > performance advantage versus the loop driver. How? > dm-loop obviously requires full provisioning, thus sparse files are being > detected and error handled. > > Additionally, block relocation on CoW filesystems has to be prevented. > dm-loop uses S_SWAPFILE to do so but that flag's limited to swap semantics > and is overloaded as such. > > Should we avoid the flag and enforce use of non-CoW filesystems for backing > or checks on non-CoW files (inode->i_flags)? No, ->bmap is fundamentally flawed. No new code should be using it, and we need to move the places still using it (most notably swap and the md bitmap code) off it. It can't deal with any kind of changes to the file system topology and is a risk to data corruption because if used in the I/O path it bypasses the file system entirely.. If it wasn't we'd use it in the loop driver.
On Mon, Mar 03, 2025 at 07:13:49AM -0800, Christoph Hellwig wrote: > No, ->bmap is fundamentally flawed. No new code should be using it, and > we need to move the places still using it (most notably swap and the md > bitmap code) off it. It can't deal with any kind of changes to the file > system topology and is a risk to data corruption because if used in the > I/O path it bypasses the file system entirely.. If it wasn't we'd use it > in the loop driver. Funnily, I was looking at the md bitmap code too. It's the last part of the kernel using buffer heads with not-a-pagecache page, so it's the only caller of alloc_page_buffers() remaining. I think it should just use the page cache to read/write the file data, but I haven't looked into it in detail.
On Mon, Mar 03, 2025 at 03:22:48PM +0000, Matthew Wilcox wrote: > Funnily, I was looking at the md bitmap code too. It's the last part of > the kernel using buffer heads with not-a-pagecache page, so it's the > only caller of alloc_page_buffers() remaining. > > I think it should just use the page cache to read/write the file data, > but I haven't looked into it in detail. The md bitmap code is awful as it abuses its own buffer heads in a way that is non-coherent with the underlying fs. It should just be using the vfs read/write helpers for in-kernel direct I/O with a scoped nofs context.
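For readers unfamiliar with the suggestion above: "the vfs read/write helpers ... with a scoped nofs context" means doing the file I/O through the normal in-kernel VFS entry points while telling the allocator not to recurse back into filesystem reclaim. A minimal hand-written sketch of the idea follows; the helper name read_bitmap_chunk() is invented for illustration and this is not code from any md bitmap rework (the direct-I/O details, such as opening the backing file O_DIRECT and meeting its alignment requirements, are glossed over here):

```c
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/sched/mm.h>

/*
 * Hypothetical helper: read one chunk of a bitmap file through the
 * in-kernel VFS helpers instead of private buffer heads.  The I/O goes
 * through the filesystem's own mapping/allocation code, so it stays
 * coherent with whatever the filesystem does to the file.
 */
static int read_bitmap_chunk(struct file *filp, void *buf, size_t len,
			     loff_t offset)
{
	unsigned int nofs_flags;
	ssize_t ret;

	/*
	 * The filesystem may allocate memory while servicing this read;
	 * forbid it from recursing into filesystem reclaim meanwhile.
	 */
	nofs_flags = memalloc_nofs_save();
	ret = kernel_read(filp, buf, len, &offset);
	memalloc_nofs_restore(nofs_flags);

	if (ret < 0)
		return ret;
	return ret == len ? 0 : -EIO;
}
```

Writes would use kernel_write() inside the same scoped section; the point of the suggestion is that coherency and block allocation remain the filesystem's problem, unlike the bmap()-based shortcut discussed in this thread.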
On Mon, 3 Mar 2025, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > > This is the dm-loop target - a replacement for the regular loop driver > > with better performance. The dm-loop target builds a map of the file in > > the constructor and it just remaps bios according to this map. > > Using ->bmap is broken and a no-go for new code. What should I use instead of bmap? Is fiemap exported for use in the kernel? ext4_bmap flushes the journal if journaled data are used. Is there some equivalent function that would provide the same guarantee w.r.t. journaled data? > If you have any real > performance issues with the loop driver document and report them so that > they can be fixed instead of working around them by duplicating the code > (and in this case making the new code completely broken). Would Jens Axboe agree to merge the dm-loop logic into the existing loop driver? Dm-loop is significantly faster than the regular loop: # modprobe brd rd_size=1048576 # dd if=/dev/zero of=/dev/ram0 bs=1048576 # mkfs.ext4 /dev/ram0 # mount -t ext4 /dev/ram0 /mnt/test # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 dm-loop (on /mnt/test/test): # dmsetup create loop --table '0 1048576 loop /mnt/test/test 0' # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/mapper/loop --name=job --rw=rw READ: bw=2428MiB/s (2546MB/s), 2428MiB/s-2428MiB/s (2546MB/s-2546MB/s), io=23.7GiB (25.5GB), run=10001-10001msec WRITE: bw=2429MiB/s (2547MB/s), 2429MiB/s-2429MiB/s (2547MB/s-2547MB/s), io=23.7GiB (25.5GB), run=10001-10001msec regular loop (on /mnt/test/test): # losetup /dev/loop0 /mnt/test/test # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/loop0 --name=job --rw=rw READ: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3259MiB (3417MB), run=10003-10003msec WRITE: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3260MiB (3418MB), run=10003-10003msec dm-loop is even slightly faster than running fio directly on the regular file: # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test/test --name=job --rw=rw READ: bw=2005MiB/s (2103MB/s), 2005MiB/s-2005MiB/s (2103MB/s-2103MB/s), io=19.6GiB (21.0GB), run=10002-10002msec WRITE: bw=2007MiB/s (2104MB/s), 2007MiB/s-2007MiB/s (2104MB/s-2104MB/s), io=19.6GiB (21.0GB), run=10002-10002msec Mikulas
On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > This is the dm-loop target - a replacement for the regular loop driver > with better performance. The dm-loop target builds a map of the file in > the constructor and it just remaps bios according to this map. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> This looks like it is based on my dm-loop patch from almost 20 years ago. Afaik the locking problems that hch pointed out back then are still valid now - stashing the bmap lookup in memory led to weird and hard to debug problems back then and I don't think this has changed. There are two other approaches that have been proposed in the meantime, Heinz posted one about 6 years ago that uses the VFS for IO to the backing file[1], and I posted another version in around 2012, this time based on splitting drivers/block/loop.c into front/backend interfaces and consuming the backend interface in dm-loop. These mechanisms don't have the performance gain of the in-memory block mapping but they are safe and simpler (and may allow for more code sharing with /dev/loop). Regards, Bryn. [1] https://patchwork.kernel.org/project/dm-devel/patch/1b84af841912065fc57cfe395d5214f4eee0f0fc.1516124587.git.heinzm@redhat.com/ This implements a loopback target for device mapper allowing a regular file to be treated as a block device. Signed-off-by: Bryn Reeves <breeves@redhat.com> =================================================================== diff --git a/drivers/md/dm-loop.c b/drivers/md/dm-loop.c new file mode 100644 index 0000000..71f0287 --- /dev/null +++ b/drivers/md/dm-loop.c @@ -0,0 +1,1000 @@ +/* + * Copyright (C) 2006 Red Hat, Inc. All rights reserved. + * + * This file is part of device-mapper. + * + * drivers/md/dm-loop.c + * + * Extent mapping implementation heavily influenced by mm/swapfile.c + * Bryn Reeves <breeves@redhat.com> + * + * File mapping and block lookup algorithms support by + * Heinz Mauelshagen <hjm@redhat.com>. + * + * This file is released under the GPL.
+ * + */ +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/fs.h> +#include <linux/module.h> +#include <linux/vmalloc.h> +#include <linux/syscalls.h> +#include <linux/workqueue.h> +#include <linux/file.h> +#include <linux/bio.h> + +#include "dm.h" +#include "dm-bio-list.h" + +static const char *version = "v0.419"; +#define DAEMON "kloopd" +#define DM_MSG_PREFIX "loop" + +enum flags { LOOP_BMAP, LOOP_FSIO }; + +/*-------------------------------------------------------------------- + * Loop context + *--------------------------------------------------------------------*/ + +struct loop_c { + unsigned long flags; + + /* information describing the backing store */ + struct file *filp; /* loop file handle */ + char *path; /* path argument */ + loff_t offset; /* offset argument */ + struct block_device *bdev; /* block device */ + unsigned blkbits; /* file system block size shift bits */ + + loff_t size; /* size of entire file in bytes */ + loff_t blocks; /* blocks allocated to loop file */ + sector_t mapped_sectors; /* size of mapped area in sectors*/ + + /* mapping */ + int (*map_fn)(struct dm_target*, struct bio*); + /* mapping function private data */ + void *map_data; +}; + +/* + * block map extent + */ +struct extent { + sector_t start; /* start sector in mapped device */ + sector_t to; /* start sector on target device */ + sector_t len; /* length in sectors */ +}; + +/* + * temporary extent list + */ +struct extent_list { + struct extent * extent; + struct list_head list; +}; + +struct kmem_cache *extent_cache; + +/* + * block map private context + */ +struct block_map_c { + int nr_extents; /* number of extents in map */ + struct extent **map; /* linear map of extent pointers */ + struct extent **mru; /* pointer to mru entry */ + spinlock_t mru_lock; /* protects mru */ +}; + +/* + * file map private context + */ +struct file_map_c { + spinlock_t lock; /* protects in */ + struct bio_list in; /* new bios for processing */ + struct bio_list work; /* bios queued for processing */ + struct workqueue_struct *wq; /* workqueue */ + struct work_struct ws; /* loop work */ + struct loop_c *loop; /* for filp & offset */ +}; + +/*-------------------------------------------------------------------- + * Generic helpers + *--------------------------------------------------------------------*/ + +static inline sector_t blk2sec(struct loop_c *lc, blkcnt_t block) +{ + return block << (lc->blkbits - SECTOR_SHIFT); +} + +static inline blkcnt_t sec2blk(struct loop_c *lc, sector_t sector) +{ + return sector >> (lc->blkbits - SECTOR_SHIFT); +} + +/*-------------------------------------------------------------------- + * File I/O helpers + *--------------------------------------------------------------------*/ + +/* + * transfer data to/from file using the read/write file_operations. + */ +static int fs_io(int rw, struct file *filp, loff_t *pos, + struct bio_vec *bv) +{ + ssize_t r; + void *ptr = kmap(bv->bv_page) + bv->bv_offset; + mm_segment_t old_fs = get_fs(); + + set_fs(get_ds()); + r = (rw == READ) ? filp->f_op->read(filp, ptr, bv->bv_len, pos) : + filp->f_op->write(filp, ptr, bv->bv_len, pos); + set_fs(old_fs); + kunmap(bv->bv_page); + return (r == bv->bv_len) ? 
0 : -EIO; +} + +/* + * Handle I/O for one bio + */ +static void do_one_bio(struct file_map_c *fc, struct bio *bio) +{ + int r = 0, rw = bio_data_dir(bio); + loff_t start = (bio->bi_sector << 9) + fc->loop->offset, + pos = start; + struct bio_vec *bv, *bv_end = bio->bi_io_vec + bio->bi_vcnt; + + for (bv = bio->bi_io_vec; bv < bv_end; bv++) { + r = fs_io(rw, fc->loop->filp, &pos, bv); + if (r) { + DMERR("%s error %d", rw ? "write":"read" , r); + break; + } + } + bio_endio(bio, pos - start, r); +} + +/* + * Worker thread for a 'file' type loop device + */ +static void do_loop_work(struct work_struct *ws) +{ + struct file_map_c *fc = container_of(ws, struct file_map_c, ws); + struct bio *bio; + + /* quickly grab all new bios queued and add them to the work list */ + spin_lock_irq(&fc->lock); + bio_list_merge_init(&fc->work, &fc->in); + spin_unlock_irq(&fc->lock); + + /* work the list and do file I/O on all bios */ + while ((bio = bio_list_pop(&fc->work))) + do_one_bio(fc, bio); +} + +/* + * Create work queue and initialize work + */ +static int loop_work_init(struct loop_c *lc) +{ + struct file_map_c *fc = lc->map_data; + fc->wq = create_singlethread_workqueue(DAEMON); + if (!fc->wq) + return -ENOMEM; + + return 0; +} + +/* + * Destroy work queue + */ +static void loop_work_exit(struct loop_c *lc) +{ + struct file_map_c *fc = lc->map_data; + if (fc->wq) + destroy_workqueue(fc->wq); +} + +/* + * LOOP_FSIO map_fn. Mapping just queues bios to the file map + * context and lets the daemon deal with them. + */ +static int loop_file_map(struct dm_target *ti, struct bio *bio) +{ + int wake; + struct loop_c *lc = ti->private; + struct file_map_c *fc = lc->map_data; + + spin_lock_irq(&fc->lock); + wake = bio_list_empty(&fc->in); + bio_list_add(&fc->in, bio); + spin_unlock_irq(&fc->lock); + + /* + * only call queue_work() if necessary to avoid + * superfluous preempt_{disable/enable}() overhead. + */ + if (wake) + queue_work(fc->wq, &fc->ws); + + /* handling bio -> will submit later */ + return 0; +} + +/* + * Shutdown the workqueue and free a file mapping + */ +static void destroy_file_map(struct loop_c *lc) +{ + loop_work_exit(lc); + kfree(lc->map_data); +} + +/* + * Set up a file map context and workqueue + */ +static int setup_file_map(struct loop_c *lc) +{ + struct file_map_c *fc = kzalloc(sizeof(*fc), GFP_KERNEL); + if (!fc) + return -ENOMEM; + + lc->map_data = fc; + spin_lock_init(&fc->lock); + bio_list_init(&fc->in); + bio_list_init(&fc->work); + INIT_WORK(&fc->ws, do_loop_work); + fc->loop = lc; + lc->map_fn = loop_file_map; + + return loop_work_init(lc); +} + +/*-------------------------------------------------------------------- + * Block I/O helpers + *--------------------------------------------------------------------*/ + +static int contains_sector(struct extent *e, sector_t s) +{ + if (likely(e)) + return s < (e->start + (e->len)) && + s >= e->start; + return 0; +} + +/* + * Walk over a linked list of extent_list structures, freeing them as + * we go. Does not free el->extent. + */ +static void destroy_extent_list(struct list_head *head) +{ + struct list_head *curr, *n; + + if (list_empty(head)) + return; + + list_for_each_safe(curr, n, head) { + struct extent_list *el; + el = list_entry(curr, struct extent_list, list); + list_del(curr); + kfree(el); + } +} + +/* + * Add a new extent to the tail of the list at *head with + * start/to/len parameters. Allocates from the extent cache. 
+ */ +static int list_add_extent(struct list_head *head, + sector_t start, sector_t to, sector_t len) +{ + struct extent *extent; + struct extent_list *list; + + extent = kmem_cache_alloc(extent_cache, GFP_KERNEL); + if(!extent) + goto out; + + list = kmalloc(sizeof(*list), GFP_KERNEL); + if(!list) + goto out; + + extent->start = start; + extent->to = to; + extent->len = len; + + list->extent = extent; + list_add_tail(&list->list, head); + + return 0; +out: + if (extent) + kmem_cache_free(extent_cache, extent); + return -ENOMEM; +} + +/* + * Return an extent range (i.e. beginning and ending physical block numbers). + */ +static int extent_range(struct inode * inode, + blkcnt_t logical_blk, blkcnt_t last_blk, + blkcnt_t *begin_blk, blkcnt_t *end_blk) +{ + sector_t dist = 0, phys_blk, probe_blk = logical_blk; + + /* Find beginning physical block of extent starting at logical_blk. */ + *begin_blk = phys_blk = bmap(inode, probe_blk); + if (!phys_blk) + return -ENXIO; + + for (; phys_blk == *begin_blk + dist; dist++) { + *end_blk = phys_blk; + if (++probe_blk > last_blk) + break; + + phys_blk = bmap(inode, probe_blk); + if (unlikely(!phys_blk)) + return -ENXIO; + } + + return 0; +} + +/* + * Create a sequential list of extents from an inode and return + * it in *head. On success return the number of extents found or + * -ERRNO on failure. + */ +static int loop_extents(struct loop_c *lc, struct inode *inode, + struct list_head *head) +{ + sector_t start = 0; + int r, nr_extents = 0; + blkcnt_t nr_blks = 0, begin_blk = 0, end_blk = 0; + blkcnt_t last_blk = sec2blk(lc, + (lc->mapped_sectors + (lc->offset >> 9))) - 1; + blkcnt_t logical_blk = sec2blk(lc, (lc->offset >> 9)); + + /* for each block in the mapped region */ + while (logical_blk <= last_blk) { + r = extent_range(inode, logical_blk, last_blk, + &begin_blk, &end_blk); + + /* sparse file fallback */ + if (unlikely(r)) { + DMWARN("%s has a hole; sparse file detected - " + "switching to filesystem I/O", lc->path); + clear_bit(LOOP_BMAP, &lc->flags); + set_bit(LOOP_FSIO, &lc->flags); + return r; + } + + nr_blks = 1 + end_blk - begin_blk; + + if (likely(nr_blks)) { + r = list_add_extent(head, start, + blk2sec(lc, begin_blk), + blk2sec(lc, nr_blks)); + + if (unlikely(r)) + return r; + + /* advance to next extent */ + nr_extents++; + start += blk2sec(lc, nr_blks); + logical_blk += nr_blks; + } + } + + return nr_extents; +} + +/* + * Walk over the extents in a block_map_c, returning them to the cache and + * freeing bc->map and bc. + */ +static void destroy_block_map(struct block_map_c *bc) +{ + unsigned i; + + if (!bc) + return; + + for (i = 0; i < bc->nr_extents; i++) + kmem_cache_free(extent_cache, bc->map[i]); + DMDEBUG("destroying block map of %d entries", i); + vfree(bc->map); + kfree(bc); +} + +/* + * Find an extent in *bc using binary search. Returns a pointer into the + * extent map. Calculate index as (extent - bc->map). + */ +static struct extent **extent_binary_lookup(struct block_map_c *bc, + struct extent **extent_mru, + sector_t sector) +{ + unsigned nr_extents = bc->nr_extents; + unsigned delta, dist, prev_dist = 0; + struct extent **eptr; + + /* Optimize lookup range based on MRU extent. 
*/ + dist = extent_mru - bc->map; + if ((*extent_mru)->start < sector) { + delta = (nr_extents - dist) / 2; + dist += delta; + } else + delta = dist = dist / 2; + + eptr = bc->map + dist; + while(*eptr && !contains_sector(*eptr, sector)) { + if (sector >= (*eptr)->start + (*eptr)->len) { + prev_dist = dist; + if (delta > 1) + delta /= 2; + dist += delta; + } else { + delta = (dist - prev_dist) / 2; + if (!delta) + delta = 1; + dist -= delta; + } + eptr = bc->map + dist; + } + + return eptr; +} + +/* + * Lookup an extent for a sector using the mru cache and binary search. + */ +static struct extent *extent_lookup(struct block_map_c *bc, sector_t sector) +{ + struct extent **eptr; + + spin_lock_irq(&bc->mru_lock); + eptr = bc->mru; + spin_unlock_irq(&bc->mru_lock); + + if (contains_sector(*eptr, sector)) + return *eptr; + + eptr = extent_binary_lookup(bc, eptr, sector); + if (!eptr) + return NULL; + + spin_lock_irq(&bc->mru_lock); + bc->mru = eptr; + spin_unlock_irq(&bc->mru_lock); + + return *eptr; +} + +/* + * LOOP_BMAP map_fn. Looks up the sector in the extent map and + * rewrites the bio device and bi_sector fields. + */ +static int loop_block_map(struct dm_target *ti, struct bio *bio) +{ + struct loop_c *lc = ti->private; + struct extent *extent = extent_lookup(lc->map_data, bio->bi_sector); + + if (likely(extent)) { + bio->bi_bdev = lc->bdev; + bio->bi_sector = extent->to + + (bio->bi_sector - extent->start); + return 1; /* Done with bio -> submit */ + } + + DMERR("no matching extent in map for sector %llu", + (unsigned long long) bio->bi_sector + ti->begin); + BUG(); + + return -EIO; +} + +/* + * Turn an extent_list into a linear pointer map of nr_extents + 1 entries + * and set the final entry to NULL. + */ +static struct extent **build_extent_map(struct list_head *head, + int nr_extents, unsigned long *flags) +{ + unsigned map_size, cache_size; + struct extent **map, **curr; + struct list_head *pos; + + map_size = 1 +(sizeof(*map) * nr_extents); + cache_size = kmem_cache_size(extent_cache) * nr_extents; + + map = vmalloc(map_size); + curr = map; + + DMDEBUG("allocated extent map of %u %s for %d extents (%u %s)", + (map_size < 8192 ) ? map_size : map_size >> 10, + (map_size < 8192 ) ? "bytes" : "kilobytes", nr_extents, + (cache_size < 8192) ? cache_size : cache_size >> 10, + (cache_size < 8192) ? 
"bytes" : "kilobytes"); + + list_for_each(pos, head) { + struct extent_list *el; + el = list_entry(pos, struct extent_list, list); + *(curr++) = el->extent; + } + *curr = NULL; + + return map; +} + +/* + * Set up a block map context and extent map + */ +static int setup_block_map(struct loop_c *lc, struct inode *inode) +{ + int r, nr_extents; + struct block_map_c *bc; + LIST_HEAD(head); + + if (!inode || !inode->i_sb || !inode->i_sb->s_bdev) + return -ENXIO; + + /* build a linked list of extents in linear order */ + r = nr_extents = loop_extents(lc, inode, &head); + if (nr_extents < 1) + goto out; + + r = -ENOMEM; + bc = kzalloc(sizeof(*bc), GFP_KERNEL); + if(!bc) + goto out; + + /* create a linear map of pointers into the extent cache */ + bc->map = build_extent_map(&head, nr_extents, &lc->flags); + destroy_extent_list(&head); + + if (IS_ERR(bc->map)) { + r = PTR_ERR(bc->map); + goto out; + } + + spin_lock_init(&bc->mru_lock); + bc->mru = bc->map; + bc->nr_extents = nr_extents; + lc->bdev = inode->i_sb->s_bdev; + lc->map_data = bc; + lc->map_fn = loop_block_map; + + return 0; + +out: + return r; +} + +/*-------------------------------------------------------------------- + * Generic helpers + *--------------------------------------------------------------------*/ + +/* + * Invalidate all unlocked loop file pages + */ +static int loop_invalidate_file(struct file *filp) +{ + return invalidate_inode_pages(filp->f_mapping); +} + +/* + * acquire or release a "no-truncate" lock on *filp. + * We overload the S_SWAPFILE flag for loop targets because + * it provides the same no-truncate semantics we require, and + * holding onto i_sem is no longer an option. + */ +static void file_truncate_lock(struct file *filp) +{ + struct inode *inode = filp->f_mapping->host; + + mutex_lock(&inode->i_mutex); + inode->i_flags |= S_SWAPFILE; + mutex_unlock(&inode->i_mutex); +} + +static void file_truncate_unlock(struct file *filp) +{ + struct inode *inode = filp->f_mapping->host; + + mutex_lock(&inode->i_mutex); + inode->i_flags &= ~S_SWAPFILE; + mutex_unlock(&inode->i_mutex); +} + +/* + * Fill out split_io for taget backing store + */ +static void set_split_io(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + + if(test_bit(LOOP_BMAP, &lc->flags)) + /* Split I/O at block boundaries */ + ti->split_io = 1 << (lc->blkbits - SECTOR_SHIFT); + else + ti->split_io = 64; + + DMDEBUG("splitting io at %llu sector boundaries", + (unsigned long long) ti->split_io); +} + +/* + * Check that the loop file is regular and available. 
+ */ +static int loop_check_file(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + struct file *filp = lc->filp; + struct inode *inode = filp->f_mapping->host; + + if (!inode) + return -ENXIO; + + ti->error = "backing file must be a regular file"; + if (!S_ISREG(inode->i_mode)) + return -EINVAL; + + ti->error = "backing file is mapped into userspace for writing"; + if (mapping_writably_mapped(filp->f_mapping)) + return -EBUSY; + + if (mapping_mapped(filp->f_mapping)) + DMWARN("%s is mapped into userspace", lc->path); + + if (!inode->i_sb || !inode->i_sb->s_bdev) { + DMWARN("%s has no blockdevice - switching to filesystem I/O", lc->path); + clear_bit(LOOP_BMAP, &lc->flags); + set_bit(LOOP_FSIO, &lc->flags); + } + + ti->error = "backing file already in use"; + if (IS_SWAPFILE(inode)) + return -EBUSY; + + return 0; +} + +/* + * Check loop file size and store it in the loop context + */ +static int loop_setup_size(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + struct inode *inode = lc->filp->f_mapping->host; + int r = -EINVAL; + + lc->size = i_size_read(inode); + lc->blkbits = inode->i_blkbits; + + ti->error = "backing file is empty"; + if (!lc->size) + goto out; + + DMDEBUG("set backing file size to %llu", + (unsigned long long) lc->size); + + ti->error = "backing file cannot be less than one block in size"; + if (lc->size < (blk2sec(lc,1) << 9)) + goto out; + + ti->error = "loop file offset must be a multiple of fs blocksize"; + if (lc->offset & ((1 << lc->blkbits) - 1)) + goto out; + + ti->error = "loop file offset too large"; + if (lc->offset > (lc->size - (1 << 9))) + goto out; + + lc->mapped_sectors = (lc->size - lc->offset) >> 9; + DMDEBUG("set mapped sectors to %llu (%llu bytes)", + (unsigned long long) lc->mapped_sectors, + (lc->size - lc->offset)); + + if ((lc->offset + (lc->mapped_sectors << 9)) < lc->size) + DMWARN("not using %llu bytes in incomplete block at EOF", + lc->size - (lc->offset + (lc->mapped_sectors << 9))); + + ti->error = "mapped region cannot be smaller than target size"; + if (lc->size - lc->offset < (ti->len << 9)) + goto out; + + return 0; +out: + return r; +} + +/* + * release a loop file + */ +static void loop_put_file(struct file *filp) +{ + if (!filp) + return; + file_truncate_unlock(filp); + filp_close(filp, NULL); +} + +/* + * open loop file and perform type, availability and size checks. + */ +static int loop_get_file(struct dm_target *ti) +{ + int flags = ((dm_table_get_mode(ti->table) & FMODE_WRITE) ? + O_RDWR : O_RDONLY) | O_LARGEFILE; + struct loop_c *lc = ti->private; + struct file *filp; + int r = 0; + + ti->error = "could not open backing file"; + filp = filp_open(lc->path, flags, 0); + if (IS_ERR(filp)) + return PTR_ERR(filp); + lc->filp = filp; + r = loop_check_file(ti); + if (r) + goto out_put; + + r = loop_setup_size(ti); + if (r) + goto out_put; + + file_truncate_lock(filp); + return 0; + +out_put: + fput(filp); + return r; +} + +/* + * invalidate mapped pages belonging to the loop file + */ +void loop_flush(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + loop_invalidate_file(lc->filp); +} + +/*-------------------------------------------------------------------- + * Device-mapper target methods + *--------------------------------------------------------------------*/ + +/* + * Generic loop map function. 
Re-base I/O to target begin and submit + */ +static int loop_map(struct dm_target *ti, struct bio *bio, + union map_info *context) +{ + struct loop_c *lc = ti->private; + + if (unlikely(bio_barrier(bio))) + return -EOPNOTSUPP; + bio->bi_sector -= ti->begin; + if (lc->map_fn) + return lc->map_fn(ti, bio); + return -EIO; +} + +/* + * Block status helper + */ +static ssize_t loop_file_status(struct loop_c *lc, char *result, unsigned maxlen) +{ + ssize_t sz = 0; + struct file_map_c *fc = lc->map_data; + int qlen; + + spin_lock_irq(&fc->lock); + qlen = bio_list_nr(&fc->work); + qlen += bio_list_nr(&fc->in); + spin_unlock_irq(&fc->lock); + DMEMIT("file %d", qlen); + + return sz; +} + +/* + * File status helper + */ +static ssize_t loop_block_status(struct loop_c *lc, char *result, unsigned maxlen) +{ + ssize_t sz = 0; + struct block_map_c *bc = lc->map_data; + int mru; + + spin_lock_irq(&bc->mru_lock); + mru = bc->mru - bc->map; + spin_unlock_irq(&bc->mru_lock); + DMEMIT("block %d %d", bc->nr_extents, mru); + + return sz; +} + +/* + * This needs some thought on handling unlinked backing files. some parts of + * the kernel return a cached name (now invalid), while others return a dcache + * "/path/to/foo (deleted)" name (never was/is valid). Which is "better" is + * debatable. + * + * On the one hand, using a cached name gives table output which is directly + * usable assuming the user re-creates the unlinked image file, on the other + * it is more consistent with e.g. swap to use the dcache name. + * +*/ +static int loop_status(struct dm_target *ti, status_type_t type, + char *result, unsigned maxlen) +{ + struct loop_c *lc = (struct loop_c *) ti->private; + ssize_t sz = 0; + + switch (type) { + case STATUSTYPE_INFO: + if (test_bit(LOOP_BMAP, &lc->flags)) + sz += loop_block_status(lc, result, maxlen - sz); + else if (test_bit(LOOP_FSIO, &lc->flags)) + sz += loop_file_status(lc, result, maxlen - sz); + break; + + case STATUSTYPE_TABLE: + DMEMIT("%s %llu", lc->path, lc->offset); + break; + } + return 0; +} + +/* + * Destroy a loopback mapping + */ +static void loop_dtr(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + + if ((dm_table_get_mode(ti->table) & FMODE_WRITE)) + loop_invalidate_file(lc->filp); + + if (test_bit(LOOP_BMAP, &lc->flags) && lc->map_data) + destroy_block_map((struct block_map_c *)lc->map_data); + if (test_bit(LOOP_FSIO, &lc->flags) && lc->map_data) + destroy_file_map(lc); + + loop_put_file(lc->filp); + DMINFO("released file %s", lc->path); + + kfree(lc); +} + +/* + * Construct a loopback mapping: <path> <offset> + */ +static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv) +{ + struct loop_c *lc = NULL; + int r = -EINVAL; + + ti->error = "invalid argument count"; + if (argc != 2) + goto out; + + r = -ENOMEM; + ti->error = "cannot allocate loop context"; + lc = kzalloc(sizeof(*lc), GFP_KERNEL); + if (!lc) + goto out; + + /* default */ + set_bit(LOOP_BMAP, &lc->flags); + ti->error = "cannot allocate loop path"; + lc->path = kstrdup(argv[0], GFP_KERNEL); + if (!lc->path) + goto out; + + ti->private = lc; + + r = -EINVAL; + ti->error = "invalid file offset"; + if (sscanf(argv[1], "%lld", &lc->offset) != 1) + goto out; + + if (lc->offset) + DMDEBUG("setting file offset to %lld", lc->offset); + + /* open & check file and set size parameters */ + r = loop_get_file(ti); + + /* ti->error has been set by loop_get_file */ + if (r) + goto out; + + ti->error = "could not create loop mapping"; + if (test_bit(LOOP_BMAP, &lc->flags)) + r = setup_block_map(lc, 
lc->filp->f_mapping->host); + if (test_bit(LOOP_FSIO, &lc->flags)) + r = setup_file_map(lc); + + if (r) + goto out_putf; + + set_split_io(ti); + if (lc->bdev) + dm_set_device_limits(ti, lc->bdev); + + DMDEBUG("constructed loop target on %s " + "(%lldk, %llu sectors)", lc->path, + (lc->size >> 10), lc->mapped_sectors); + ti->error = NULL; + + return 0; + +out_putf: + loop_put_file(lc->filp); +out: + if(lc) + kfree(lc); + return r; +} + +static struct target_type loop_target = { + .name = "loop", + .version = {0, 0, 2}, + .module = THIS_MODULE, + .ctr = loop_ctr, + .dtr = loop_dtr, + .map = loop_map, + .presuspend = loop_flush, + .flush = loop_flush, + .status = loop_status, +}; + +/*-------------------------------------------------------------------- + * Module bits + *--------------------------------------------------------------------*/ +int __init dm_loop_init(void) +{ + int r; + + r = dm_register_target(&loop_target); + if (r < 0) { + DMERR("register failed %d", r); + goto out; + } + + r = -ENOMEM; + extent_cache = kmem_cache_create("extent_cache", sizeof(struct extent), + 0, SLAB_HWCACHE_ALIGN, NULL, NULL); + if (!extent_cache) + goto out; + + DMINFO("registered %s", version); + return 0; + +out: + if (extent_cache) + kmem_cache_destroy(extent_cache); + return r; +} + +void dm_loop_exit(void) +{ + int r; + + r = dm_unregister_target(&loop_target); + kmem_cache_destroy(extent_cache); + + if (r < 0) + DMERR("target unregister failed %d", r); + else + DMINFO("unregistered %s", version); +} + +module_init(dm_loop_init); +module_exit(dm_loop_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Bryn Reeves <breeves@redhat.com>"); +MODULE_DESCRIPTION("device-mapper loop target");
On Mon, 3 Mar 2025, Bryn M. Reeves wrote: > On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > > This is the dm-loop target - a replacement for the regular loop driver > > with better performance. The dm-loop target builds a map of the file in > > the constructor and it just remaps bios according to this map. > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > This looks like it is based on my dm-loop patch from almost 20 years > ago. Yes, it is. I copied some of the code from it. > Afaik the locking problems that hch pointed out back then are still > valid now - stashing the bmap lookup in memory lead to weird and hard to > debug problems back then and I don't think this has changed. The dm-loop driver does the same thing as swapfile activation - so I presume that if swapfile activation is correct, then dm-loop is also correct. The S_SWAPFILE flag prevents the kernel from truncating, extending or deleting the file - it just returns "Text file busy". Mikulas
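As a point of reference, the protection Mikulas is relying on amounts to checks of the following shape sprinkled through the VFS write/truncate paths. This is a simplified illustration of the pattern only, with an invented helper name, not a verbatim copy of any in-tree function:

```c
#include <linux/fs.h>
#include <linux/errno.h>

/*
 * Simplified illustration: once S_SWAPFILE is set on the inode,
 * paths that would change the file's size or mapping bail out
 * with -ETXTBSY ("Text file busy").
 */
static int may_modify_backing_file(struct inode *inode)
{
	if (IS_SWAPFILE(inode))
		return -ETXTBSY;
	return 0;
}
```

The old dm-loop patch quoted earlier in this thread sets and clears the flag in file_truncate_lock()/file_truncate_unlock(); the pushback later in the thread is not about whether the flag blocks truncation, but about whether freezing the file mapping this way is acceptable for a loop device at all.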
On Mon, Mar 03, 2025 at 06:06:36PM +0100, Mikulas Patocka wrote: > The dm-loop driver does the same thing as swapfile activation - so I > presume that if swapfile activation is correct, then dm-loop is also > correct. > > The S_SWAPFILE flag prevents the kernel from truncating, extending or > deleting the file - it just returns "Text file busy". Right: that was the idea, and it was lifted wholesale from swapfile.c. In my testing it seemed reliable, but we had reports from users of very occasional memory or file corruption iirc that we were never able to pin down and resolve. One user in the UK was running it in production for a couple of years on multiple systems - in that time they had a handful of crashes that we were never able to make much sense of. Regards, Bryn.
On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > What should I use instead of bmap? Is fiemap exported for use in the > kernel? You can't do an ahead of time mapping. It's a broken concept. > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > driver? What logic? > Dm-loop is significantly faster than the regular loop: > > # modprobe brd rd_size=1048576 > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > # mkfs.ext4 /dev/ram0 > # mount -t ext4 /dev/ram0 /mnt/test > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 All of this needs to be in a commit log. Also note that the above: a) does not use direct I/O which any sane loop user should b) is not on a file but a block device, rendering the use of a loop device entirely pointless.
On Mon, Mar 03, 2025 at 05:06:16PM +0100, Heinz Mauelshagen wrote: > On Mon, Mar 3, 2025 at 4:24 PM Christoph Hellwig <hch@infradead.org> wrote: > > > On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote: > > > dm-loop avoids the file system abstraction, thus gains the resulting > > > performance advantage versus the loop driver. > > > > How? > > > > It uses a predefined map and thus avoids VFS -> FS write path overhead > completely. But you simply can't use a "pre-defined map". It's fundamentally unsafe. > The inode fiemap() operation would avoid issues relative to outdated bmap(). No, using that for I/O is even worse.
On Mon, 3 Mar 2025, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > What should I use instead of bmap? Is fiemap exported for use in the > > kernel? > > You can't do an ahead of time mapping. It's a broken concept. Swapfile does ahead of time mapping. And I just looked at what swapfile does and copied the logic into dm-loop. If swapfile is not broken, how could dm-loop be broken? > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > driver? > > What logic? The ahead-of-time mapping. > > Dm-loop is significantly faster than the regular loop: > > > > # modprobe brd rd_size=1048576 > > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > > # mkfs.ext4 /dev/ram0 > > # mount -t ext4 /dev/ram0 /mnt/test > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 > > All of this needs to be in a commit log. Also note that the above: > > a) does not use direct I/O which any sane loop user should > b) is not on a file but a block device, rendering the use of a loop > device entirely pointless. With "losetup --direct-io=on /dev/loop0 /mnt/test/test", it is even slower than without: READ: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2170MiB (2275MB), run=10003-10003msec WRITE: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2169MiB (2274MB), run=10003-10003msec with --direct-io=off READ: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3978MiB (4171MB), run=10003-10003msec WRITE: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3981MiB (4175MB), run=10003-10003msec Mikulas
On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > What should I use instead of bmap? Is fiemap exported for use in the > > > kernel? > > > > You can't do an ahead of time mapping. It's a broken concept. > > Swapfile does ahead of time mapping. And I just looked at what swapfile > does and copied the logic into dm-loop. If swapfile is not broken, how > could dm-loop be broken? Swap files cannot be accessed/modified by user code once the swapfile is activated. See all the IS_SWAPFILE() checks throughout the VFS and filesystem code. Swap files must be fully allocated (i.e. not sparse) and must not contain shared extents. This is required so that writes to the swapfile do not require block allocation which would change the mapping... Hence we explicitly prevent modification of the underlying file mapping once a swapfile is owned and mapped by the kernel as a swapfile. That's not how loop devices/image files work - we actually rely on them being: a) sparse; and b) the mapping being mutable via direct access to the loop file whilst there is an active mounted filesystem on that loop file. and so every IO needs to be mapped through the filesystem at submission time. The reason for a) is obvious: we don't need to allocate space for the filesystem so it's effectively thin provisioned. Also, fstrim on the mounted loop device can punch out unused space in the mounted filesystem. The reason for b) is less obvious: snapshots via file cloning, deduplication via extent sharing. The clone operation is an atomic modification of the underlying file mapping, which then triggers COW on future writes to those mappings, which causes the mapping to change at write IO time. IOWs, the whole concept that there is a "static mapping" for a loop device image file for the life of the image file is fundamentally flawed. > > > Dm-loop is significantly faster than the regular loop: > > > > > > # modprobe brd rd_size=1048576 > > > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > > > # mkfs.ext4 /dev/ram0 > > > # mount -t ext4 /dev/ram0 /mnt/test > > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 Urk. ram disks are terrible for IO benchmarking. The IO is synchronous (i.e. always completes in the submitter context) and performance is -always CPU bound- due to the requirement for all data copying to be marshalled through the CPU. Please benchmark performance on NVMe SSDs - it will give a much more accurate demonstration of the performance differences we'll see in real world usage of loop device functionality... -Dave.
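One way to see the mutability Dave is describing is to watch a file's extent mapping before and after a reflink clone plus an overwrite on a COW-capable filesystem. The small userspace program below is a hypothetical demo written for that purpose, not something from the thread; note that the "physical" address FIEMAP reports is whatever the filesystem chooses to expose (as pointed out later in the thread, btrfs does not report raw device block numbers), but it is enough to observe the mapping move.

```c
/*
 * Hypothetical demo: print a file's first mapped extent via FIEMAP so
 * you can observe it change after a reflink clone plus an overwrite.
 * Illustration only; error handling is kept minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Room for the header plus a single extent record. */
	fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;		/* whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush so the map is current */
	fm->fm_extent_count = 1;	/* just the first extent */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	if (fm->fm_mapped_extents)
		printf("logical 0x%llx -> physical 0x%llx (len 0x%llx, flags 0x%x)\n",
		       (unsigned long long)fm->fm_extents[0].fe_logical,
		       (unsigned long long)fm->fm_extents[0].fe_physical,
		       (unsigned long long)fm->fm_extents[0].fe_length,
		       (unsigned)fm->fm_extents[0].fe_flags);
	else
		printf("no extents mapped (hole / sparse)\n");

	free(fm);
	close(fd);
	return 0;
}
```

Typical use on a reflink-enabled filesystem such as XFS or btrfs: run it on an image file, take a snapshot with "cp --reflink=always", overwrite the start of the original, and run it again; the first extent's reported address changes even though nothing touched any loop-level mapping.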
On Tue, 4 Mar 2025, Dave Chinner wrote: > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > > What should I use instead of bmap? Is fiemap exported for use in the > > > > kernel? > > > > > > You can't do an ahead of time mapping. It's a broken concept. > > > > Swapfile does ahead of time mapping. And I just looked at what swapfile > > does and copied the logic into dm-loop. If swapfile is not broken, how > > could dm-loop be broken? > > Swap files cannot be accessed/modified by user code once the > swapfile is activated. See all the IS_SWAPFILE() checked throughout > the VFS and filesystem code. > > Swap files must be fully allocated (i.e. not sparse), nor contan > shared extents. This is required so that writes to the swapfile do > not require block allocation which would change the mapping... > > Hence we explicitly prevent modification of the underlying file > mapping once a swapfile is owned and mapped by the kernel as a > swapfile. > > That's not how loop devices/image files work - we actually rely on > them being: > > a) sparse; and > b) the mapping being mutable via direct access to the loop file > whilst there is an active mounted filesystem on that loop file. > > and so every IO needs to be mapped through the filesystem at > submission time. > > The reason for a) is obvious: we don't need to allocate space for > the filesystem so it's effectively thin provisioned. Also, fstrim on > the mounted loop device can punch out unused space in the mounted > filesytsem. > > The reason for b) is less obvious: snapshots via file cloning, > deduplication via extent sharing. > > The clone operaiton is an atomic modification of the underlying file > mapping, which then triggers COW on future writes to those mappings, > which causes the mapping to the change at write IO time. > > IOWs, the whole concept that there is a "static mapping" for a loop > device image file for the life of the image file is fundamentally > flawed. I'm not trying to break existing loop. But some users don't use COW filesystems, some users use fully provisioned files, some users don't need to write to a file when it is being mapped - and for them dm-loop would be viable alternative because of better performance. The Android people concluded that loop is too slow and rather than using loop they want to map a file using a table with dm-linear targets over the image of the host filesystem. So, they are already doing what dm-loop is doing. Mikulas
On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > Swapfile does ahead of time mapping. It does. But it is: a) protected against modification by the S_SWAPFILE flag and checked for full allocation first b) something we want to get rid of because even with the above it is rather problematic > And I just looked at what swapfile > does and copied the logic into dm-loop. If swapfile is not broken, how > could dm-loop be broken? As said above, swapfile works around the brokenness in ways that you can't. And just blindly copying old code without understanding it is never a good idea. > > > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > > driver? > > > > What logic? > > The ahead-of-time mapping. As said multiple times you can't do that. The block mapping is file system private information.
On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote: > I'm not trying to break existing loop. You are f***cking breaking file system semantics. Stop it now. > The Android people concluded that loop is too slow and rather than using > loop they want to map a file using a table with dm-linear targets over the > image of the host filesystem. So, they are already doing what dm-loop is > doing. I've never seen bug reports from "The Android people". But maybe they just need to stop pointlessly using loop devices?
On Mon, Mar 03, 2025 at 06:34:45PM +0100, Heinz Mauelshagen wrote: > It is, unless you ensure that the pre-allocated and fully provisioned file is > immutable. Which you can't. > > > The inode fiemap() operation would avoid issues relative to outdated > > bmap(). > > > > No, using that for I/O is even worse. > > > > Are you saying it imposes more overhead than the VFS APIs? No, but the reporting in fiemap is even less defined than the already overly loose bmap. fiemap for btrfs for example doesn't report actual block numbers.
On Tue, Mar 04, 2025 at 03:50:36PM +0100, Heinz Mauelshagen wrote: > It seems that for a loop target to add the respective mapped-device semantics, the > way to go is to design it using the VFS API to get it upstream. > > That won't help the Android use case requested by Google, so they'll have to > stick to their multi-segment linear mapping approach to achieve their > performance goals, unless they want to support dm-loop out-of-tree to make their > life easier on immutable file backings. What are "the performance goals" and what is in the way of achieving them? This thread has massively blown up but no one has even tried to explain what the thing tries to address.
On Tue, Mar 04, 2025 at 06:01:04PM +0100, Heinz Mauelshagen wrote: > > As Mikulas shared earlier: > > The Android people concluded that loop is too slow and rather than using > > loop they want to map a file using a table with dm-linear targets over the > > image of the host filesystem. So, they are already doing what dm-loop is > > doing. Which again is completely handwavy. What is the workload? What are the issues with the workload? Where is the time spent making it slower? e.g. the extent mapping code in this patch is a lot less efficient than the btree based extent lookup in XFS. If it is faster, it is not the extent lookup but something else, and you need to figure out what it is.
On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote: > > > On Tue, 4 Mar 2025, Dave Chinner wrote: > > > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > > > > > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > > > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > > > What should I use instead of bmap? Is fiemap exported for use in the > > > > > kernel? > > > > > > > > You can't do an ahead of time mapping. It's a broken concept. > > > > > > Swapfile does ahead of time mapping. And I just looked at what swapfile > > > does and copied the logic into dm-loop. If swapfile is not broken, how > > > could dm-loop be broken? > > > > Swap files cannot be accessed/modified by user code once the > > swapfile is activated. See all the IS_SWAPFILE() checked throughout > > the VFS and filesystem code. > > > > Swap files must be fully allocated (i.e. not sparse), nor contan > > shared extents. This is required so that writes to the swapfile do > > not require block allocation which would change the mapping... > > > > Hence we explicitly prevent modification of the underlying file > > mapping once a swapfile is owned and mapped by the kernel as a > > swapfile. > > > > That's not how loop devices/image files work - we actually rely on > > them being: > > > > a) sparse; and > > b) the mapping being mutable via direct access to the loop file > > whilst there is an active mounted filesystem on that loop file. > > > > and so every IO needs to be mapped through the filesystem at > > submission time. > > > > The reason for a) is obvious: we don't need to allocate space for > > the filesystem so it's effectively thin provisioned. Also, fstrim on > > the mounted loop device can punch out unused space in the mounted > > filesytsem. > > > > The reason for b) is less obvious: snapshots via file cloning, > > deduplication via extent sharing. > > > > The clone operaiton is an atomic modification of the underlying file > > mapping, which then triggers COW on future writes to those mappings, > > which causes the mapping to the change at write IO time. > > > > IOWs, the whole concept that there is a "static mapping" for a loop > > device image file for the life of the image file is fundamentally > > flawed. > > I'm not trying to break existing loop. I didn't say you were. I said the concept that dm-loop is based on is fundamentally flawed and that your benchmark setup does not reflect real world usage of loop devices. > But some users don't use COW filesystems, some users use fully provisioned > files, some users don't need to write to a file when it is being mapped - > and for them dm-loop would be viable alternative because of better > performance. Nothing has changed since 2008 when this "fast file mapping" thing was first proposed and dm-loop made it's first appearance in this thread: https://lore.kernel.org/linux-fsdevel/20080109085231.GE6650@kernel.dk/ Let me quote Christoph's response to Jen's proposed static mapping for the loop device patch back in 2008: | And the way this is done is simply broken. It means you have to get | rid of things like delayed or unwritten hands beforehand, it'll be | a complete pain for COW or non-block backed filesystems. | | The right way to do this is to allow direct I/O from kernel sources | where the filesystem is in-charge of submitting the actual I/O after | the pages are handed to it. I think Peter Zijlstra has been looking | into something like that for swap over nfs. 
Jens also said this about dm-loop in that thread: } Why oh why does dm always insist to reinvent everything? That's bad } enough in itself, but on top of that most of the extra stuff ends up } being essentially unmaintained. } } If we instead improve loop, everyone wins. } } Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit } outside your own sandbox. You - personally - were also told directly by Jens back then that dm-loop's approach simply does not work for filesystems that move blocks around. i.e. it isn't a viable approach. Nothing has changed - it still isn't a viable approach for loopback devices for the same reasons it wasn't viable in 2008. > The Android people concluded that loop is too slow and rather than using > loop they want to map a file using a table with dm-linear targets over the > image of the host filesystem. So, they are already doing what dm-loop is > doing. I don't care if a downstream kernel is doing something stupid with their kernels. Where are the bug reports about the loop device being slow and the analysis that indicates that it is unfixable? The fact is that AIO+DIO through filesystems like XFS performs generally within 1-2% of the underlying block device capabilities. Hence if there's a problem with loop device performance, it isn't in the backing file IO submission path. Find out why loop device AIO+DIO is slow for the workload you are testing and fix that. This way everyone who already uses loop devices benefits (as Jens said in 2008), and the Android folk can get rid of their hacky mapping setup.... -Dave.
> I didn't say you were. I said the concept that dm-loop is based on > is fundamentally flawed and that your benchmark setup does not > reflect real world usage of loop devices. > Where are the bug reports about the loop device being slow and the > analysis that indicates that it is unfixable? So, I did benchmarks on an enterprise nvme drive (SAMSUNG MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. synchronous I/O: fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw raw block device: READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec ext4/loop/ext4: READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec xfs/loop/xfs: READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec ext4/dm-loop/ext4: READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec xfs/dm-loop/xfs: READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec asynchronous I/O: fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw raw block device: READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec ext4/loop/ext4: READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec xfs/loop/xfs: READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec ext4/dm-loop/ext4: READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec xfs/dm-loop/xfs: READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec Mikulas
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > I didn't say you were. I said the concept that dm-loop is based on > > is fundamentally flawed and that your benchmark setup does not > > reflect real world usage of loop devices. > > > Where are the bug reports about the loop device being slow and the > > analysis that indicates that it is unfixable? > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. Are you running the loop device in directio mode? The default is to use buffered io, which wastes pagecache /and/ sometimes trips dirty limits throttling. The loopdev tests in fstests get noticeably faster if I force directio mode. --D > synchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > xfs/loop/xfs: > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > ext4/dm-loop/ext4: > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > asynchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > xfs/loop/xfs: > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > ext4/dm-loop/ext4: > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), 
io=11.8GiB (12.7GB), run=10001-10001msec > > Mikulas > >
On Fri, 7 Mar 2025, Darrick J. Wong wrote:

> On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > I didn't say you were. I said the concept that dm-loop is based on
> > > is fundamentally flawed and that your benchmark setup does not
> > > reflect real world usage of loop devices.
> >
> > > Where are the bug reports about the loop device being slow and the
> > > analysis that indicates that it is unfixable?
> >
> > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> Are you running the loop device in directio mode? The default is to use
> buffered io, which wastes pagecache /and/ sometimes trips dirty limits
> throttling. The loopdev tests in fstests get noticeably faster if I
> force directio mode.

Yes, I am. I set it up with:

losetup --direct-io=on /dev/loop0 /mnt/test/l
mount -t xfs /dev/loop0 /mnt/test2

I double checked it and I got the same results.

Mikulas

> --D
>
> > synchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > xfs/loop/xfs:
> > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > ext4/dm-loop/ext4:
> > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> >
> > asynchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > xfs/loop/xfs:
> > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > ext4/dm-loop/ext4:
> > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> >
> > Mikulas
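As a side note, direct-io mode can be verified from userspace; a minimal sketch, assuming the same /dev/loop0 device as in the setup above:

# losetup --list /dev/loop0        # the DIO column should read 1
# cat /sys/block/loop0/loop/dio    # 1 means the loop driver issues direct I/O to the backing file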
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> xfs/loop/xfs:
> READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> ext4/dm-loop/ext4:
> READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec

Hi Mikulas,

Please try the following patchset:

https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/

which tries to handle I/O directly in the current context via NOWAIT, and adds
multi-queue (MQ) support to loop, since your test runs 12 I/O jobs. With this
change, I can observe similar performance numbers for the raw block device and
for loop/xfs over mq-virtio-scsi & nvme in my test VM.

1) first try a single queue: `modprobe loop`
2) then try MQ: `modprobe loop nr_hw_queues=4`

If that still doesn't close the gap, please provide the fio logs for both the
raw block device and loop/xfs; they may give some clue about the big performance
difference.

Thanks,
Ming
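Spelled out, the suggested re-test would look roughly like the sketch below. It assumes Ming's patchset is applied (the nr_hw_queues module parameter comes from that series, not from the mainline loop driver) and reuses the backing file and mount points from the benchmark above:

# modprobe -r loop                    # only works while no loop device is in use
# modprobe loop nr_hw_queues=4        # second run; the first run uses plain 'modprobe loop'
# losetup --direct-io=on /dev/loop0 /mnt/test/l
# mount -t xfs /dev/loop0 /mnt/test2
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l --name=job --rw=rw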
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec

BTW, "raw block device" is supposed to mean xfs or ext4 on top of the raw block
device, right? If not, please provide test data for that case too, so that it
becomes a fair comparison, because there should be lock contention for
filesystem write I/O on the same file.

Thanks,
Ming
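For clarity, the baseline Ming is asking for would run fio against a file on a filesystem created directly on the NVMe device, so that it pays the same filesystem locking cost as the loop and dm-loop cases; a sketch with placeholder device and mount-point names:

# mkfs.xfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/raw
# dd if=/dev/zero of=/mnt/raw/testfile bs=1M count=20480
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/raw/testfile --name=job --rw=rw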
Index: linux-2.6/drivers/md/Kconfig
===================================================================
--- linux-2.6.orig/drivers/md/Kconfig	2025-03-02 21:09:46.000000000 +0100
+++ linux-2.6/drivers/md/Kconfig	2025-03-02 21:09:46.000000000 +0100
@@ -646,6 +646,15 @@ config DM_ZONED
 
 	  If unsure, say N.
 
+config DM_LOOP
+	tristate "Loop target"
+	depends on BLK_DEV_DM
+	help
+	  This device-mapper target allows you to treat a regular file as
+	  a block device.
+
+	  If unsure, say N.
+
 config DM_AUDIT
 	bool "DM audit events"
 	depends on BLK_DEV_DM
Index: linux-2.6/drivers/md/Makefile
===================================================================
--- linux-2.6.orig/drivers/md/Makefile	2025-03-02 21:09:46.000000000 +0100
+++ linux-2.6/drivers/md/Makefile	2025-03-02 21:09:46.000000000 +0100
@@ -79,6 +79,7 @@ obj-$(CONFIG_DM_CLONE) += dm-clone.o
 obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o
 obj-$(CONFIG_DM_ZONED) += dm-zoned.o
+obj-$(CONFIG_DM_LOOP) += dm-loop.o
 obj-$(CONFIG_DM_WRITECACHE) += dm-writecache.o
 obj-$(CONFIG_SECURITY_LOADPIN_VERITY) += dm-verity-loadpin.o
 
Index: linux-2.6/drivers/md/dm-loop.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/drivers/md/dm-loop.c	2025-03-02 21:41:36.000000000 +0100
@@ -0,0 +1,404 @@
+#include <linux/device-mapper.h>
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+
+#define DM_MSG_PREFIX "loop"
+
+struct loop_c {
+	struct file *filp;
+	char *path;
+	loff_t offset;
+	struct block_device *bdev;
+	struct inode *inode;
+	unsigned blkbits;
+	bool read_only;
+	sector_t mapped_sectors;
+
+	sector_t nr_extents;
+	struct dm_loop_extent *map;
+};
+
+struct dm_loop_extent {
+	sector_t start;	/* start sector in mapped device */
+	sector_t to;	/* start sector on target device */
+	sector_t len;	/* length in sectors */
+};
+
+static sector_t blk2sect(struct loop_c *lc, blkcnt_t block)
+{
+	return block << (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk(struct loop_c *lc, sector_t sector)
+{
+	return sector >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk_roundup(struct loop_c *lc, sector_t sector)
+{
+	return (sector + (1 << (lc->blkbits - SECTOR_SHIFT)) - 1) >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+static struct dm_loop_extent *extent_binary_lookup(struct loop_c *lc, sector_t sector)
+{
+	ssize_t first = 0;
+	ssize_t last = lc->nr_extents - 1;
+
+	while (first <= last) {
+		ssize_t middle = (first + last) >> 1;
+		struct dm_loop_extent *ex = &lc->map[middle];
+		if (sector < ex->start) {
+			last = middle - 1;
+			continue;
+		}
+		if (likely(sector >= ex->start + ex->len)) {
+			first = middle + 1;
+			continue;
+		}
+		return ex;
+	}
+
+	return NULL;
+}
+
+static int loop_map(struct dm_target *ti, struct bio *bio)
+{
+	struct loop_c *lc = ti->private;
+	sector_t sector, len;
+	struct dm_loop_extent *ex;
+
+	sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+	ex = extent_binary_lookup(lc, sector);
+	if (!ex)
+		return DM_MAPIO_KILL;
+
+	bio_set_dev(bio, lc->bdev);
+	bio->bi_iter.bi_sector = ex->to + (sector - ex->start);
+	len = ex->len - (sector - ex->start);
+	if (len < bio_sectors(bio))
+		dm_accept_partial_bio(bio, len);
+
+	if (unlikely(!ex->to)) {
+		if (unlikely(!lc->read_only))
+			return DM_MAPIO_KILL;
+		zero_fill_bio(bio);
+		bio_endio(bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	return DM_MAPIO_REMAPPED;
+}
+
+static void loop_status(struct dm_target *ti, status_type_t type,
+			unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct loop_c *lc = ti->private;
+	size_t sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = '\0';
+		break;
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s %llu", lc->path, lc->offset);
+		break;
+	case STATUSTYPE_IMA:
+		DMEMIT_TARGET_NAME_VERSION(ti->type);
+		DMEMIT(",file_name=%s,offset=%llu;", lc->path, lc->offset);
+		break;
+	}
+}
+
+static int loop_iterate_devices(struct dm_target *ti,
+				iterate_devices_callout_fn fn, void *data)
+{
+	return 0;
+}
+
+static int extent_range(struct loop_c *lc,
+			sector_t logical_blk, sector_t last_blk,
+			sector_t *begin_blk, sector_t *nr_blks,
+			char **error)
+{
+	sector_t dist = 0, phys_blk, probe_blk = logical_blk;
+	int r;
+
+	/* Find beginning physical block of extent starting at logical_blk. */
+	*begin_blk = probe_blk;
+	*nr_blks = 0;
+	r = bmap(lc->inode, begin_blk);
+	if (r) {
+		*error = "bmap failed";
+		return r;
+	}
+	if (!*begin_blk) {
+		if (!lc->read_only) {
+			*error = "File is sparse";
+			return -ENXIO;
+		}
+	}
+
+	for (phys_blk = *begin_blk; phys_blk == *begin_blk + dist; dist += !!*begin_blk) {
+		cond_resched();
+
+		(*nr_blks)++;
+		if (++probe_blk > last_blk)
+			break;
+
+		phys_blk = probe_blk;
+		r = bmap(lc->inode, &phys_blk);
+		if (r) {
+			*error = "bmap failed";
+			return r;
+		}
+		if (unlikely(!phys_blk)) {
+			if (!lc->read_only) {
+				*error = "File is sparse";
+				return -ENXIO;
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int loop_extents(struct loop_c *lc, sector_t *nr_extents,
+			struct dm_loop_extent *map, char **error)
+{
+	int r;
+	sector_t start = 0;
+	sector_t nr_blks, begin_blk;
+	sector_t after_last_blk = sec2blk_roundup(lc,
+			(lc->mapped_sectors + (lc->offset >> 9)));
+	sector_t logical_blk = sec2blk(lc, lc->offset >> 9);
+
+	*nr_extents = 0;
+
+	/* for each block in the mapped region */
+	while (logical_blk < after_last_blk) {
+		r = extent_range(lc, logical_blk, after_last_blk - 1,
+				 &begin_blk, &nr_blks, error);
+
+		if (unlikely(r))
+			return r;
+
+		if (map) {
+			if (*nr_extents >= lc->nr_extents) {
+				*error = "The file changed while mapping it";
+				return -EBUSY;
+			}
+			map[*nr_extents].start = start;
+			map[*nr_extents].to = blk2sect(lc, begin_blk);
+			map[*nr_extents].len = blk2sect(lc, nr_blks);
+		}
+
+		(*nr_extents)++;
+		start += blk2sect(lc, nr_blks);
+		logical_blk += nr_blks;
+	}
+
+	if (*nr_extents != lc->nr_extents) {
+		*error = "The file changed while mapping it";
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+static int setup_block_map(struct loop_c *lc, struct dm_target *ti)
+{
+	int r;
+	sector_t n_file_sectors, offset_sector, nr_extents_tmp;
+
+	if (!S_ISREG(lc->inode->i_mode) || !lc->inode->i_sb || !lc->inode->i_sb->s_bdev) {
+		ti->error = "The file is not a regular file";
+		return -ENXIO;
+	}
+
+	lc->bdev = lc->inode->i_sb->s_bdev;
+	lc->blkbits = lc->inode->i_blkbits;
+	n_file_sectors = i_size_read(lc->inode) >> lc->blkbits << (lc->blkbits - 9);
+
+	if (lc->offset & ((1 << lc->blkbits) - 1)) {
+		ti->error = "Unaligned offset";
+		return -EINVAL;
+	}
+	offset_sector = lc->offset >> 9;
+	if (offset_sector >= n_file_sectors) {
+		ti->error = "Offset is greater than file size";
+		return -EINVAL;
+	}
+	if (ti->len > (n_file_sectors - offset_sector)) {
+		ti->error = "Target maps area after file end";
+		return -EINVAL;
+	}
+	lc->mapped_sectors = ti->len >> (lc->blkbits - 9) << (lc->blkbits - 9);
+
+	r = loop_extents(lc, &lc->nr_extents, NULL, &ti->error);
+	if (r)
+		return r;
+
+	if (lc->nr_extents != (size_t)lc->nr_extents) {
+		ti->error = "Too many extents";
+		return -EOVERFLOW;
+	}
+
+	lc->map = kvcalloc(lc->nr_extents, sizeof(struct dm_loop_extent), GFP_KERNEL);
+	if (!lc->map) {
+		ti->error = "Failed to allocate extent map";
+		return -ENOMEM;
+	}
+
+	r = loop_extents(lc, &nr_extents_tmp, lc->map, &ti->error);
+	if (r)
+		return r;
+
+	return 0;
+}
+
+static int loop_lock_inode(struct inode *inode)
+{
+	int r;
+	inode_lock(inode);
+	if (IS_SWAPFILE(inode)) {
+		inode_unlock(inode);
+		return -EBUSY;
+	}
+	inode->i_flags |= S_SWAPFILE;
+	r = inode_drain_writes(inode);
+	if (r) {
+		inode->i_flags &= ~S_SWAPFILE;
+		inode_unlock(inode);
+		return r;
+	}
+	inode_unlock(inode);
+	return 0;
+}
+
+static void loop_unlock_inode(struct inode *inode)
+{
+	inode_lock(inode);
+	inode->i_flags &= ~S_SWAPFILE;
+	inode_unlock(inode);
+}
+
+static void loop_free(struct loop_c *lc)
+{
+	if (!lc)
+		return;
+	if (!IS_ERR_OR_NULL(lc->filp)) {
+		loop_unlock_inode(lc->inode);
+		filp_close(lc->filp, NULL);
+	}
+	kvfree(lc->map);
+	kfree(lc->path);
+	kfree(lc);
+}
+
+static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct loop_c *lc = NULL;
+	int r;
+	char dummy;
+
+	if (argc != 2) {
+		r = -EINVAL;
+		ti->error = "Invalid number of arguments";
+		goto err;
+	}
+
+	lc = kzalloc(sizeof(*lc), GFP_KERNEL);
+	if (!lc) {
+		r = -ENOMEM;
+		ti->error = "Cannot allocate loop context";
+		goto err;
+	}
+	ti->private = lc;
+
+	lc->path = kstrdup(argv[0], GFP_KERNEL);
+	if (!lc->path) {
+		r = -ENOMEM;
+		ti->error = "Cannot allocate loop path";
+		goto err;
+	}
+
+	if (sscanf(argv[1], "%lld%c", &lc->offset, &dummy) != 1) {
+		r = -EINVAL;
+		ti->error = "Invalid file offset";
+		goto err;
+	}
+
+	lc->read_only = !(dm_table_get_mode(ti->table) & FMODE_WRITE);
+
+	lc->filp = filp_open(lc->path, lc->read_only ? O_RDONLY : O_RDWR, 0);
+	if (IS_ERR(lc->filp)) {
+		r = PTR_ERR(lc->filp);
+		ti->error = "Could not open backing file";
+		goto err;
+	}
+
+	lc->inode = lc->filp->f_mapping->host;
+
+	r = loop_lock_inode(lc->inode);
+	if (r) {
+		ti->error = "Could not lock inode";
+		goto err;
+	}
+
+	r = setup_block_map(lc, ti);
+	if (r) {
+		goto err;
+	}
+
+	return 0;
+
+err:
+	loop_free(lc);
+	return r;
+}
+
+static void loop_dtr(struct dm_target *ti)
+{
+	struct loop_c *lc = ti->private;
+	loop_free(lc);
+}
+
+static struct target_type loop_target = {
+	.name = "loop",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr = loop_ctr,
+	.dtr = loop_dtr,
+	.map = loop_map,
+	.status = loop_status,
+	.iterate_devices = loop_iterate_devices,
+};
+
+static int __init dm_loop_init(void)
+{
+	int r;
+
+	r = dm_register_target(&loop_target);
+	if (r < 0) {
+		DMERR("register failed %d", r);
+		goto err_target;
+	}
+
+	return 0;
+
+err_target:
+	return r;
+}
+
+static void __exit dm_loop_exit(void)
+{
+	dm_unregister_target(&loop_target);
+}
+
+module_init(dm_loop_init);
+module_exit(dm_loop_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
+MODULE_DESCRIPTION("device-mapper loop target");
This is the dm-loop target - a replacement for the regular loop driver with
better performance. The dm-loop target builds a map of the file in the
constructor and it just remaps bios according to this map.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/Kconfig   |    9 +
 drivers/md/Makefile  |    1
 drivers/md/dm-loop.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 414 insertions(+)
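For reference, a minimal sketch of activating the target from the patch above; the backing file name is a placeholder, and the last table argument is the byte offset into the file, which per loop_ctr()/setup_block_map() must be a multiple of the backing filesystem's block size:

# FILE=/mnt/test/backing.img
# OFFSET=1048576                                        # 1 MiB, block-aligned
# SECTORS=$(( ($(stat -c %s "$FILE") - OFFSET) / 512 )) # mapped length in 512-byte sectors
# dmsetup create loop0 --table "0 $SECTORS loop $FILE $OFFSET"

Note that a writable table requires the file to be fully allocated: the constructor refuses sparse files ("File is sparse") unless the device is read-only, in which case holes are served as zeroes.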