Message ID | 7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com (mailing list archive) |
---|---|
State | New |
Series | the dm-loop target |
On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > This is the dm-loop target - a replacement for the regular loop driver > with better performance. The dm-loop target builds a map of the file in > the constructor and it just remaps bios according to this map. Using ->bmap is broken and a no-go for new code. If you have any real performance issues with the loop driver, document and report them so that they can be fixed instead of working around them by duplicating the code (and in this case making the new code completely broken).
On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote: > dm-loop avoids the file system abstraction, thus gains the resulting > performance advantage versus the loop driver. How? > dm-loop obviously requires full provisioning, thus sparse files are being > detected and error handled. > > Additionally, block relocation on CoW filesystems has to be prevented. > dm-loop uses S_SWAPFILE to do so but that flag's limited to swap semantics > and is overloaded as such. > > Should we avoid the flag and enforce use of non-CoW filesystems for backing > or checks on non-CoW files (inode->i_flags)? No, ->bmap is fundamentally flawed. No new code should be using it, and we need to move the places still using it (most notably swap and the md bitmap code) off it. It can't deal with any kind of changes to the file system topology and is a risk to data corruption because if used in the I/O path it bypasses the file system entirely.. If it wasn't we'd use it in the loop driver.
On Mon, Mar 03, 2025 at 07:13:49AM -0800, Christoph Hellwig wrote: > No, ->bmap is fundamentally flawed. No new code should be using it, and > we need to move the places still using it (most notably swap and the md > bitmap code) off it. It can't deal with any kind of changes to the file > system topology and is a risk to data corruption because if used in the > I/O path it bypasses the file system entirely.. If it wasn't we'd use it > in the loop driver. Funnily, I was looking at the md bitmap code too. It's the last part of the kernel using buffer heads with not-a-pagecache page, so it's the only caller of alloc_page_buffers() remaining. I think it should just use the page cache to read/write the file data, but I haven't looked into it in detail.
On Mon, Mar 03, 2025 at 03:22:48PM +0000, Matthew Wilcox wrote: > Funnily, I was looking at the md bitmap code too. It's the last part of > the kernel using buffer heads with not-a-pagecache page, so it's the > only caller of alloc_page_buffers() remaining. > > I think it should just use the page cache to read/write the file data, > but I haven't looked into it in detail. The md bitmap code is awful as it abuses its own buffer heads in a way that is non-coherent with the underlying fs. It should just be using the vfs read/write helpers for in-kernel direct I/O with a scoped nofs context.
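For readers unfamiliar with the suggestion above: "the vfs read/write helpers ... with a scoped nofs context" means doing the file I/O through the normal in-kernel VFS entry points while telling the allocator not to recurse back into filesystem reclaim. A minimal hand-written sketch of the idea follows; the helper name read_bitmap_chunk() is invented for illustration and this is not code from any md bitmap rework (the direct-I/O details, such as opening the backing file O_DIRECT and meeting its alignment requirements, are glossed over here):

```c
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/sched/mm.h>

/*
 * Hypothetical helper: read one chunk of a bitmap file through the
 * in-kernel VFS helpers instead of private buffer heads.  The I/O goes
 * through the filesystem's own mapping/allocation code, so it stays
 * coherent with whatever the filesystem does to the file.
 */
static int read_bitmap_chunk(struct file *filp, void *buf, size_t len,
			     loff_t offset)
{
	unsigned int nofs_flags;
	ssize_t ret;

	/*
	 * The filesystem may allocate memory while servicing this read;
	 * forbid it from recursing into filesystem reclaim meanwhile.
	 */
	nofs_flags = memalloc_nofs_save();
	ret = kernel_read(filp, buf, len, &offset);
	memalloc_nofs_restore(nofs_flags);

	if (ret < 0)
		return ret;
	return ret == len ? 0 : -EIO;
}
```

Writes would use kernel_write() inside the same scoped section; the point of the suggestion is that coherency and block allocation remain the filesystem's problem, unlike the bmap()-based shortcut discussed in this thread.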
On Mon, 3 Mar 2025, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > > This is the dm-loop target - a replacement for the regular loop driver > > with better performance. The dm-loop target builds a map of the file in > > the constructor and it just remaps bios according to this map. > > Using ->bmap is broken and a no-go for new code. What should I use instead of bmap? Is fiemap exported for use in the kernel? ext4_bmap flushes the journal if journaled data are used. Is there some equivalent function that would provide the same guarantee w.r.t. journaled data? > If you have any real > performance issues with the loop driver document and report them so that > they can be fixed instead of working around them by duplicating the code > (and in this case making the new code completely broken). Would Jens Axboe agree to merge the dm-loop logic into the existing loop driver? Dm-loop is significantly faster than the regular loop: # modprobe brd rd_size=1048576 # dd if=/dev/zero of=/dev/ram0 bs=1048576 # mkfs.ext4 /dev/ram0 # mount -t ext4 /dev/ram0 /mnt/test # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 dm-loop (on /mnt/test/test): # dmsetup create loop --table '0 1048576 loop /mnt/test/test 0' # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/mapper/loop --name=job --rw=rw READ: bw=2428MiB/s (2546MB/s), 2428MiB/s-2428MiB/s (2546MB/s-2546MB/s), io=23.7GiB (25.5GB), run=10001-10001msec WRITE: bw=2429MiB/s (2547MB/s), 2429MiB/s-2429MiB/s (2547MB/s-2547MB/s), io=23.7GiB (25.5GB), run=10001-10001msec regular loop (on /mnt/test/test): # losetup /dev/loop0 /mnt/test/test # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/dev/loop0 --name=job --rw=rw READ: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3259MiB (3417MB), run=10003-10003msec WRITE: bw=326MiB/s (342MB/s), 326MiB/s-326MiB/s (342MB/s-342MB/s), io=3260MiB (3418MB), run=10003-10003msec dm-loop is even slightly faster than running fio directly on the regular file: # fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=16 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test/test --name=job --rw=rw READ: bw=2005MiB/s (2103MB/s), 2005MiB/s-2005MiB/s (2103MB/s-2103MB/s), io=19.6GiB (21.0GB), run=10002-10002msec WRITE: bw=2007MiB/s (2104MB/s), 2007MiB/s-2007MiB/s (2104MB/s-2104MB/s), io=19.6GiB (21.0GB), run=10002-10002msec Mikulas
On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > This is the dm-loop target - a replacement for the regular loop driver > with better performance. The dm-loop target builds a map of the file in > the constructor and it just remaps bios according to this map. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> This looks like it is based on my dm-loop patch from almost 20 years ago. Afaik the locking problems that hch pointed out back then are still valid now - stashing the bmap lookup in memory led to weird and hard to debug problems back then and I don't think this has changed. There are two other approaches that have been proposed in the meantime, Heinz posted one about 6 years ago that uses the VFS for IO to the backing file[1], and I posted another version in around 2012, this time based on splitting drivers/block/loop.c into front/backend interfaces and consuming the backend interface in dm-loop. These mechanisms don't have the performance gain of the in-memory block mapping but they are safe and simpler (and may allow for more code sharing with /dev/loop). Regards, Bryn. [1] https://patchwork.kernel.org/project/dm-devel/patch/1b84af841912065fc57cfe395d5214f4eee0f0fc.1516124587.git.heinzm@redhat.com/ This implements a loopback target for device mapper allowing a regular file to be treated as a block device. Signed-off-by: Bryn Reeves <breeves@redhat.com> =================================================================== diff --git a/drivers/md/dm-loop.c b/drivers/md/dm-loop.c new file mode 100644 index 0000000..71f0287 --- /dev/null +++ b/drivers/md/dm-loop.c @@ -0,0 +1,1000 @@ +/* + * Copyright (C) 2006 Red Hat, Inc. All rights reserved. + * + * This file is part of device-mapper. + * + * drivers/md/dm-loop.c + * + * Extent mapping implementation heavily influenced by mm/swapfile.c + * Bryn Reeves <breeves@redhat.com> + * + * File mapping and block lookup algorithms support by + * Heinz Mauelshagen <hjm@redhat.com>. + * + * This file is released under the GPL.
+ * + */ +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/fs.h> +#include <linux/module.h> +#include <linux/vmalloc.h> +#include <linux/syscalls.h> +#include <linux/workqueue.h> +#include <linux/file.h> +#include <linux/bio.h> + +#include "dm.h" +#include "dm-bio-list.h" + +static const char *version = "v0.419"; +#define DAEMON "kloopd" +#define DM_MSG_PREFIX "loop" + +enum flags { LOOP_BMAP, LOOP_FSIO }; + +/*-------------------------------------------------------------------- + * Loop context + *--------------------------------------------------------------------*/ + +struct loop_c { + unsigned long flags; + + /* information describing the backing store */ + struct file *filp; /* loop file handle */ + char *path; /* path argument */ + loff_t offset; /* offset argument */ + struct block_device *bdev; /* block device */ + unsigned blkbits; /* file system block size shift bits */ + + loff_t size; /* size of entire file in bytes */ + loff_t blocks; /* blocks allocated to loop file */ + sector_t mapped_sectors; /* size of mapped area in sectors*/ + + /* mapping */ + int (*map_fn)(struct dm_target*, struct bio*); + /* mapping function private data */ + void *map_data; +}; + +/* + * block map extent + */ +struct extent { + sector_t start; /* start sector in mapped device */ + sector_t to; /* start sector on target device */ + sector_t len; /* length in sectors */ +}; + +/* + * temporary extent list + */ +struct extent_list { + struct extent * extent; + struct list_head list; +}; + +struct kmem_cache *extent_cache; + +/* + * block map private context + */ +struct block_map_c { + int nr_extents; /* number of extents in map */ + struct extent **map; /* linear map of extent pointers */ + struct extent **mru; /* pointer to mru entry */ + spinlock_t mru_lock; /* protects mru */ +}; + +/* + * file map private context + */ +struct file_map_c { + spinlock_t lock; /* protects in */ + struct bio_list in; /* new bios for processing */ + struct bio_list work; /* bios queued for processing */ + struct workqueue_struct *wq; /* workqueue */ + struct work_struct ws; /* loop work */ + struct loop_c *loop; /* for filp & offset */ +}; + +/*-------------------------------------------------------------------- + * Generic helpers + *--------------------------------------------------------------------*/ + +static inline sector_t blk2sec(struct loop_c *lc, blkcnt_t block) +{ + return block << (lc->blkbits - SECTOR_SHIFT); +} + +static inline blkcnt_t sec2blk(struct loop_c *lc, sector_t sector) +{ + return sector >> (lc->blkbits - SECTOR_SHIFT); +} + +/*-------------------------------------------------------------------- + * File I/O helpers + *--------------------------------------------------------------------*/ + +/* + * transfer data to/from file using the read/write file_operations. + */ +static int fs_io(int rw, struct file *filp, loff_t *pos, + struct bio_vec *bv) +{ + ssize_t r; + void *ptr = kmap(bv->bv_page) + bv->bv_offset; + mm_segment_t old_fs = get_fs(); + + set_fs(get_ds()); + r = (rw == READ) ? filp->f_op->read(filp, ptr, bv->bv_len, pos) : + filp->f_op->write(filp, ptr, bv->bv_len, pos); + set_fs(old_fs); + kunmap(bv->bv_page); + return (r == bv->bv_len) ? 
0 : -EIO; +} + +/* + * Handle I/O for one bio + */ +static void do_one_bio(struct file_map_c *fc, struct bio *bio) +{ + int r = 0, rw = bio_data_dir(bio); + loff_t start = (bio->bi_sector << 9) + fc->loop->offset, + pos = start; + struct bio_vec *bv, *bv_end = bio->bi_io_vec + bio->bi_vcnt; + + for (bv = bio->bi_io_vec; bv < bv_end; bv++) { + r = fs_io(rw, fc->loop->filp, &pos, bv); + if (r) { + DMERR("%s error %d", rw ? "write":"read" , r); + break; + } + } + bio_endio(bio, pos - start, r); +} + +/* + * Worker thread for a 'file' type loop device + */ +static void do_loop_work(struct work_struct *ws) +{ + struct file_map_c *fc = container_of(ws, struct file_map_c, ws); + struct bio *bio; + + /* quickly grab all new bios queued and add them to the work list */ + spin_lock_irq(&fc->lock); + bio_list_merge_init(&fc->work, &fc->in); + spin_unlock_irq(&fc->lock); + + /* work the list and do file I/O on all bios */ + while ((bio = bio_list_pop(&fc->work))) + do_one_bio(fc, bio); +} + +/* + * Create work queue and initialize work + */ +static int loop_work_init(struct loop_c *lc) +{ + struct file_map_c *fc = lc->map_data; + fc->wq = create_singlethread_workqueue(DAEMON); + if (!fc->wq) + return -ENOMEM; + + return 0; +} + +/* + * Destroy work queue + */ +static void loop_work_exit(struct loop_c *lc) +{ + struct file_map_c *fc = lc->map_data; + if (fc->wq) + destroy_workqueue(fc->wq); +} + +/* + * LOOP_FSIO map_fn. Mapping just queues bios to the file map + * context and lets the daemon deal with them. + */ +static int loop_file_map(struct dm_target *ti, struct bio *bio) +{ + int wake; + struct loop_c *lc = ti->private; + struct file_map_c *fc = lc->map_data; + + spin_lock_irq(&fc->lock); + wake = bio_list_empty(&fc->in); + bio_list_add(&fc->in, bio); + spin_unlock_irq(&fc->lock); + + /* + * only call queue_work() if necessary to avoid + * superfluous preempt_{disable/enable}() overhead. + */ + if (wake) + queue_work(fc->wq, &fc->ws); + + /* handling bio -> will submit later */ + return 0; +} + +/* + * Shutdown the workqueue and free a file mapping + */ +static void destroy_file_map(struct loop_c *lc) +{ + loop_work_exit(lc); + kfree(lc->map_data); +} + +/* + * Set up a file map context and workqueue + */ +static int setup_file_map(struct loop_c *lc) +{ + struct file_map_c *fc = kzalloc(sizeof(*fc), GFP_KERNEL); + if (!fc) + return -ENOMEM; + + lc->map_data = fc; + spin_lock_init(&fc->lock); + bio_list_init(&fc->in); + bio_list_init(&fc->work); + INIT_WORK(&fc->ws, do_loop_work); + fc->loop = lc; + lc->map_fn = loop_file_map; + + return loop_work_init(lc); +} + +/*-------------------------------------------------------------------- + * Block I/O helpers + *--------------------------------------------------------------------*/ + +static int contains_sector(struct extent *e, sector_t s) +{ + if (likely(e)) + return s < (e->start + (e->len)) && + s >= e->start; + return 0; +} + +/* + * Walk over a linked list of extent_list structures, freeing them as + * we go. Does not free el->extent. + */ +static void destroy_extent_list(struct list_head *head) +{ + struct list_head *curr, *n; + + if (list_empty(head)) + return; + + list_for_each_safe(curr, n, head) { + struct extent_list *el; + el = list_entry(curr, struct extent_list, list); + list_del(curr); + kfree(el); + } +} + +/* + * Add a new extent to the tail of the list at *head with + * start/to/len parameters. Allocates from the extent cache. 
+ */ +static int list_add_extent(struct list_head *head, + sector_t start, sector_t to, sector_t len) +{ + struct extent *extent; + struct extent_list *list; + + extent = kmem_cache_alloc(extent_cache, GFP_KERNEL); + if(!extent) + goto out; + + list = kmalloc(sizeof(*list), GFP_KERNEL); + if(!list) + goto out; + + extent->start = start; + extent->to = to; + extent->len = len; + + list->extent = extent; + list_add_tail(&list->list, head); + + return 0; +out: + if (extent) + kmem_cache_free(extent_cache, extent); + return -ENOMEM; +} + +/* + * Return an extent range (i.e. beginning and ending physical block numbers). + */ +static int extent_range(struct inode * inode, + blkcnt_t logical_blk, blkcnt_t last_blk, + blkcnt_t *begin_blk, blkcnt_t *end_blk) +{ + sector_t dist = 0, phys_blk, probe_blk = logical_blk; + + /* Find beginning physical block of extent starting at logical_blk. */ + *begin_blk = phys_blk = bmap(inode, probe_blk); + if (!phys_blk) + return -ENXIO; + + for (; phys_blk == *begin_blk + dist; dist++) { + *end_blk = phys_blk; + if (++probe_blk > last_blk) + break; + + phys_blk = bmap(inode, probe_blk); + if (unlikely(!phys_blk)) + return -ENXIO; + } + + return 0; +} + +/* + * Create a sequential list of extents from an inode and return + * it in *head. On success return the number of extents found or + * -ERRNO on failure. + */ +static int loop_extents(struct loop_c *lc, struct inode *inode, + struct list_head *head) +{ + sector_t start = 0; + int r, nr_extents = 0; + blkcnt_t nr_blks = 0, begin_blk = 0, end_blk = 0; + blkcnt_t last_blk = sec2blk(lc, + (lc->mapped_sectors + (lc->offset >> 9))) - 1; + blkcnt_t logical_blk = sec2blk(lc, (lc->offset >> 9)); + + /* for each block in the mapped region */ + while (logical_blk <= last_blk) { + r = extent_range(inode, logical_blk, last_blk, + &begin_blk, &end_blk); + + /* sparse file fallback */ + if (unlikely(r)) { + DMWARN("%s has a hole; sparse file detected - " + "switching to filesystem I/O", lc->path); + clear_bit(LOOP_BMAP, &lc->flags); + set_bit(LOOP_FSIO, &lc->flags); + return r; + } + + nr_blks = 1 + end_blk - begin_blk; + + if (likely(nr_blks)) { + r = list_add_extent(head, start, + blk2sec(lc, begin_blk), + blk2sec(lc, nr_blks)); + + if (unlikely(r)) + return r; + + /* advance to next extent */ + nr_extents++; + start += blk2sec(lc, nr_blks); + logical_blk += nr_blks; + } + } + + return nr_extents; +} + +/* + * Walk over the extents in a block_map_c, returning them to the cache and + * freeing bc->map and bc. + */ +static void destroy_block_map(struct block_map_c *bc) +{ + unsigned i; + + if (!bc) + return; + + for (i = 0; i < bc->nr_extents; i++) + kmem_cache_free(extent_cache, bc->map[i]); + DMDEBUG("destroying block map of %d entries", i); + vfree(bc->map); + kfree(bc); +} + +/* + * Find an extent in *bc using binary search. Returns a pointer into the + * extent map. Calculate index as (extent - bc->map). + */ +static struct extent **extent_binary_lookup(struct block_map_c *bc, + struct extent **extent_mru, + sector_t sector) +{ + unsigned nr_extents = bc->nr_extents; + unsigned delta, dist, prev_dist = 0; + struct extent **eptr; + + /* Optimize lookup range based on MRU extent. 
*/ + dist = extent_mru - bc->map; + if ((*extent_mru)->start < sector) { + delta = (nr_extents - dist) / 2; + dist += delta; + } else + delta = dist = dist / 2; + + eptr = bc->map + dist; + while(*eptr && !contains_sector(*eptr, sector)) { + if (sector >= (*eptr)->start + (*eptr)->len) { + prev_dist = dist; + if (delta > 1) + delta /= 2; + dist += delta; + } else { + delta = (dist - prev_dist) / 2; + if (!delta) + delta = 1; + dist -= delta; + } + eptr = bc->map + dist; + } + + return eptr; +} + +/* + * Lookup an extent for a sector using the mru cache and binary search. + */ +static struct extent *extent_lookup(struct block_map_c *bc, sector_t sector) +{ + struct extent **eptr; + + spin_lock_irq(&bc->mru_lock); + eptr = bc->mru; + spin_unlock_irq(&bc->mru_lock); + + if (contains_sector(*eptr, sector)) + return *eptr; + + eptr = extent_binary_lookup(bc, eptr, sector); + if (!eptr) + return NULL; + + spin_lock_irq(&bc->mru_lock); + bc->mru = eptr; + spin_unlock_irq(&bc->mru_lock); + + return *eptr; +} + +/* + * LOOP_BMAP map_fn. Looks up the sector in the extent map and + * rewrites the bio device and bi_sector fields. + */ +static int loop_block_map(struct dm_target *ti, struct bio *bio) +{ + struct loop_c *lc = ti->private; + struct extent *extent = extent_lookup(lc->map_data, bio->bi_sector); + + if (likely(extent)) { + bio->bi_bdev = lc->bdev; + bio->bi_sector = extent->to + + (bio->bi_sector - extent->start); + return 1; /* Done with bio -> submit */ + } + + DMERR("no matching extent in map for sector %llu", + (unsigned long long) bio->bi_sector + ti->begin); + BUG(); + + return -EIO; +} + +/* + * Turn an extent_list into a linear pointer map of nr_extents + 1 entries + * and set the final entry to NULL. + */ +static struct extent **build_extent_map(struct list_head *head, + int nr_extents, unsigned long *flags) +{ + unsigned map_size, cache_size; + struct extent **map, **curr; + struct list_head *pos; + + map_size = 1 +(sizeof(*map) * nr_extents); + cache_size = kmem_cache_size(extent_cache) * nr_extents; + + map = vmalloc(map_size); + curr = map; + + DMDEBUG("allocated extent map of %u %s for %d extents (%u %s)", + (map_size < 8192 ) ? map_size : map_size >> 10, + (map_size < 8192 ) ? "bytes" : "kilobytes", nr_extents, + (cache_size < 8192) ? cache_size : cache_size >> 10, + (cache_size < 8192) ? 
"bytes" : "kilobytes"); + + list_for_each(pos, head) { + struct extent_list *el; + el = list_entry(pos, struct extent_list, list); + *(curr++) = el->extent; + } + *curr = NULL; + + return map; +} + +/* + * Set up a block map context and extent map + */ +static int setup_block_map(struct loop_c *lc, struct inode *inode) +{ + int r, nr_extents; + struct block_map_c *bc; + LIST_HEAD(head); + + if (!inode || !inode->i_sb || !inode->i_sb->s_bdev) + return -ENXIO; + + /* build a linked list of extents in linear order */ + r = nr_extents = loop_extents(lc, inode, &head); + if (nr_extents < 1) + goto out; + + r = -ENOMEM; + bc = kzalloc(sizeof(*bc), GFP_KERNEL); + if(!bc) + goto out; + + /* create a linear map of pointers into the extent cache */ + bc->map = build_extent_map(&head, nr_extents, &lc->flags); + destroy_extent_list(&head); + + if (IS_ERR(bc->map)) { + r = PTR_ERR(bc->map); + goto out; + } + + spin_lock_init(&bc->mru_lock); + bc->mru = bc->map; + bc->nr_extents = nr_extents; + lc->bdev = inode->i_sb->s_bdev; + lc->map_data = bc; + lc->map_fn = loop_block_map; + + return 0; + +out: + return r; +} + +/*-------------------------------------------------------------------- + * Generic helpers + *--------------------------------------------------------------------*/ + +/* + * Invalidate all unlocked loop file pages + */ +static int loop_invalidate_file(struct file *filp) +{ + return invalidate_inode_pages(filp->f_mapping); +} + +/* + * acquire or release a "no-truncate" lock on *filp. + * We overload the S_SWAPFILE flag for loop targets because + * it provides the same no-truncate semantics we require, and + * holding onto i_sem is no longer an option. + */ +static void file_truncate_lock(struct file *filp) +{ + struct inode *inode = filp->f_mapping->host; + + mutex_lock(&inode->i_mutex); + inode->i_flags |= S_SWAPFILE; + mutex_unlock(&inode->i_mutex); +} + +static void file_truncate_unlock(struct file *filp) +{ + struct inode *inode = filp->f_mapping->host; + + mutex_lock(&inode->i_mutex); + inode->i_flags &= ~S_SWAPFILE; + mutex_unlock(&inode->i_mutex); +} + +/* + * Fill out split_io for taget backing store + */ +static void set_split_io(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + + if(test_bit(LOOP_BMAP, &lc->flags)) + /* Split I/O at block boundaries */ + ti->split_io = 1 << (lc->blkbits - SECTOR_SHIFT); + else + ti->split_io = 64; + + DMDEBUG("splitting io at %llu sector boundaries", + (unsigned long long) ti->split_io); +} + +/* + * Check that the loop file is regular and available. 
+ */ +static int loop_check_file(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + struct file *filp = lc->filp; + struct inode *inode = filp->f_mapping->host; + + if (!inode) + return -ENXIO; + + ti->error = "backing file must be a regular file"; + if (!S_ISREG(inode->i_mode)) + return -EINVAL; + + ti->error = "backing file is mapped into userspace for writing"; + if (mapping_writably_mapped(filp->f_mapping)) + return -EBUSY; + + if (mapping_mapped(filp->f_mapping)) + DMWARN("%s is mapped into userspace", lc->path); + + if (!inode->i_sb || !inode->i_sb->s_bdev) { + DMWARN("%s has no blockdevice - switching to filesystem I/O", lc->path); + clear_bit(LOOP_BMAP, &lc->flags); + set_bit(LOOP_FSIO, &lc->flags); + } + + ti->error = "backing file already in use"; + if (IS_SWAPFILE(inode)) + return -EBUSY; + + return 0; +} + +/* + * Check loop file size and store it in the loop context + */ +static int loop_setup_size(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + struct inode *inode = lc->filp->f_mapping->host; + int r = -EINVAL; + + lc->size = i_size_read(inode); + lc->blkbits = inode->i_blkbits; + + ti->error = "backing file is empty"; + if (!lc->size) + goto out; + + DMDEBUG("set backing file size to %llu", + (unsigned long long) lc->size); + + ti->error = "backing file cannot be less than one block in size"; + if (lc->size < (blk2sec(lc,1) << 9)) + goto out; + + ti->error = "loop file offset must be a multiple of fs blocksize"; + if (lc->offset & ((1 << lc->blkbits) - 1)) + goto out; + + ti->error = "loop file offset too large"; + if (lc->offset > (lc->size - (1 << 9))) + goto out; + + lc->mapped_sectors = (lc->size - lc->offset) >> 9; + DMDEBUG("set mapped sectors to %llu (%llu bytes)", + (unsigned long long) lc->mapped_sectors, + (lc->size - lc->offset)); + + if ((lc->offset + (lc->mapped_sectors << 9)) < lc->size) + DMWARN("not using %llu bytes in incomplete block at EOF", + lc->size - (lc->offset + (lc->mapped_sectors << 9))); + + ti->error = "mapped region cannot be smaller than target size"; + if (lc->size - lc->offset < (ti->len << 9)) + goto out; + + return 0; +out: + return r; +} + +/* + * release a loop file + */ +static void loop_put_file(struct file *filp) +{ + if (!filp) + return; + file_truncate_unlock(filp); + filp_close(filp, NULL); +} + +/* + * open loop file and perform type, availability and size checks. + */ +static int loop_get_file(struct dm_target *ti) +{ + int flags = ((dm_table_get_mode(ti->table) & FMODE_WRITE) ? + O_RDWR : O_RDONLY) | O_LARGEFILE; + struct loop_c *lc = ti->private; + struct file *filp; + int r = 0; + + ti->error = "could not open backing file"; + filp = filp_open(lc->path, flags, 0); + if (IS_ERR(filp)) + return PTR_ERR(filp); + lc->filp = filp; + r = loop_check_file(ti); + if (r) + goto out_put; + + r = loop_setup_size(ti); + if (r) + goto out_put; + + file_truncate_lock(filp); + return 0; + +out_put: + fput(filp); + return r; +} + +/* + * invalidate mapped pages belonging to the loop file + */ +void loop_flush(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + loop_invalidate_file(lc->filp); +} + +/*-------------------------------------------------------------------- + * Device-mapper target methods + *--------------------------------------------------------------------*/ + +/* + * Generic loop map function. 
Re-base I/O to target begin and submit + */ +static int loop_map(struct dm_target *ti, struct bio *bio, + union map_info *context) +{ + struct loop_c *lc = ti->private; + + if (unlikely(bio_barrier(bio))) + return -EOPNOTSUPP; + bio->bi_sector -= ti->begin; + if (lc->map_fn) + return lc->map_fn(ti, bio); + return -EIO; +} + +/* + * Block status helper + */ +static ssize_t loop_file_status(struct loop_c *lc, char *result, unsigned maxlen) +{ + ssize_t sz = 0; + struct file_map_c *fc = lc->map_data; + int qlen; + + spin_lock_irq(&fc->lock); + qlen = bio_list_nr(&fc->work); + qlen += bio_list_nr(&fc->in); + spin_unlock_irq(&fc->lock); + DMEMIT("file %d", qlen); + + return sz; +} + +/* + * File status helper + */ +static ssize_t loop_block_status(struct loop_c *lc, char *result, unsigned maxlen) +{ + ssize_t sz = 0; + struct block_map_c *bc = lc->map_data; + int mru; + + spin_lock_irq(&bc->mru_lock); + mru = bc->mru - bc->map; + spin_unlock_irq(&bc->mru_lock); + DMEMIT("block %d %d", bc->nr_extents, mru); + + return sz; +} + +/* + * This needs some thought on handling unlinked backing files. some parts of + * the kernel return a cached name (now invalid), while others return a dcache + * "/path/to/foo (deleted)" name (never was/is valid). Which is "better" is + * debatable. + * + * On the one hand, using a cached name gives table output which is directly + * usable assuming the user re-creates the unlinked image file, on the other + * it is more consistent with e.g. swap to use the dcache name. + * +*/ +static int loop_status(struct dm_target *ti, status_type_t type, + char *result, unsigned maxlen) +{ + struct loop_c *lc = (struct loop_c *) ti->private; + ssize_t sz = 0; + + switch (type) { + case STATUSTYPE_INFO: + if (test_bit(LOOP_BMAP, &lc->flags)) + sz += loop_block_status(lc, result, maxlen - sz); + else if (test_bit(LOOP_FSIO, &lc->flags)) + sz += loop_file_status(lc, result, maxlen - sz); + break; + + case STATUSTYPE_TABLE: + DMEMIT("%s %llu", lc->path, lc->offset); + break; + } + return 0; +} + +/* + * Destroy a loopback mapping + */ +static void loop_dtr(struct dm_target *ti) +{ + struct loop_c *lc = ti->private; + + if ((dm_table_get_mode(ti->table) & FMODE_WRITE)) + loop_invalidate_file(lc->filp); + + if (test_bit(LOOP_BMAP, &lc->flags) && lc->map_data) + destroy_block_map((struct block_map_c *)lc->map_data); + if (test_bit(LOOP_FSIO, &lc->flags) && lc->map_data) + destroy_file_map(lc); + + loop_put_file(lc->filp); + DMINFO("released file %s", lc->path); + + kfree(lc); +} + +/* + * Construct a loopback mapping: <path> <offset> + */ +static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv) +{ + struct loop_c *lc = NULL; + int r = -EINVAL; + + ti->error = "invalid argument count"; + if (argc != 2) + goto out; + + r = -ENOMEM; + ti->error = "cannot allocate loop context"; + lc = kzalloc(sizeof(*lc), GFP_KERNEL); + if (!lc) + goto out; + + /* default */ + set_bit(LOOP_BMAP, &lc->flags); + ti->error = "cannot allocate loop path"; + lc->path = kstrdup(argv[0], GFP_KERNEL); + if (!lc->path) + goto out; + + ti->private = lc; + + r = -EINVAL; + ti->error = "invalid file offset"; + if (sscanf(argv[1], "%lld", &lc->offset) != 1) + goto out; + + if (lc->offset) + DMDEBUG("setting file offset to %lld", lc->offset); + + /* open & check file and set size parameters */ + r = loop_get_file(ti); + + /* ti->error has been set by loop_get_file */ + if (r) + goto out; + + ti->error = "could not create loop mapping"; + if (test_bit(LOOP_BMAP, &lc->flags)) + r = setup_block_map(lc, 
lc->filp->f_mapping->host); + if (test_bit(LOOP_FSIO, &lc->flags)) + r = setup_file_map(lc); + + if (r) + goto out_putf; + + set_split_io(ti); + if (lc->bdev) + dm_set_device_limits(ti, lc->bdev); + + DMDEBUG("constructed loop target on %s " + "(%lldk, %llu sectors)", lc->path, + (lc->size >> 10), lc->mapped_sectors); + ti->error = NULL; + + return 0; + +out_putf: + loop_put_file(lc->filp); +out: + if(lc) + kfree(lc); + return r; +} + +static struct target_type loop_target = { + .name = "loop", + .version = {0, 0, 2}, + .module = THIS_MODULE, + .ctr = loop_ctr, + .dtr = loop_dtr, + .map = loop_map, + .presuspend = loop_flush, + .flush = loop_flush, + .status = loop_status, +}; + +/*-------------------------------------------------------------------- + * Module bits + *--------------------------------------------------------------------*/ +int __init dm_loop_init(void) +{ + int r; + + r = dm_register_target(&loop_target); + if (r < 0) { + DMERR("register failed %d", r); + goto out; + } + + r = -ENOMEM; + extent_cache = kmem_cache_create("extent_cache", sizeof(struct extent), + 0, SLAB_HWCACHE_ALIGN, NULL, NULL); + if (!extent_cache) + goto out; + + DMINFO("registered %s", version); + return 0; + +out: + if (extent_cache) + kmem_cache_destroy(extent_cache); + return r; +} + +void dm_loop_exit(void) +{ + int r; + + r = dm_unregister_target(&loop_target); + kmem_cache_destroy(extent_cache); + + if (r < 0) + DMERR("target unregister failed %d", r); + else + DMINFO("unregistered %s", version); +} + +module_init(dm_loop_init); +module_exit(dm_loop_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Bryn Reeves <breeves@redhat.com>"); +MODULE_DESCRIPTION("device-mapper loop target");
On Mon, 3 Mar 2025, Bryn M. Reeves wrote: > On Mon, Mar 03, 2025 at 11:24:27AM +0100, Mikulas Patocka wrote: > > This is the dm-loop target - a replacement for the regular loop driver > > with better performance. The dm-loop target builds a map of the file in > > the constructor and it just remaps bios according to this map. > > > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > This looks like it is based on my dm-loop patch from almost 20 years > ago. Yes, it is. I copied some of the code from it. > Afaik the locking problems that hch pointed out back then are still > valid now - stashing the bmap lookup in memory lead to weird and hard to > debug problems back then and I don't think this has changed. The dm-loop driver does the same thing as swapfile activation - so I presume that if swapfile activation is correct, then dm-loop is also correct. The S_SWAPFILE flag prevents the kernel from truncating, extending or deleting the file - it just returns "Text file busy". Mikulas
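As a point of reference, the protection Mikulas is relying on amounts to checks of the following shape sprinkled through the VFS write/truncate paths. This is a simplified illustration of the pattern only, with an invented helper name, not a verbatim copy of any in-tree function:

```c
#include <linux/fs.h>
#include <linux/errno.h>

/*
 * Simplified illustration: once S_SWAPFILE is set on the inode,
 * paths that would change the file's size or mapping bail out
 * with -ETXTBSY ("Text file busy").
 */
static int may_modify_backing_file(struct inode *inode)
{
	if (IS_SWAPFILE(inode))
		return -ETXTBSY;
	return 0;
}
```

The old dm-loop patch quoted earlier in this thread sets and clears the flag in file_truncate_lock()/file_truncate_unlock(); the pushback later in the thread is not about whether the flag blocks truncation, but about whether freezing the file mapping this way is acceptable for a loop device at all.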
On Mon, Mar 03, 2025 at 06:06:36PM +0100, Mikulas Patocka wrote: > The dm-loop driver does the same thing as swapfile activation - so I > presume that if swapfile activation is correct, then dm-loop is also > correct. > > The S_SWAPFILE flag prevents the kernel from truncating, extending or > deleting the file - it just returns "Text file busy". Right: that was the idea, and it was lifted wholesale from swapfile.c. In my testing it seemed reliable, but we had reports from users of very occasional memory or file corruption iirc that we were never able to pin down and resolve. One user in the UK was running it in production for a couple of years on multiple systems - in that time they had a handful of crashes that we were never able to make much sense of. Regards, Bryn.
On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > What should I use instead of bmap? Is fiemap exported for use in the > kernel? You can't do an ahead of time mapping. It's a broken concept. > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > driver? What logic? > Dm-loop is significantly faster than the regular loop: > > # modprobe brd rd_size=1048576 > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > # mkfs.ext4 /dev/ram0 > # mount -t ext4 /dev/ram0 /mnt/test > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 All of this needs to be in a commit log. Also note that the above: a) does not use direct I/O which any sane loop user should b) is not on a file but a block device, rendering the use of a loop device entirely pointless.
On Mon, Mar 03, 2025 at 05:06:16PM +0100, Heinz Mauelshagen wrote: > On Mon, Mar 3, 2025 at 4:24 PM Christoph Hellwig <hch@infradead.org> wrote: > > > On Mon, Mar 03, 2025 at 03:57:23PM +0100, Heinz Mauelshagen wrote: > > > dm-loop avoids the file system abstraction, thus gains the resulting > > > performance advantage versus the loop driver. > > > > How? > > > > It uses a predefined map and thus avoids VFS -> FS write path overhead > completely. But you simply can't use a "pre-defined map". It's fundamentally unsafe. > The inode fiemap() operation would avoid issues relative to outdated bmap(). No, using that for I/O is even worse.
On Mon, 3 Mar 2025, Christoph Hellwig wrote: > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > What should I use instead of bmap? Is fiemap exported for use in the > > kernel? > > You can't do an ahead of time mapping. It's a broken concept. Swapfile does ahead of time mapping. And I just looked at what swapfile does and copied the logic into dm-loop. If swapfile is not broken, how could dm-loop be broken? > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > driver? > > What logic? The ahead-of-time mapping. > > Dm-loop is significantly faster than the regular loop: > > > > # modprobe brd rd_size=1048576 > > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > > # mkfs.ext4 /dev/ram0 > > # mount -t ext4 /dev/ram0 /mnt/test > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 > > All of this needs to be in a commit log. Also note that the above: > > a) does not use direct I/O which any sane loop user should > b) is not on a file but a block device, rendering the use of a loop > device entirely pointless. With "losetup --direct-io=on /dev/loop0 /mnt/test/test", it is even slower than without: READ: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2170MiB (2275MB), run=10003-10003msec WRITE: bw=217MiB/s (227MB/s), 217MiB/s-217MiB/s (227MB/s-227MB/s), io=2169MiB (2274MB), run=10003-10003msec with --direct-io=off READ: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3978MiB (4171MB), run=10003-10003msec WRITE: bw=398MiB/s (417MB/s), 398MiB/s-398MiB/s (417MB/s-417MB/s), io=3981MiB (4175MB), run=10003-10003msec Mikulas
On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > What should I use instead of bmap? Is fiemap exported for use in the > > > kernel? > > > > You can't do an ahead of time mapping. It's a broken concept. > > Swapfile does ahead of time mapping. And I just looked at what swapfile > does and copied the logic into dm-loop. If swapfile is not broken, how > could dm-loop be broken? Swap files cannot be accessed/modified by user code once the swapfile is activated. See all the IS_SWAPFILE() checks throughout the VFS and filesystem code. Swap files must be fully allocated (i.e. not sparse) and must not contain shared extents. This is required so that writes to the swapfile do not require block allocation which would change the mapping... Hence we explicitly prevent modification of the underlying file mapping once a swapfile is owned and mapped by the kernel as a swapfile. That's not how loop devices/image files work - we actually rely on them being: a) sparse; and b) the mapping being mutable via direct access to the loop file whilst there is an active mounted filesystem on that loop file. and so every IO needs to be mapped through the filesystem at submission time. The reason for a) is obvious: we don't need to allocate space for the filesystem so it's effectively thin provisioned. Also, fstrim on the mounted loop device can punch out unused space in the mounted filesystem. The reason for b) is less obvious: snapshots via file cloning, deduplication via extent sharing. The clone operation is an atomic modification of the underlying file mapping, which then triggers COW on future writes to those mappings, which causes the mapping to change at write IO time. IOWs, the whole concept that there is a "static mapping" for a loop device image file for the life of the image file is fundamentally flawed. > > > Dm-loop is significantly faster than the regular loop: > > > > > > # modprobe brd rd_size=1048576 > > > # dd if=/dev/zero of=/dev/ram0 bs=1048576 > > > # mkfs.ext4 /dev/ram0 > > > # mount -t ext4 /dev/ram0 /mnt/test > > > # dd if=/dev/zero of=/mnt/test/test bs=1048576 count=512 Urk. ram disks are terrible for IO benchmarking. The IO is synchronous (i.e. always completes in the submitter context) and performance is -always CPU bound- due to the requirement for all data copying to be marshalled through the CPU. Please benchmark performance on NVMe SSDs - it will give a much more accurate demonstration of the performance differences we'll see in real world usage of loop device functionality... -Dave.
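One way to see the mutability Dave is describing is to watch a file's extent mapping before and after a reflink clone plus an overwrite on a COW-capable filesystem. The small userspace program below is a hypothetical demo written for that purpose, not something from the thread; note that the "physical" address FIEMAP reports is whatever the filesystem chooses to expose (as pointed out later in the thread, btrfs does not report raw device block numbers), but it is enough to observe the mapping move.

```c
/*
 * Hypothetical demo: print a file's first mapped extent via FIEMAP so
 * you can observe it change after a reflink clone plus an overwrite.
 * Illustration only; error handling is kept minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Room for the header plus a single extent record. */
	fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;		/* whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush so the map is current */
	fm->fm_extent_count = 1;	/* just the first extent */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	if (fm->fm_mapped_extents)
		printf("logical 0x%llx -> physical 0x%llx (len 0x%llx, flags 0x%x)\n",
		       (unsigned long long)fm->fm_extents[0].fe_logical,
		       (unsigned long long)fm->fm_extents[0].fe_physical,
		       (unsigned long long)fm->fm_extents[0].fe_length,
		       (unsigned)fm->fm_extents[0].fe_flags);
	else
		printf("no extents mapped (hole / sparse)\n");

	free(fm);
	close(fd);
	return 0;
}
```

Typical use on a reflink-enabled filesystem such as XFS or btrfs: run it on an image file, take a snapshot with "cp --reflink=always", overwrite the start of the original, and run it again; the first extent's reported address changes even though nothing touched any loop-level mapping.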
On Tue, 4 Mar 2025, Dave Chinner wrote: > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > > What should I use instead of bmap? Is fiemap exported for use in the > > > > kernel? > > > > > > You can't do an ahead of time mapping. It's a broken concept. > > > > Swapfile does ahead of time mapping. And I just looked at what swapfile > > does and copied the logic into dm-loop. If swapfile is not broken, how > > could dm-loop be broken? > > Swap files cannot be accessed/modified by user code once the > swapfile is activated. See all the IS_SWAPFILE() checked throughout > the VFS and filesystem code. > > Swap files must be fully allocated (i.e. not sparse), nor contan > shared extents. This is required so that writes to the swapfile do > not require block allocation which would change the mapping... > > Hence we explicitly prevent modification of the underlying file > mapping once a swapfile is owned and mapped by the kernel as a > swapfile. > > That's not how loop devices/image files work - we actually rely on > them being: > > a) sparse; and > b) the mapping being mutable via direct access to the loop file > whilst there is an active mounted filesystem on that loop file. > > and so every IO needs to be mapped through the filesystem at > submission time. > > The reason for a) is obvious: we don't need to allocate space for > the filesystem so it's effectively thin provisioned. Also, fstrim on > the mounted loop device can punch out unused space in the mounted > filesytsem. > > The reason for b) is less obvious: snapshots via file cloning, > deduplication via extent sharing. > > The clone operaiton is an atomic modification of the underlying file > mapping, which then triggers COW on future writes to those mappings, > which causes the mapping to the change at write IO time. > > IOWs, the whole concept that there is a "static mapping" for a loop > device image file for the life of the image file is fundamentally > flawed. I'm not trying to break existing loop. But some users don't use COW filesystems, some users use fully provisioned files, some users don't need to write to a file when it is being mapped - and for them dm-loop would be viable alternative because of better performance. The Android people concluded that loop is too slow and rather than using loop they want to map a file using a table with dm-linear targets over the image of the host filesystem. So, they are already doing what dm-loop is doing. Mikulas
On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > Swapfile does ahead of time mapping. It does. But it is: a) protected against modification by the S_SWAPFILE flag and checked for full allocation first b) something we want to get rid of because even with the above it is rather problematic > And I just looked at what swapfile > does and copied the logic into dm-loop. If swapfile is not broken, how > could dm-loop be broken? As said above, swapfile works around the brokenness in ways that you can't. And just blindly copying old code without understanding it is never a good idea. > > > > Would Jens Axboe agree to merge the dm-loop logic into the existing loop > > > driver? > > > > What logic? > > The ahead-of-time mapping. As said multiple times you can't do that. The block mapping is file system private information.
On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote: > I'm not trying to break existing loop. You are f***cking breaking file system semantics. Stop it now. > The Android people concluded that loop is too slow and rather than using > loop they want to map a file using a table with dm-linear targets over the > image of the host filesystem. So, they are already doing what dm-loop is > doing. I've never seen bug reports from "The Android people". But maybe they just need to stop pointlessly using loop devices?
On Mon, Mar 03, 2025 at 06:34:45PM +0100, Heinz Mauelshagen wrote: > It is, unless you ensure that the pre-allocated and fully provisioned file is > immutable. Which you can't. > > > The inode fiemap() operation would avoid issues relative to outdated > > bmap(). > > > > No, using that for I/O is even worse. > > > > Are you saying it imposes more overhead than the VFS APIs? No, but the reporting in fiemap is even less defined than the already overly loose bmap. fiemap for btrfs for example doesn't report actual block numbers.
On Tue, Mar 04, 2025 at 03:50:36PM +0100, Heinz Mauelshagen wrote: > It seems that for a loop target to add the respective mapped-device semantics, the > way to go is to design it using the VFS API to get it upstream. > > That won't help the Android use case requested by Google, so they'll have to > stick to their multi-segment linear mapping approach to achieve their > performance goals, unless they want to support dm-loop out-of-tree to make their > life easier on immutable file backings. What are "the performance goals" and what is in the way of achieving them? This thread has massively blown up but no one has even tried to explain what the thing tries to address.
On Tue, Mar 04, 2025 at 06:01:04PM +0100, Heinz Mauelshagen wrote: > > As Mikulas shared earlier: > > The Android people concluded that loop is too slow and rather than using > > loop they want to map a file using a table with dm-linear targets over the > > image of the host filesystem. So, they are already doing what dm-loop is > > doing. Which again is completely handwavy. What is the workload? What are the issues with the workload? Where is the time spent making it slower? e.g. the extent mapping code in this patch is a lot less efficient than the btree based extent lookup in XFS. If it is faster, it is not the extent lookup but something else, and you need to figure out what it is.
On Tue, Mar 04, 2025 at 12:18:04PM +0100, Mikulas Patocka wrote: > > > On Tue, 4 Mar 2025, Dave Chinner wrote: > > > On Mon, Mar 03, 2025 at 10:03:42PM +0100, Mikulas Patocka wrote: > > > > > > > > > On Mon, 3 Mar 2025, Christoph Hellwig wrote: > > > > > > > On Mon, Mar 03, 2025 at 05:16:48PM +0100, Mikulas Patocka wrote: > > > > > What should I use instead of bmap? Is fiemap exported for use in the > > > > > kernel? > > > > > > > > You can't do an ahead of time mapping. It's a broken concept. > > > > > > Swapfile does ahead of time mapping. And I just looked at what swapfile > > > does and copied the logic into dm-loop. If swapfile is not broken, how > > > could dm-loop be broken? > > > > Swap files cannot be accessed/modified by user code once the > > swapfile is activated. See all the IS_SWAPFILE() checked throughout > > the VFS and filesystem code. > > > > Swap files must be fully allocated (i.e. not sparse), nor contan > > shared extents. This is required so that writes to the swapfile do > > not require block allocation which would change the mapping... > > > > Hence we explicitly prevent modification of the underlying file > > mapping once a swapfile is owned and mapped by the kernel as a > > swapfile. > > > > That's not how loop devices/image files work - we actually rely on > > them being: > > > > a) sparse; and > > b) the mapping being mutable via direct access to the loop file > > whilst there is an active mounted filesystem on that loop file. > > > > and so every IO needs to be mapped through the filesystem at > > submission time. > > > > The reason for a) is obvious: we don't need to allocate space for > > the filesystem so it's effectively thin provisioned. Also, fstrim on > > the mounted loop device can punch out unused space in the mounted > > filesytsem. > > > > The reason for b) is less obvious: snapshots via file cloning, > > deduplication via extent sharing. > > > > The clone operaiton is an atomic modification of the underlying file > > mapping, which then triggers COW on future writes to those mappings, > > which causes the mapping to the change at write IO time. > > > > IOWs, the whole concept that there is a "static mapping" for a loop > > device image file for the life of the image file is fundamentally > > flawed. > > I'm not trying to break existing loop. I didn't say you were. I said the concept that dm-loop is based on is fundamentally flawed and that your benchmark setup does not reflect real world usage of loop devices. > But some users don't use COW filesystems, some users use fully provisioned > files, some users don't need to write to a file when it is being mapped - > and for them dm-loop would be viable alternative because of better > performance. Nothing has changed since 2008 when this "fast file mapping" thing was first proposed and dm-loop made it's first appearance in this thread: https://lore.kernel.org/linux-fsdevel/20080109085231.GE6650@kernel.dk/ Let me quote Christoph's response to Jen's proposed static mapping for the loop device patch back in 2008: | And the way this is done is simply broken. It means you have to get | rid of things like delayed or unwritten hands beforehand, it'll be | a complete pain for COW or non-block backed filesystems. | | The right way to do this is to allow direct I/O from kernel sources | where the filesystem is in-charge of submitting the actual I/O after | the pages are handed to it. I think Peter Zijlstra has been looking | into something like that for swap over nfs. 
Jens also said this about dm-loop in that thread: } Why oh why does dm always insist to reinvent everything? That's bad } enough in itself, but on top of that most of the extra stuff ends up } being essentially unmaintained. } } If we instead improve loop, everyone wins. } } Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit } outside your own sandbox. You - personally - were also told directly by Jens back then that dm-loop's approach simply does not work for filesystems that move blocks around. i.e. it isn't a viable approach. Nothing has changed - it still isn't a viable approach for loopback devices for the same reasons it wasn't viable in 2008. > The Android people concluded that loop is too slow and rather than using > loop they want to map a file using a table with dm-linear targets over the > image of the host filesystem. So, they are already doing what dm-loop is > doing. I don't care if a downstream kernel is doing something stupid with their kernels. Where are the bug reports about the loop device being slow and the analysis that indicates that it is unfixable? The fact is that AIO+DIO through filesystems like XFS performs generally within 1-2% of the underlying block device capabilities. Hence if there's a problem with loop device performance, it isn't in the backing file IO submission path. Find out why loop device AIO+DIO is slow for the workload you are testing and fix that. This way everyone who already uses loop devices benefits (as Jens said in 2008), and the Android folk can get rid of their hacky mapping setup.... -Dave.
> I didn't say you were. I said the concept that dm-loop is based on > is fundamentally flawed and that your benchmark setup does not > reflect real world usage of loop devices. > Where are the bug reports about the loop device being slow and the > analysis that indicates that it is unfixable? So, I did benchmarks on an enterprise nvme drive (SAMSUNG MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. synchronous I/O: fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw raw block device: READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec ext4/loop/ext4: READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec xfs/loop/xfs: READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec ext4/dm-loop/ext4: READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec xfs/dm-loop/xfs: READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec asynchronous I/O: fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw raw block device: READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec ext4/loop/ext4: READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec xfs/loop/xfs: READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec ext4/dm-loop/ext4: READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec xfs/dm-loop/xfs: READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec Mikulas
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote: > > I didn't say you were. I said the concept that dm-loop is based on > > is fundamentally flawed and that your benchmark setup does not > > reflect real world usage of loop devices. > > > Where are the bug reports about the loop device being slow and the > > analysis that indicates that it is unfixable? > > So, I did benchmarks on an enterprise nvme drive (SAMSUNG > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow. Are you running the loop device in directio mode? The default is to use buffered io, which wastes pagecache /and/ sometimes trips dirty limits throttling. The loopdev tests in fstests get noticeably faster if I force directio mode. --D > synchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec > xfs/loop/xfs: > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec > ext4/dm-loop/ext4: > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec > > asynchronous I/O: > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw > raw block device: > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec > ext4/loop/ext4: > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec > xfs/loop/xfs: > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec > ext4/dm-loop/ext4: > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec > xfs/dm-loop/xfs: > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), 
io=11.8GiB (12.7GB), run=10001-10001msec > > Mikulas > >
On Fri, 7 Mar 2025, Darrick J. Wong wrote:

> On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > > I didn't say you were. I said the concept that dm-loop is based on
> > > is fundamentally flawed and that your benchmark setup does not
> > > reflect real world usage of loop devices.
> >
> > > Where are the bug reports about the loop device being slow and the
> > > analysis that indicates that it is unfixable?
> >
> > So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> > MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> > --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> Are you running the loop device in directio mode? The default is to use
> buffered io, which wastes pagecache /and/ sometimes trips dirty limits
> throttling. The loopdev tests in fstests get noticeably faster if I
> force directio mode.

Yes, I am. I set it up with:

losetup --direct-io=on /dev/loop0 /mnt/test/l
mount -t xfs /dev/loop0 /mnt/test2

I double checked it and I got the same results.

Mikulas

> --D
>
> > synchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> > WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> > WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> > xfs/loop/xfs:
> > READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> > WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> > ext4/dm-loop/ext4:
> > READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> > WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> > WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
> >
> > asynchronous I/O:
> > fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> > raw block device:
> > READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> > ext4/loop/ext4:
> > READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> > WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> > xfs/loop/xfs:
> > READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> > WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> > ext4/dm-loop/ext4:
> > READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> > xfs/dm-loop/xfs:
> > READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> > WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> >
> > Mikulas
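As a side note, direct-io mode can be verified from userspace; a minimal sketch, assuming the same /dev/loop0 device as in the setup above:

# losetup --list /dev/loop0        # the DIO column should read 1
# cat /sys/block/loop0/loop/dio    # 1 means the loop driver issues direct I/O to the backing file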
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=274MiB/s (288MB/s), 274MiB/s-274MiB/s (288MB/s-288MB/s), io=2743MiB (2877MB), run=10001-10001msec
> WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=2747MiB (2880MB), run=10001-10001msec
> xfs/loop/xfs:
> READ: bw=276MiB/s (289MB/s), 276MiB/s-276MiB/s (289MB/s-289MB/s), io=2761MiB (2896MB), run=10002-10002msec
> WRITE: bw=276MiB/s (290MB/s), 276MiB/s-276MiB/s (290MB/s-290MB/s), io=2765MiB (2899MB), run=10002-10002msec
> ext4/dm-loop/ext4:
> READ: bw=1189MiB/s (1247MB/s), 1189MiB/s-1189MiB/s (1247MB/s-1247MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> WRITE: bw=1190MiB/s (1248MB/s), 1190MiB/s-1190MiB/s (1248MB/s-1248MB/s), io=11.6GiB (12.5GB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=1209MiB/s (1268MB/s), 1209MiB/s-1209MiB/s (1268MB/s-1268MB/s), io=11.8GiB (12.7GB), run=10001-10001msec
> WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=11.8GiB (12.7GB), run=10001-10001msec

Hi Mikulas,

Please try the following patchset:

https://lore.kernel.org/linux-block/20250308162312.1640828-1-ming.lei@redhat.com/

which tries to handle I/O directly in the current context via NOWAIT, and adds
multi-queue (MQ) support to loop, since your test runs 12 I/O jobs. With this
change, I can observe similar performance numbers for the raw block device and
for loop/xfs over mq-virtio-scsi & nvme in my test VM.

1) first try a single queue: `modprobe loop`
2) then try MQ: `modprobe loop nr_hw_queues=4`

If that still doesn't close the gap, please provide the fio logs for both the
raw block device and loop/xfs; they may give some clue about the big performance
difference.

Thanks,
Ming
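Spelled out, the suggested re-test would look roughly like the sketch below. It assumes Ming's patchset is applied (the nr_hw_queues module parameter comes from that series, not from the mainline loop driver) and reuses the backing file and mount points from the benchmark above:

# modprobe -r loop                    # only works while no loop device is in use
# modprobe loop nr_hw_queues=4        # second run; the first run uses plain 'modprobe loop'
# losetup --direct-io=on /dev/loop0 /mnt/test/l
# mount -t xfs /dev/loop0 /mnt/test2
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l --name=job --rw=rw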
On Fri, Mar 07, 2025 at 04:21:58PM +0100, Mikulas Patocka wrote:
> > I didn't say you were. I said the concept that dm-loop is based on
> > is fundamentally flawed and that your benchmark setup does not
> > reflect real world usage of loop devices.
>
> > Where are the bug reports about the loop device being slow and the
> > analysis that indicates that it is unfixable?
>
> So, I did benchmarks on an enterprise nvme drive (SAMSUNG
> MZPLJ1T6HBJR-00007). I stacked ext4/loop/ext4, xfs/loop/xfs (using losetup
> --direct-io=on), ext4/dm-loop/ext4 and xfs/dm-loop/xfs. And loop is slow.
>
> synchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=psync --iodepth=1 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3985MiB (4179MB), run=10001-10001msec
> WRITE: bw=399MiB/s (418MB/s), 399MiB/s-399MiB/s (418MB/s-418MB/s), io=3990MiB (4184MB), run=10001-10001msec
> ext4/loop/ext4:
> READ: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2232MiB (2341MB), run=10002-10002msec
> WRITE: bw=223MiB/s (234MB/s), 223MiB/s-223MiB/s (234MB/s-234MB/s), io=2231MiB (2339MB), run=10002-10002msec
> xfs/loop/xfs:
> READ: bw=220MiB/s (230MB/s), 220MiB/s-220MiB/s (230MB/s-230MB/s), io=2196MiB (2303MB), run=10001-10001msec
> WRITE: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=2193MiB (2300MB), run=10001-10001msec
> ext4/dm-loop/ext4:
> READ: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3383MiB (3547MB), run=10002-10002msec
> WRITE: bw=338MiB/s (355MB/s), 338MiB/s-338MiB/s (355MB/s-355MB/s), io=3385MiB (3549MB), run=10002-10002msec
> xfs/dm-loop/xfs:
> READ: bw=375MiB/s (393MB/s), 375MiB/s-375MiB/s (393MB/s-393MB/s), io=3752MiB (3934MB), run=10002-10002msec
> WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=3756MiB (3938MB), run=10002-10002msec
>
> asynchronous I/O:
> fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/test2/l -name=job --rw=rw
> raw block device:
> READ: bw=1246MiB/s (1306MB/s), 1246MiB/s-1246MiB/s (1306MB/s-1306MB/s), io=12.2GiB (13.1GB), run=10001-10001msec
> WRITE: bw=1247MiB/s (1308MB/s), 1247MiB/s-1247MiB/s (1308MB/s-1308MB/s), io=12.2GiB (13.1GB), run=10001-10001msec

BTW, "raw block device" is supposed to mean xfs or ext4 on top of the raw block
device, right? If not, please provide test data for that case too, so that it
becomes a fair comparison, because there should be lock contention for
filesystem write I/O on the same file.

Thanks,
Ming
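For clarity, the baseline Ming is asking for would run fio against a file on a filesystem created directly on the NVMe device, so that it pays the same filesystem locking cost as the loop and dm-loop cases; a sketch with placeholder device and mount-point names:

# mkfs.xfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/raw
# dd if=/dev/zero of=/mnt/raw/testfile bs=1M count=20480
# fio --direct=1 --bs=4k --runtime=10 --time_based --numjobs=12 --ioengine=libaio --iodepth=16 --group_reporting=1 --filename=/mnt/raw/testfile --name=job --rw=rw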
Index: linux-2.6/drivers/md/Kconfig
===================================================================
--- linux-2.6.orig/drivers/md/Kconfig	2025-03-02 21:09:46.000000000 +0100
+++ linux-2.6/drivers/md/Kconfig	2025-03-02 21:09:46.000000000 +0100
@@ -646,6 +646,15 @@ config DM_ZONED
 
 	  If unsure, say N.
 
+config DM_LOOP
+	tristate "Loop target"
+	depends on BLK_DEV_DM
+	help
+	  This device-mapper target allows you to treat a regular file as
+	  a block device.
+
+	  If unsure, say N.
+
 config DM_AUDIT
 	bool "DM audit events"
 	depends on BLK_DEV_DM
Index: linux-2.6/drivers/md/Makefile
===================================================================
--- linux-2.6.orig/drivers/md/Makefile	2025-03-02 21:09:46.000000000 +0100
+++ linux-2.6/drivers/md/Makefile	2025-03-02 21:09:46.000000000 +0100
@@ -79,6 +79,7 @@ obj-$(CONFIG_DM_CLONE) += dm-clone.o
 obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY) += dm-integrity.o
 obj-$(CONFIG_DM_ZONED) += dm-zoned.o
+obj-$(CONFIG_DM_LOOP) += dm-loop.o
 obj-$(CONFIG_DM_WRITECACHE) += dm-writecache.o
 obj-$(CONFIG_SECURITY_LOADPIN_VERITY) += dm-verity-loadpin.o
 
Index: linux-2.6/drivers/md/dm-loop.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/drivers/md/dm-loop.c	2025-03-02 21:41:36.000000000 +0100
@@ -0,0 +1,404 @@
+#include <linux/device-mapper.h>
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+
+#define DM_MSG_PREFIX "loop"
+
+struct loop_c {
+	struct file *filp;
+	char *path;
+	loff_t offset;
+	struct block_device *bdev;
+	struct inode *inode;
+	unsigned blkbits;
+	bool read_only;
+	sector_t mapped_sectors;
+
+	sector_t nr_extents;
+	struct dm_loop_extent *map;
+};
+
+struct dm_loop_extent {
+	sector_t start;	/* start sector in mapped device */
+	sector_t to;	/* start sector on target device */
+	sector_t len;	/* length in sectors */
+};
+
+static sector_t blk2sect(struct loop_c *lc, blkcnt_t block)
+{
+	return block << (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk(struct loop_c *lc, sector_t sector)
+{
+	return sector >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk_roundup(struct loop_c *lc, sector_t sector)
+{
+	return (sector + (1 << (lc->blkbits - SECTOR_SHIFT)) - 1) >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+static struct dm_loop_extent *extent_binary_lookup(struct loop_c *lc, sector_t sector)
+{
+	ssize_t first = 0;
+	ssize_t last = lc->nr_extents - 1;
+
+	while (first <= last) {
+		ssize_t middle = (first + last) >> 1;
+		struct dm_loop_extent *ex = &lc->map[middle];
+		if (sector < ex->start) {
+			last = middle - 1;
+			continue;
+		}
+		if (likely(sector >= ex->start + ex->len)) {
+			first = middle + 1;
+			continue;
+		}
+		return ex;
+	}
+
+	return NULL;
+}
+
+static int loop_map(struct dm_target *ti, struct bio *bio)
+{
+	struct loop_c *lc = ti->private;
+	sector_t sector, len;
+	struct dm_loop_extent *ex;
+
+	sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+	ex = extent_binary_lookup(lc, sector);
+	if (!ex)
+		return DM_MAPIO_KILL;
+
+	bio_set_dev(bio, lc->bdev);
+	bio->bi_iter.bi_sector = ex->to + (sector - ex->start);
+	len = ex->len - (sector - ex->start);
+	if (len < bio_sectors(bio))
+		dm_accept_partial_bio(bio, len);
+
+	if (unlikely(!ex->to)) {
+		if (unlikely(!lc->read_only))
+			return DM_MAPIO_KILL;
+		zero_fill_bio(bio);
+		bio_endio(bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	return DM_MAPIO_REMAPPED;
+}
+
+static void loop_status(struct dm_target *ti, status_type_t type,
+			unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct loop_c *lc = ti->private;
+	size_t sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = '\0';
+		break;
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s %llu", lc->path, lc->offset);
+		break;
+	case STATUSTYPE_IMA:
+		DMEMIT_TARGET_NAME_VERSION(ti->type);
+		DMEMIT(",file_name=%s,offset=%llu;", lc->path, lc->offset);
+		break;
+	}
+}
+
+static int loop_iterate_devices(struct dm_target *ti,
+				iterate_devices_callout_fn fn, void *data)
+{
+	return 0;
+}
+
+static int extent_range(struct loop_c *lc,
+			sector_t logical_blk, sector_t last_blk,
+			sector_t *begin_blk, sector_t *nr_blks,
+			char **error)
+{
+	sector_t dist = 0, phys_blk, probe_blk = logical_blk;
+	int r;
+
+	/* Find beginning physical block of extent starting at logical_blk. */
+	*begin_blk = probe_blk;
+	*nr_blks = 0;
+	r = bmap(lc->inode, begin_blk);
+	if (r) {
+		*error = "bmap failed";
+		return r;
+	}
+	if (!*begin_blk) {
+		if (!lc->read_only) {
+			*error = "File is sparse";
+			return -ENXIO;
+		}
+	}
+
+	for (phys_blk = *begin_blk; phys_blk == *begin_blk + dist; dist += !!*begin_blk) {
+		cond_resched();
+
+		(*nr_blks)++;
+		if (++probe_blk > last_blk)
+			break;
+
+		phys_blk = probe_blk;
+		r = bmap(lc->inode, &phys_blk);
+		if (r) {
+			*error = "bmap failed";
+			return r;
+		}
+		if (unlikely(!phys_blk)) {
+			if (!lc->read_only) {
+				*error = "File is sparse";
+				return -ENXIO;
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int loop_extents(struct loop_c *lc, sector_t *nr_extents,
+			struct dm_loop_extent *map, char **error)
+{
+	int r;
+	sector_t start = 0;
+	sector_t nr_blks, begin_blk;
+	sector_t after_last_blk = sec2blk_roundup(lc,
+			(lc->mapped_sectors + (lc->offset >> 9)));
+	sector_t logical_blk = sec2blk(lc, lc->offset >> 9);
+
+	*nr_extents = 0;
+
+	/* for each block in the mapped region */
+	while (logical_blk < after_last_blk) {
+		r = extent_range(lc, logical_blk, after_last_blk - 1,
+				 &begin_blk, &nr_blks, error);
+
+		if (unlikely(r))
+			return r;
+
+		if (map) {
+			if (*nr_extents >= lc->nr_extents) {
+				*error = "The file changed while mapping it";
+				return -EBUSY;
+			}
+			map[*nr_extents].start = start;
+			map[*nr_extents].to = blk2sect(lc, begin_blk);
+			map[*nr_extents].len = blk2sect(lc, nr_blks);
+		}
+
+		(*nr_extents)++;
+		start += blk2sect(lc, nr_blks);
+		logical_blk += nr_blks;
+	}
+
+	if (*nr_extents != lc->nr_extents) {
+		*error = "The file changed while mapping it";
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+static int setup_block_map(struct loop_c *lc, struct dm_target *ti)
+{
+	int r;
+	sector_t n_file_sectors, offset_sector, nr_extents_tmp;
+
+	if (!S_ISREG(lc->inode->i_mode) || !lc->inode->i_sb || !lc->inode->i_sb->s_bdev) {
+		ti->error = "The file is not a regular file";
+		return -ENXIO;
+	}
+
+	lc->bdev = lc->inode->i_sb->s_bdev;
+	lc->blkbits = lc->inode->i_blkbits;
+	n_file_sectors = i_size_read(lc->inode) >> lc->blkbits << (lc->blkbits - 9);
+
+	if (lc->offset & ((1 << lc->blkbits) - 1)) {
+		ti->error = "Unaligned offset";
+		return -EINVAL;
+	}
+	offset_sector = lc->offset >> 9;
+	if (offset_sector >= n_file_sectors) {
+		ti->error = "Offset is greater than file size";
+		return -EINVAL;
+	}
+	if (ti->len > (n_file_sectors - offset_sector)) {
+		ti->error = "Target maps area after file end";
+		return -EINVAL;
+	}
+	lc->mapped_sectors = ti->len >> (lc->blkbits - 9) << (lc->blkbits - 9);
+
+	r = loop_extents(lc, &lc->nr_extents, NULL, &ti->error);
+	if (r)
+		return r;
+
+	if (lc->nr_extents != (size_t)lc->nr_extents) {
+		ti->error = "Too many extents";
+		return -EOVERFLOW;
+	}
+
+	lc->map = kvcalloc(lc->nr_extents, sizeof(struct dm_loop_extent), GFP_KERNEL);
+	if (!lc->map) {
+		ti->error = "Failed to allocate extent map";
+		return -ENOMEM;
+	}
+
+	r = loop_extents(lc, &nr_extents_tmp, lc->map, &ti->error);
+	if (r)
+		return r;
+
+	return 0;
+}
+
+static int loop_lock_inode(struct inode *inode)
+{
+	int r;
+	inode_lock(inode);
+	if (IS_SWAPFILE(inode)) {
+		inode_unlock(inode);
+		return -EBUSY;
+	}
+	inode->i_flags |= S_SWAPFILE;
+	r = inode_drain_writes(inode);
+	if (r) {
+		inode->i_flags &= ~S_SWAPFILE;
+		inode_unlock(inode);
+		return r;
+	}
+	inode_unlock(inode);
+	return 0;
+}
+
+static void loop_unlock_inode(struct inode *inode)
+{
+	inode_lock(inode);
+	inode->i_flags &= ~S_SWAPFILE;
+	inode_unlock(inode);
+}
+
+static void loop_free(struct loop_c *lc)
+{
+	if (!lc)
+		return;
+	if (!IS_ERR_OR_NULL(lc->filp)) {
+		loop_unlock_inode(lc->inode);
+		filp_close(lc->filp, NULL);
+	}
+	kvfree(lc->map);
+	kfree(lc->path);
+	kfree(lc);
+}
+
+static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct loop_c *lc = NULL;
+	int r;
+	char dummy;
+
+	if (argc != 2) {
+		r = -EINVAL;
+		ti->error = "Invalid number of arguments";
+		goto err;
+	}
+
+	lc = kzalloc(sizeof(*lc), GFP_KERNEL);
+	if (!lc) {
+		r = -ENOMEM;
+		ti->error = "Cannot allocate loop context";
+		goto err;
+	}
+	ti->private = lc;
+
+	lc->path = kstrdup(argv[0], GFP_KERNEL);
+	if (!lc->path) {
+		r = -ENOMEM;
+		ti->error = "Cannot allocate loop path";
+		goto err;
+	}
+
+	if (sscanf(argv[1], "%lld%c", &lc->offset, &dummy) != 1) {
+		r = -EINVAL;
+		ti->error = "Invalid file offset";
+		goto err;
+	}
+
+	lc->read_only = !(dm_table_get_mode(ti->table) & FMODE_WRITE);
+
+	lc->filp = filp_open(lc->path, lc->read_only ? O_RDONLY : O_RDWR, 0);
+	if (IS_ERR(lc->filp)) {
+		r = PTR_ERR(lc->filp);
+		ti->error = "Could not open backing file";
+		goto err;
+	}
+
+	lc->inode = lc->filp->f_mapping->host;
+
+	r = loop_lock_inode(lc->inode);
+	if (r) {
+		ti->error = "Could not lock inode";
+		goto err;
+	}
+
+	r = setup_block_map(lc, ti);
+	if (r) {
+		goto err;
+	}
+
+	return 0;
+
+err:
+	loop_free(lc);
+	return r;
+}
+
+static void loop_dtr(struct dm_target *ti)
+{
+	struct loop_c *lc = ti->private;
+	loop_free(lc);
+}
+
+static struct target_type loop_target = {
+	.name = "loop",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr = loop_ctr,
+	.dtr = loop_dtr,
+	.map = loop_map,
+	.status = loop_status,
+	.iterate_devices = loop_iterate_devices,
+};
+
+static int __init dm_loop_init(void)
+{
+	int r;
+
+	r = dm_register_target(&loop_target);
+	if (r < 0) {
+		DMERR("register failed %d", r);
+		goto err_target;
+	}
+
+	return 0;
+
+err_target:
+	return r;
+}
+
+static void __exit dm_loop_exit(void)
+{
+	dm_unregister_target(&loop_target);
+}
+
+module_init(dm_loop_init);
+module_exit(dm_loop_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
+MODULE_DESCRIPTION("device-mapper loop target");
This is the dm-loop target - a replacement for the regular loop driver with
better performance. The dm-loop target builds a map of the file in the
constructor and it just remaps bios according to this map.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/Kconfig   |    9 +
 drivers/md/Makefile  |    1
 drivers/md/dm-loop.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 414 insertions(+)
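For reference, a minimal sketch of activating the target from the patch above; the backing file name is a placeholder, and the last table argument is the byte offset into the file, which per loop_ctr()/setup_block_map() must be a multiple of the backing filesystem's block size:

# FILE=/mnt/test/backing.img
# OFFSET=1048576                                        # 1 MiB, block-aligned
# SECTORS=$(( ($(stat -c %s "$FILE") - OFFSET) / 512 )) # mapped length in 512-byte sectors
# dmsetup create loop0 --table "0 $SECTORS loop $FILE $OFFSET"

Note that a writable table requires the file to be fully allocated: the constructor refuses sparse files ("File is sparse") unless the device is read-only, in which case holes are served as zeroes.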