Weird loop device behavior in 6.15-rc1?

Message ID 20250407233007.GG6266@frogsfrogsfrogs (mailing list archive)
State New

Commit Message

Darrick J. Wong April 7, 2025, 11:30 p.m. UTC
Hey Christoph,

I have a ... weird test setup where loop devices have directio enabled
unconditionally on a system with 4k-lba disks, and now that I pulled
down 6.15-rc1, I see failures in xfs/259:

Comments

Christoph Hellwig April 8, 2025, 6:44 a.m. UTC | #1
On Mon, Apr 07, 2025 at 04:30:07PM -0700, Darrick J. Wong wrote:
> Hey Christoph,
> 
> I have a ... weird test setup where loop devices have directio enabled
> unconditionally on a system with 4k-lba disks, and now that I pulled
> down 6.15-rc1, I see failures in xfs/259:

Hmm, this works just fine for me with a 4k LBA size NVMe setup on -rc1
with latest xfsprogs and xfstests for-next.

> Then trying to format an XFS filesystem fails:

That on the other hand I can reproduce locally.

> I think there's a bug in the loop driver where changing
> LO_FLAGS_DIRECT_IO doesn't actually try to change the O_DIRECT state of
> the underlying lo->lo_backing_file->f_flags.  So I can try to set a 2k
> block size on the loop dev, which turns off LO_FLAGS_DIRECT_IO but the
> fd is still open O_DIRECT so the writes fail.  But this isn't a
> regression in -rc1, so maybe this is the expected behavior?

This does look old, but also I would not call it expected.

> On 6.15-rc1, you actually /can/ change the sector size:

> But the backing file still has O_DIRECT on, so formatting fails:

Looks like fixing the silent failure to change the sector size exposed
the bug where O_DIRECT is never cleared.

I'll cook up a patch to clear O_DIRECT.

> Thoughts?
> 
> --D
> 
> (/me notes that xfs/801 is failing across the board, and I don't know
> what changed about THPs in tmpfs but clearly something's corrupting
> memory.)

That one always failed for me because it uses a sysfs-dump tool that
simply doesn't seem to exist.

Darrick J. Wong April 8, 2025, 2:27 p.m. UTC | #2
On Mon, Apr 07, 2025 at 11:44:34PM -0700, Christoph Hellwig wrote:
> On Mon, Apr 07, 2025 at 04:30:07PM -0700, Darrick J. Wong wrote:
> > Hey Christoph,
> > 
> > I have a ... weird test setup where loop devices have directio enabled
> > unconditionally on a system with 4k-lba disks, and now that I pulled
> > down 6.15-rc1, I see failures in xfs/259:
> 
> Hmm, this works just fine for me with a 4k LBA size NVMe setup on -rc1
> with latest xfsprogs and xfstests for-next.

Yeah, fstests works fine with loop in buffered mode. :)

I /think/ the (separate) problem is that prior to 6.15, the logical and
physical block sizes of the loop device would be set to 512b in
direct-io=on mode.  Now it's set to either the STATX_DIOALIGN size or
the underlying bdev's logical block size, which means 4k.  mkfs.xfs runs
BLKSSZGET, compares that to the -b size= argument, and rejects when
blocksize < loop device logical block size.
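
A minimal shell sketch of that comparison, for anyone who wants to check a
device by hand.  check_blocksize and check_dev_blocksize are hypothetical
helper names, not mkfs.xfs code; the only real tool used is
"blockdev --getss", which reports the logical sector size.

```sh
#!/bin/sh
# Hypothetical helpers illustrating the block size vs. sector size check.

# Usage: check_blocksize <fs block size> <device logical sector size>
# Prints a verdict and returns 1 if the block size is too small.
check_blocksize() {
        if [ "$1" -lt "$2" ]; then
                echo "block size $1 cannot be smaller than sector size $2"
                return 1
        fi
        echo "block size $1 is ok for sector size $2"
}

# Usage: check_dev_blocksize <fs block size> <block device>
# Looks up the logical sector size of a real device first.
check_dev_blocksize() {
        check_blocksize "$1" "$(blockdev --getss "$2")"
}
```

With the loop device reporting 4096, "check_dev_blocksize 2048 /dev/loop1"
would print the same complaint that shows up in 259.out.bad.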

I don't know if the loop device should behave more like 512e drives,
where we advertise a (potentially slow) 512b LBA and a 4k physical block
size?  Or just stick with the way things are right now because 512e mode
sucks.  The first means I don't have to patch fstests here, the second
means I'd have to adjust _create_loop to take a desired blocksize and
try to set up the loopdev with that block size, even if it means
dropping dio mode.
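
A rough sketch of what the second option could look like.  loop_mode_for is
a made-up name and is not the existing _create_loop helper; it assumes that
direct I/O is only usable when the requested block size is at least the
backing store's minimum DIO alignment, and that both are powers of two.

```sh
#!/bin/sh
# Hypothetical helper, not part of fstests: pick losetup options for a
# requested loop block size, dropping dio mode when the block size is
# smaller than the backing store's minimum direct I/O alignment.

# Usage: loop_mode_for <desired block size> <backing dio alignment>
loop_mode_for() {
        if [ "$1" -ge "$2" ]; then
                echo "--direct-io=on --sector-size $1"
        else
                echo "--direct-io=off --sector-size $1"
        fi
}
```

A caller would then do something like
"losetup -f --show $(loop_mode_for 2048 4096) /mnt/a", which ends up in
buffered mode with a 2k sector size on a 4k-lba disk.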

> > Then trying to format an XFS filesystem fails:
> 
> That on the other hand I can reproduce locally.
> 
> > I think there's a bug in the loop driver where changing
> > LO_FLAGS_DIRECT_IO doesn't actually try to change the O_DIRECT state of
> > the underlying lo->lo_backing_file->f_flags.  So I can try to set a 2k
> > block size on the loop dev, which turns off LO_FLAGS_DIRECT_IO but the
> > fd is still open O_DIRECT so the writes fail.  But this isn't a
> > regression in -rc1, so maybe this is the expected behavior?
> 
> This does look old, but also I would not call it expected.
> 
> > On 6.15-rc1, you actually /can/ change the sector size:
> 
> > But the backing file still has O_DIRECT on, so formatting fails:
> 
> Looks like fixing the silent failure to change the sector size exposed
> the bug where O_DIRECT is never cleared.
> 
> I'll cook up a patch to clear O_DIRECT.

Thanks!

> > Thoughts?
> > 
> > --D
> > 
> > (/me notes that xfs/801 is failing across the board, and I don't know
> > what changed about THPs in tmpfs but clearly something's corrupting
> > memory.)
> 
> That one always failed for me because it uses a sysfs-dump tool that
> simply doesn't seem to exist.

Ooops.  I meant to take that out before committing and left it in.
Maybe I should just paste a stupid version into xfs/801:

$ sysfs-dump /sys/block/sda/queue/
/sys/block/sda/queue//add_random = 0
/sys/block/sda/queue//chunk_sectors : 0
/sys/block/sda/queue//dax : 0
/sys/block/sda/queue//discard_granularity : 512
/sys/block/sda/queue//discard_max_bytes = 0
/sys/block/sda/queue//discard_max_hw_bytes : 0
/sys/block/sda/queue//discard_zeroes_data : 0
/sys/block/sda/queue//dma_alignment : 511
<etc>

Full version below.

--D

#!/bin/sh

# Dump a sysfs directory as a key: value stream.

WANT_NEWLINE=

print_help() {
        echo "Usage: $0 [-n] files..."
        exit 1
}

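dump() {
        # Only regular files are dumped; anything else is silently skipped.
        test -f "$1" || return
        # Separator encodes access: '?' unreadable, ':' readable, '=' writable.
        SEP='?'
        test -r "$1" && SEP=':'
        stat -c '%A' "$1" | grep -q 'w' && SEP='='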
        if [ -n "${WANT_NEWLINE}" ]; then
                echo "$1 ${SEP}"
                cat "$1" 2> /dev/null
        else
                echo "$1 ${SEP} $(cat "$1" 2> /dev/null)"
        fi
}

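for i in "$@"; do
        if [ "$i" = "--help" ]; then
                print_help
        fi
        # -n prints each value on its own line below the "path SEP" header.
        if [ "$i" = "-n" ]; then
                WANT_NEWLINE=1
                continue
        fi
        if [ -d "$i" ]; then
                # Directories are dumped one level deep.
                for x in "$i/"*; do
                        dump "$x"
                done
        else
                dump "$i"
        fi
done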

exit 0

Patch

--- /run/fstests/bin/tests/xfs/259.out	2025-01-30 10:00:17.074275830 -0800
+++ /var/tmp/fstests/xfs/259.out.bad	2025-04-06 19:34:56.587315490 -0700
@@ -1,17 +1,428 @@ 
 QA output created by 259
 Trying to make (4TB - 4096B) long xfs, block size 4096
 Trying to make (4TB - 4096B) long xfs, block size 2048
+block size 2048 cannot be smaller than sector size 4096

I think bugs in the loop driver's O_DIRECT handling are responsible for
this.  I boiled it down to the key commands so that you don't have to
set up a bunch of hardware.

First, some setup:

# losetup -f --direct-io=on --sector-size 4096 --show /dev/sda
# mkfs.xfs -f /dev/sda
# mount /dev/sda /mnt

On 6.14 and 6.15-rc1, if I create the loop device with directio mode
and try to turn it off so that I can reduce the block size:

# truncate -s 30g /mnt/a
# losetup --direct-io=on -f --show /mnt/a
/dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 1 4096

# losetup --sector-size 2048 /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 0 2048

# losetup --direct-io=off /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 0 2048

# losetup --sector-size 2048 /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 0 2048

(yes, that is a weird sequence)

Then trying to format an XFS filesystem fails:

# mkfs.xfs -f /dev/loop1 -b size=2k -K
meta-data=/dev/loop1             isize=512    agcount=4, agsize=393216 blks
         =                       sectsz=2048  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=2048   blocks=1572864, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=2048   blocks=32768, version=2
         =                       sectsz=2048  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
mkfs.xfs: pwrite failed: Input/output error

I think there's a bug in the loop driver where changing
LO_FLAGS_DIRECT_IO doesn't actually try to change the O_DIRECT state of
the underlying lo->lo_backing_file->f_flags.  So I can try to set a 2k
block size on the loop dev, which turns off LO_FLAGS_DIRECT_IO but the
fd is still open O_DIRECT so the writes fail.  But this isn't a
regression in -rc1, so maybe this is the expected behavior?

/me notes that going the opposite direction (turning directio on after
creation) fails:

# losetup --direct-io=on /dev/loop2
losetup: /dev/loop2: set direct io failed: Invalid argument

At least the loopdev stays in buffered mode and mkfs runs fine.

But now let's try passing in "0" to losetup --sector-size to set the
sector size to the minimum value.  On 6.14, this happens:

# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 1 4096

# losetup --sector-size 0 /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 1 4096

# losetup --direct-io=off /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 0 4096

# losetup --sector-size 0 /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 1 4096

Notice that the loopdev never changes block size, so mkfs fails:

# mkfs.xfs -f /dev/loop1 -b size=2k -K
block size 2048 cannot be smaller than sector size 4096

On 6.15-rc1, you actually /can/ change the sector size:

# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 1 4096
# losetup --sector-size 0 /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 1 4096
# losetup --direct-io=off /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 0 4096
# losetup --sector-size 0 /dev/loop1
# losetup --list --raw /dev/loop1
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 0 /mnt/a 0 512

But the backing file still has O_DIRECT on, so formatting fails:

# mkfs.xfs -f /dev/loop1 -b size=2k -K
meta-data=/dev/loop1             isize=512    agcount=4, agsize=393216 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=2048   blocks=1572864, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=2048   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
mkfs.xfs: pwrite failed: Input/output error

So I /think/ what's going on here is that LOOP_SET_DIRECT_IO should be
trying to set/clear O_DIRECT on the backing file.
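
For reference, a small sketch for watching the loop-side state while
reproducing this.  loop_state is a hypothetical helper; it assumes the
loop/dio and queue/logical_block_size sysfs attributes are present for the
device.  Note that it only shows the loop driver's own view (the same DIO
and LOG-SEC values that losetup --list prints), not the f_flags of the
backing file, which is exactly the part that seems to get out of sync.

```sh
#!/bin/sh
# Hypothetical helper: report the loop driver's dio flag and logical block
# size from sysfs.  The optional second argument overrides the sysfs root,
# which is handy for dry runs against a fake directory tree.

# Usage: loop_state <loop name, e.g. loop1> [sysfs root, default /sys]
loop_state() {
        root="${2:-/sys}"
        dio=$(cat "$root/block/$1/loop/dio" 2> /dev/null) || dio='?'
        lbs=$(cat "$root/block/$1/queue/logical_block_size" 2> /dev/null) || lbs='?'
        echo "$1 dio=$dio logical_block_size=$lbs"
}
```

Running "loop_state loop1" before and after each losetup call makes it
easier to line up the flag changes with the point where pwrite starts
failing.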

I chose to tag you because I think commit f4774e92aab85d ("loop: take
the file system minimum dio alignment into account") is what caused the
change in the block size setting behavior.  I also see similar messages
in xfs/078 and maybe xfs/432 if I turn on zoned=1 mode.

Though as I mentioned, I think the problems with the loop driver go
deeper than that commit. :/

Thoughts?

--D

(/me notes that xfs/801 is failing across the board, and I don't know
what changed about THPs in tmpfs but clearly something's corrupting
memory.)