Message ID | 20220802060323.khfjqwhwwbpjbegb@shindev (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mkfs.btrfs failure on zoned block devices with DUP metadata profile | expand |
On 2022/8/2 14:03, Shinichiro Kawasaki wrote: > Hello Qu, > > I found that the latest version of mkfs.btrfs fails on zoned block device when > DUP profile is specified for metadata. The failure depends on btrfs-progs > version and the trigger commit is 2a93728391a1 ("btrfs-progs: use > write_data_to_disk() to replace write_extent_to_disk()"). After some > investigation, it looks for me that this is a common problem for zoned and non- > zoned block devices. Here I describe findings below. Your comments will be > appreciated. > > On the failure, mkfs.btrfs reports "Error writing" to the device [1]. At this > time, kernel reports an I/O error, which indicates "Unaligned write command > error" to sequential write required zones. With strace, I observed that the mkfs > writes data to same sector twice. This repeated writes were observed regardless > whether the device is zoned or non-zoned. It does not cause an error for non- > zoned devices, but it causes the I/O error for zoned devices. > > Using strace, I compared the write to a non-zoned disk image file before and > after the commit 2a93728391a1. It shows that the commit introduced the repeated > writes to same sectors [2]. Is the repeated write expected after the commit? > If not, common fix is needed for zoned and non-zoned devices. Thanks for the detailed report. To me, the repeated write is definitely not an expected behavior. Although a quick glance doesn't show me much hint, I'll definitely look into the case. My initial guess is write_data_to_disk() has something wrong related to the mirror iteration, but no concrete conclusion yet. I'll definitely keep you updated since this incorrect behavior is now affecting zoned device now, and it will be treated as an emergency. Thanks, Qu > > > [1] > > (/dev/nullb0 is a zoned block device with no conventional zone) > $ sudo ./mkfs.btrfs /dev/nullb0 > btrfs-progs v5.18.1 > See http://btrfs.wiki.kernel.org for more information. > > Zoned: /dev/nullb0: host-managed device detected, setting zoned feature > Resetting device zones /dev/nullb0 (64 zones) ... > NOTE: several default settings have changed in version 5.15, please make sure > this does not affect your deployments: > - DUP for metadata (-m dup) > - enabled no-holes (-O no-holes) > - enabled free-space-tree (-R free-space-tree) > > Error writing to device 5 > kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret` triggered, value 5 > ./mkfs.btrfs(__commit_transaction+0x197)[0x4382a6] > ./mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43840a] > ./mkfs.btrfs(main+0x171c)[0x40d64e] > /lib64/libc.so.6(+0x29550)[0x7f5108629550] > /lib64/libc.so.6(__libc_start_main+0x89)[0x7f5108629609] > ./mkfs.btrfs(_start+0x25)[0x40b965] > Aborted > > > [2] > > $ truncate -s 1G /tmp/btrfs.img > $ git reset --hard 2a937283~; make -j > $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup /tmp/btrfs.img |& grep btrfs.img > /tmp/pre.log > $ git reset --hard 2a937283; make -j > $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup /tmp/btrfs.img |& grep btrfs.img > /tmp/post.log > $ diff -u /tmp/pre.log /tmp/post.log > --- /tmp/pre.log 2022-08-02 14:12:19.670688371 +0900 > +++ /tmp/post.log 2022-08-02 14:12:47.019382686 +0900 > @@ -32,36 +32,66 @@ > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 5357568) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 4096, 65536) = 4096 > pwrite64(5</tmp/btrfs.img>, ""..., 4096, 67108864) = 4096 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384 > +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384 > pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384 > .... >
On 2022/8/2 14:17, Qu Wenruo wrote: > > > On 2022/8/2 14:03, Shinichiro Kawasaki wrote: >> Hello Qu, >> >> I found that the latest version of mkfs.btrfs fails on zoned block >> device when >> DUP profile is specified for metadata. The failure depends on btrfs-progs >> version and the trigger commit is 2a93728391a1 ("btrfs-progs: use >> write_data_to_disk() to replace write_extent_to_disk()"). After some >> investigation, it looks for me that this is a common problem for zoned >> and non- >> zoned block devices. Here I describe findings below. Your comments >> will be >> appreciated. >> >> On the failure, mkfs.btrfs reports "Error writing" to the device [1]. >> At this >> time, kernel reports an I/O error, which indicates "Unaligned write >> command >> error" to sequential write required zones. With strace, I observed >> that the mkfs >> writes data to same sector twice. This repeated writes were observed >> regardless >> whether the device is zoned or non-zoned. It does not cause an error >> for non- >> zoned devices, but it causes the I/O error for zoned devices. >> >> Using strace, I compared the write to a non-zoned disk image file >> before and >> after the commit 2a93728391a1. It shows that the commit introduced the >> repeated >> writes to same sectors [2]. Is the repeated write expected after the >> commit? >> If not, common fix is needed for zoned and non-zoned devices. > > Thanks for the detailed report. > > To me, the repeated write is definitely not an expected behavior. > > Although a quick glance doesn't show me much hint, I'll definitely look > into the case. > > My initial guess is write_data_to_disk() has something wrong related to > the mirror iteration, but no concrete conclusion yet. > > I'll definitely keep you updated since this incorrect behavior is now > affecting zoned device now, and it will be treated as an emergency. Pinned down the root cause. It's the write_data_to_disk behavior not correct for multi-copy profiles. I just added some extra debug output: For SINGLE profile everything is fine: write_data_to_disk: logical=5341184 len=16384 mirror=1 btrfs_pwrite: fd=5 offset=5341184 count=16384 write_data_to_disk: logical=5357568 len=16384 mirror=1 btrfs_pwrite: fd=5 offset=5357568 count=16384 But the problem is related to profiles with multiple copies (DUP included). From higher layer, we will call write_data_to_disk() for EACH mirror: write_data_to_disk: logical=30408704 len=16384 mirror=1 btrfs_pwrite: fd=5 offset=38797312 count=16384 btrfs_pwrite: fd=5 offset=307232768 count=16384 write_data_to_disk: logical=30408704 len=16384 mirror=2 btrfs_pwrite: fd=5 offset=38797312 count=16384 btrfs_pwrite: fd=5 offset=307232768 count=16384 But inside write_data_to_disk() we call btrfs_map_block() which will always return the full mirrors for write. The fix can go either way: - Don't try to iterate mirrors when calling write_data_to_disk(). Just pass mirror = 0 into write_data_to_disk. - Make write_data_to_disk() to really respect the mirror number Thus not really passing WRITE for btrfs_map_block(). I'll evaluate above solutions then send out a proper fix. Thanks, Qu > > Thanks, > Qu >> >> >> [1] >> >> (/dev/nullb0 is a zoned block device with no conventional zone) >> $ sudo ./mkfs.btrfs /dev/nullb0 >> btrfs-progs v5.18.1 >> See http://btrfs.wiki.kernel.org for more information. >> >> Zoned: /dev/nullb0: host-managed device detected, setting zoned feature >> Resetting device zones /dev/nullb0 (64 zones) ... >> NOTE: several default settings have changed in version 5.15, please >> make sure >> this does not affect your deployments: >> - DUP for metadata (-m dup) >> - enabled no-holes (-O no-holes) >> - enabled free-space-tree (-R free-space-tree) >> >> Error writing to device 5 >> kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret` >> triggered, value 5 >> ./mkfs.btrfs(__commit_transaction+0x197)[0x4382a6] >> ./mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43840a] >> ./mkfs.btrfs(main+0x171c)[0x40d64e] >> /lib64/libc.so.6(+0x29550)[0x7f5108629550] >> /lib64/libc.so.6(__libc_start_main+0x89)[0x7f5108629609] >> ./mkfs.btrfs(_start+0x25)[0x40b965] >> Aborted >> >> >> [2] >> >> $ truncate -s 1G /tmp/btrfs.img >> $ git reset --hard 2a937283~; make -j >> $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup >> /tmp/btrfs.img |& grep btrfs.img > /tmp/pre.log >> $ git reset --hard 2a937283; make -j >> $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup >> /tmp/btrfs.img |& grep btrfs.img > /tmp/post.log >> $ diff -u /tmp/pre.log /tmp/post.log >> --- /tmp/pre.log 2022-08-02 14:12:19.670688371 +0900 >> +++ /tmp/post.log 2022-08-02 14:12:47.019382686 +0900 >> @@ -32,36 +32,66 @@ >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 5357568) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 4096, 65536) = 4096 >> pwrite64(5</tmp/btrfs.img>, ""..., 4096, 67108864) = 4096 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384 >> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384 >> pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384 >> .... >>
--- /tmp/pre.log 2022-08-02 14:12:19.670688371 +0900 +++ /tmp/post.log 2022-08-02 14:12:47.019382686 +0900 @@ -32,36 +32,66 @@ pwrite64(5</tmp/btrfs.img>, ""..., 16384, 5357568) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 4096, 65536) = 4096 pwrite64(5</tmp/btrfs.img>, ""..., 4096, 67108864) = 4096 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384 +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384 .... -- Shin'ichiro Kawasaki