[v7,00/39] btrfs: zoned block device support

Message ID	20200911123259.3782926-1-naohiro.aota@wdc.com (mailing list archive)
Headers	show Return-Path: <SRS0=84Zi=CU=vger.kernel.org=linux-fsdevel-owner@kernel.org> IronPort-SDR: hb7Jblh2teSMvw9pz8aRAoDSBw9603tZetVW3Xaw9flPA7vJ0D03xTxlNzEtYRRC1wdq0dQ3WG /S6LzuPptkaME/PNIWjlQKfrNzEg7Zj2sjRcyMM9i5IVafhEhzs6NX//pCizzp216/LuhAqEol +WXaoVz8NCVi27Djfke85iOqiJ61RpZlkzy71/wL5pwhg832aq2U8SIQlII7rheSw89SMpsHI8 OYDagca+oumjGGcteaPXaUr7I5bQ4y1egerj0m8S5LEc1NwYGsZ5IwNGxDWJ4P3axCriIrrh8P uAA= IronPort-SDR: 7SDwQOag7UTpm6SvOc17nKoo4/rq/1m9PyutabZDXV+lNOF0na3cdMX/J2O1KYqmvX++/TNqc+ qicGK2bpi13A== IronPort-SDR: 3df3d5wXzEqdg69RgkJjKAwD7K5BJnBR5W+ad4gevxKx/MiesOut8iPmLX0ZGrckHsWuPDyFAg dY38m1LGtjzg== WDCIronportException: Internal From: Naohiro Aota <naohiro.aota@wdc.com> To: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Hannes Reinecke <hare@suse.com>, linux-fsdevel@vger.kernel.org, Naohiro Aota <naohiro.aota@wdc.com> Subject: [PATCH v7 00/39] btrfs: zoned block device support Date: Fri, 11 Sep 2020 21:32:20 +0900 Message-Id: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	btrfs: zoned block device support \| expand [v7,00/39] btrfs: zoned block device support [v7,01/39] btrfs: introduce ZONED feature flag [v7,02/39] btrfs: Get zone information of zoned block devices [v7,03/39] btrfs: Check and enable ZONED mode [v7,04/39] btrfs: introduce max_zone_append_size [v7,05/39] btrfs: disallow space_cache in ZONED mode [v7,06/39] btrfs: disallow NODATACOW in ZONED mode [v7,07/39] btrfs: disable fallocate in ZONED mode [v7,08/39] btrfs: disallow mixed-bg in ZONED mode [v7,09/39] btrfs: disallow inode_cache in ZONED mode [v7,10/39] btrfs: implement log-structured superblock for ZONED mode [v7,11/39] btrfs: implement zoned chunk allocator [v7,12/39] btrfs: verify device extent is aligned to zone [v7,13/39] btrfs: load zone's alloction offset [v7,14/39] btrfs: emulate write pointer for conventional zones [v7,15/39] btrfs: track unusable bytes for zones [v7,16/39] btrfs: do sequential extent allocation in ZONED mode [v7,17/39] btrfs: reset zones of unused block groups [v7,18/39] btrfs: redirty released extent buffers in ZONED mode [v7,19/39] btrfs: limit bio size under max_zone_append_size [v7,20/39] btrfs: limit ordered extent size to max_zone_append_size [v7,21/39] btrfs: extend btrfs_rmap_block for specifying a device [v7,22/39] btrfs: use ZONE_APPEND write for ZONED btrfs [v7,23/39] btrfs: handle REQ_OP_ZONE_APPEND as writing [v7,24/39] btrfs: enable zone append writing for direct IO [v7,25/39] btrfs: introduce dedicated data write path for ZONED mode [v7,26/39] btrfs: serialize meta IOs on ZONED mode [v7,27/39] btrfs: wait existing extents before truncating [v7,28/39] btrfs: avoid async metadata checksum on ZONED mode [v7,29/39] btrfs: mark block groups to copy for device-replace [v7,30/39] btrfs: implement cloning for ZONED device-replace [v7,31/39] btrfs: implement copying for ZONED device-replace [v7,32/39] btrfs: support dev-replace in ZONED mode [v7,33/39] btrfs: enable relocation in ZONED mode [v7,34/39] btrfs: relocate block group to repair IO failure in ZONED [v7,35/39] btrfs: split alloc_log_tree() [v7,36/39] btrfs: extend zoned allocator to use dedicated tree-log block group [v7,37/39] btrfs: serialize log transaction on ZONED mode [v7,38/39] btrfs: reorder log node allocation [v7,39/39] btrfs: enable to mount ZONED incompat flag

Naohiro Aota Sept. 11, 2020, 12:32 p.m. UTC

This series adds zoned block device support to btrfs.

Changes from v6:
 - Use zone append write command instead of normal write command
   - Bio issuing order does not matter
   - No need to use lock anymore
   - Can use asynchronous checksum
 - Removed RAID support for now
 - Rename HMZONED to ZONED
 - Split some patches
 - Rebased on kdave/for-5.9-rc3 + iomap direct IO

Userland series will follow.

This version of ZONED btrfs switched from normal write command to zone
append write command. You do not need to specify LBA (at the write pointer)
to write for zone append write command. Instead, you only select a zone to
write with its start LBA. Then the device (NVMe ZNS), or the emulation of
zone append command in the sd driver in the case of SAS or SATA HDDs,
automatically writes the data at the write pointer position and return the
written LBA as a command reply.

The benefit of using the zone append write command is that write command
issuing order does not matter. So, we can eliminate block group lock and
utilize asynchronous checksum, which can reorder the IOs.

Eliminating the lock improves performance. In particular, on a workload
with massive competing to the same zone [1], we observed 36% performance
improvement compared to normal write.

[1] Fio running 16 jobs with 4KB random writes for 5 minutes

However, there are some limitations. We cannot use the non-SINGLE profile.
Supporting non-SINGLE profile with zone append writing is not trivial. For
example, in the DUP profile, we send a zone append writing IO to two zones
on a device. The device reply with written LBAs for the IOs. If the offsets
of the returned addresses from the beginning of the zone are different,
then it results in different logical addresses.

For the same reason, we cannot issue multiple IOs for one ordered extent.
Thus, the size of an ordered extent is limited under max_zone_append_size.
This limitation will cause fragmentation and increased usage of metadata.
In the future, we can add optimization to merge ordered extents after
end_bio.

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional and accepting random writes or sequential and requiring
that writes be issued in LBA order from each zone write pointer
position. This patch series ensures that the sequential write
constraint of sequential zones is respected while fundamentally not
changing BtrFS block and I/O management for block stored in
conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation
of blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunks. To do so, an allocation
pointer is added to block groups and used as the allocation hint.  The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation
regardless of the chunk usage.

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This
restriction matches the existing constraint that all chunks in a block
group must have the same size.

* Enabling tree-log

The tree-log feature does not work on ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which is different timing than a global transaction commit. As a result,
both writing tree-log blocks and writing other metadata blocks become
non-sequential writes which ZONED mode must avoid.

This series introduces a dedicated block group for tree-log blocks to
create two metadata writing streams, one for tree-log blocks and the
other for metadata blocks. As a result, each write stream can now be
written to devices separately and sequentially.

* Log-structured superblock

Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the
zone.

This series implements superblock log writing. It uses two zones as a
circular buffer to write updated superblocks. Once the first zone is filled
up, start writing into the second zone. The first zone will be reset once
both zones are filled. We can determine the postion of the latest
superblock by reading the write pointer information from a device.

* Patch series organization

Patch 1 introduces the ZONED incompatible feature flag to indicate that the
btrfs volume was formatted for use on zoned block devices.

Patches 2 to 4 implement functions to gather information on the zones of
the device (zones type, write pointer position, and max_zone_append_size).

Patches 5 to 9 disable features which are not compatible with the
sequential write constraints of zoned block devices. These includes
space_cache, NODATACOW, fallocate, MIXED_BG and inode cache.

Patch 10 implements the log-structured superblock writing.

Patches 11 and 12 tweak the device extent allocation for ZONED mode and add
verification to check if a device extent is properly aligned to zones.

Patches 13 to 16 implements sequential block allocator for ZONED mode.

Patch 17 implement a zone reset for unused block groups.

Patches 18 to 28 implement the writing path for several types of IO
(non-compressed data, direct IO, and metadata). These include re-dirtying
once-freed metadata blocks to prevent write holes.

Patches 29 to 38 tweak some btrfs features work with ZONED mode. These
include device-replace, relocation, repairing IO error, and tree-log.

Finally, patch 28 adds the ZONED feature to the list of supported features.

* Patch testing note

This series is based on kdave/for-5.5.

** Zone-aware util-linux

Since the log-structured superblock feature changed the location of
superblock magic, the current util-linux (libblkid) cannot detect ZONED
btrfs anymore. You need to apply a to-be posted patch to util-linux to make
it "zone aware".

** Testing device

You need devices with zone append writing command support to run ZONED
btrfs.

Other than real devices, null_blk supports zone append write command. You
can use memory backed null_blk to run the test on it. Following script
creates 12800 MB /dev/nullb0.

    sysfs=/sys/kernel/config/nullb/nullb0
    size=12800 # MB
    
    # drop nullb0
    if [[ -d $sysfs ]]; then
            echo 0 > "${sysfs}"/power
            rmdir $sysfs
    fi
    lsmod | grep -q null_blk && rmmod null_blk
    modprobe null_blk nr_devices=0
    
    mkdir "${sysfs}"
    
    echo "${size}" > "${sysfs}"/size
    echo 1 > "${sysfs}"/zoned
    echo 0 > "${sysfs}"/zone_nr_conv
    echo 1 > "${sysfs}"/memory_backed
    
    echo 1 > "${sysfs}"/power
    udevadm settle

Zoned SCSI devices such as SMR HDDs or scsi_debug also support the zone
append command as an emulated command within the SCSI sd driver. This
emulation is completely transparent to the user and provides the same
semantic as a NVMe ZNS native drive support.

Also, there is a qemu patch available to enable NVMe ZNS device.

** xfstests

We ran xfstests on ZONED btrfs, and, if we omit some cases that are known
to fail currently, all test cases pass.

Cases that can be ignored:
1) failing also with the regular btrfs on regular devices,
2) trying to test fallocate feature without testing with
   "_require_xfs_io_command "falloc"",
3) trying to test incompatible features for ZONED btrfs (e.g. RAID5/6)
4) trying to use incompatible setup for ZONED btrfs (e.g. dm-linear not
   aligned to zone boundary, swap)
5) trying to create a file system with too small size, (we require at least
   9 zones to initiate a ZONED btrfs)
6) dropping original MKFS_OPTIONS ("-O zoned"), so it cannot create ZONED
   btrfs (btrfs/003)
7) having ENOSPC which incurred by larger metadata block group size

I will send a patch series for xfstests to handle these cases (2-6)
properly.

Also, you need to apply the following patch if you run xfstests with
tcmu devices. xfstests btrfs/003 failed to "_devmgt_add" after
"_devmgt_remove" without this patch.

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

v6 https://lore.kernel.org/linux-btrfs/20191213040915.3502922-1-naohiro.aota@wdc.com/
v5 https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/
v4 https://lwn.net/Articles/797061/
v3 https://lore.kernel.org/linux-btrfs/20190808093038.4163421-1-naohiro.aota@wdc.com/
v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/

Changelog
v6:
 - Use bitmap helpers (Johannes)
 - Code cleanup (Johannes)
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v5:
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v4:
 - Move memory allcation of zone informattion out of
   btrfs_get_dev_zones() (Anand)
 - Add disabled features table in commit log (Anand)
 - Ensure "max_chunk_size >= devs_min * data_stripes * zone_size"
v3:
 - Serialize allocation and submit_bio instead of bio buffering in
   btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 -Disable MIXED_BG
 - Device extent that cover super block position is banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker use vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensure that region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)


Naohiro Aota (39):
  btrfs: introduce ZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable ZONED mode
  btrfs: introduce max_zone_append_size
  btrfs: disallow space_cache in ZONED mode
  btrfs: disallow NODATACOW in ZONED mode
  btrfs: disable fallocate in ZONED mode
  btrfs: disallow mixed-bg in ZONED mode
  btrfs: disallow inode_cache in ZONED mode
  btrfs: implement log-structured superblock for ZONED mode
  btrfs: implement zoned chunk allocator
  btrfs: verify device extent is aligned to zone
  btrfs: load zone's alloction offset
  btrfs: emulate write pointer for conventional zones
  btrfs: track unusable bytes for zones
  btrfs: do sequential extent allocation in ZONED mode
  btrfs: reset zones of unused block groups
  btrfs: redirty released extent buffers in ZONED mode
  btrfs: limit bio size under max_zone_append_size
  btrfs: limit ordered extent size to max_zone_append_size
  btrfs: extend btrfs_rmap_block for specifying a device
  btrfs: use ZONE_APPEND write for ZONED btrfs
  btrfs: handle REQ_OP_ZONE_APPEND as writing
  btrfs: enable zone append writing for direct IO
  btrfs: introduce dedicated data write path for ZONED mode
  btrfs: serialize meta IOs on ZONED mode
  btrfs: wait existing extents before truncating
  btrfs: avoid async metadata checksum on ZONED mode
  btrfs: mark block groups to copy for device-replace
  btrfs: implement cloning for ZONED device-replace
  btrfs: implement copying for ZONED device-replace
  btrfs: support dev-replace in ZONED mode
  btrfs: enable relocation in ZONED mode
  btrfs: relocate block group to repair IO failure in ZONED
  btrfs: split alloc_log_tree()
  btrfs: extend zoned allocator to use dedicated tree-log block group
  btrfs: serialize log transaction on ZONED mode
  btrfs: reorder log node allocation
  btrfs: enable to mount ZONED incompat flag

 fs/btrfs/Makefile           |    1 +
 fs/btrfs/block-group.c      |   85 ++-
 fs/btrfs/block-group.h      |   13 +
 fs/btrfs/ctree.h            |   12 +-
 fs/btrfs/dev-replace.c      |  187 +++++
 fs/btrfs/dev-replace.h      |    3 +
 fs/btrfs/disk-io.c          |   82 ++-
 fs/btrfs/disk-io.h          |    2 +
 fs/btrfs/extent-tree.c      |  206 +++++-
 fs/btrfs/extent_io.c        |   48 +-
 fs/btrfs/extent_io.h        |    2 +
 fs/btrfs/file.c             |    4 +
 fs/btrfs/free-space-cache.c |   58 ++
 fs/btrfs/free-space-cache.h |    4 +
 fs/btrfs/inode.c            |   72 +-
 fs/btrfs/ioctl.c            |    3 +
 fs/btrfs/ordered-data.c     |    3 +
 fs/btrfs/ordered-data.h     |    4 +
 fs/btrfs/relocation.c       |   35 +-
 fs/btrfs/scrub.c            |  145 ++++
 fs/btrfs/space-info.c       |   13 +-
 fs/btrfs/space-info.h       |    4 +-
 fs/btrfs/super.c            |   13 +-
 fs/btrfs/sysfs.c            |    4 +
 fs/btrfs/transaction.c      |   10 +
 fs/btrfs/transaction.h      |    3 +
 fs/btrfs/tree-log.c         |   50 +-
 fs/btrfs/volumes.c          |  307 ++++++++-
 fs/btrfs/volumes.h          |    7 +
 fs/btrfs/zoned.c            | 1279 +++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h            |  281 ++++++++
 include/uapi/linux/btrfs.h  |    1 +
 32 files changed, 2864 insertions(+), 77 deletions(-)
 create mode 100644 fs/btrfs/zoned.c
 create mode 100644 fs/btrfs/zoned.h

David Sterba Sept. 15, 2020, 8:09 a.m. UTC | #1

On Fri, Sep 11, 2020 at 09:32:20PM +0900, Naohiro Aota wrote:
> Changelog
> v6:
>  - Use bitmap helpers (Johannes)
>  - Code cleanup (Johannes)
>  - Rebased on kdave/for-5.5
>  - Enable the tree-log feature.
>  - Treat conventional zones as sequential zones, so we can now allow
>    mixed allocation of conventional zone and sequential write required
>    zone to construct a block group.
>  - Implement log-structured superblock
>    - No need for one conventional zone at the beginning of a device.
>  - Fix deadlock of direct IO writing
>  - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
>  - Fix leak of zone_info (Johannes)

I did a quick check to see if the patchset passes the default VM tests
and there's use after free short after the fstests start. No zoned
devices or such. I had to fix some conflicts when rebasing on misc-next
but I tried to base it on the last iomap-dio patch ("btrfs: switch to
iomap for direct IO"), same result so it's something in the zoned
patches.

The reported pointer 0x6b6b6b6b6d1918eb contains the use-after-free
poison (0x6b) (CONFIG_PAGE_POISONING=y).

MKFS_OPTIONS  -- -f -K --csum xxhash /dev/vdb
MOUNT_OPTIONS -- -o discard /dev/vdb /tmp/scratch

[   19.928844] BTRFS: device fsid 663b9b17-ab02-4021-92bf-dc24c3e4351a devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (398)
[   19.974176] BTRFS info (device vdb): turning on sync discard
[   19.977035] BTRFS info (device vdb): disk space caching is enabled
[   19.979965] BTRFS info (device vdb): has skinny extents
[   19.982586] BTRFS info (device vdb): flagging fs with big metadata feature
[   19.991757] BTRFS info (device vdb): checking UUID tree
[   20.002740] general protection fault, probably for non-canonical address 0x6b6b6b6b6d1918eb: 0000 [#1] SMP
[   20.006949] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5-default+ #1260
[   20.009746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[   20.013566] RIP: 0010:end_bio_extent_writepage+0x51/0x180 [btrfs]
[   20.015873] Code: 8b 54 24 2c 41 8b 74 24 30 41 89 c5 48 c1 e2 04 49 03 54 24 60 03 72 0c 89 f0 81 e6 ff 0f 00 00 c1 e8 0c 48 c1 e0 06 48 03 02 <48> 8b 50 20 48 8b 40 18 48 c1 e2 0c 48 8b 38 48 01 d6 4c 89 e2 e8
[   20.022029] RSP: 0018:ffff93fc800b8e20 EFLAGS: 00010206
[   20.023615] RAX: 6b6b6b6b6d1918eb RBX: ffff8eceaf630378 RCX: 0000000000000020
[   20.025565] RDX: ffff8eceb297da00 RSI: 0000000000000b6b RDI: 0000000000000000
[   20.027289] RBP: ffff93fc800b8e80 R08: 00000004a84169d3 R09: 0000000000000000
[   20.028634] R10: 0000000000000000 R11: 0000000000000246 R12: ffff8eceaf630378
[   20.030569] R13: 0000000000000000 R14: 0000000000010000 R15: ffff8ecebaf62280
[   20.032607] FS:  0000000000000000(0000) GS:ffff8ecebd800000(0000) knlGS:0000000000000000
[   20.035314] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.036641] CR2: 00007f9bc3b5a990 CR3: 000000005fa3a005 CR4: 0000000000170ea0
[   20.038523] Call Trace:
[   20.039527]  <IRQ>
[   20.040378]  ? sched_clock_cpu+0x15/0x130
[   20.041619]  ? bio_endio+0x120/0x2c0
[   20.042880]  btrfs_end_bio+0x83/0x130 [btrfs]
[   20.044172]  blk_update_request+0x230/0x710
[   20.045368]  blk_mq_end_request+0x1c/0x130
[   20.046594]  blk_done_softirq+0x9f/0xd0
[   20.047754]  __do_softirq+0x1eb/0x56c
[   20.048917]  asm_call_on_stack+0xf/0x20
[   20.050185]  </IRQ>
[   20.051035]  do_softirq_own_stack+0x52/0x60
[   20.052368]  irq_exit_rcu+0x98/0xb0
[   20.053533]  sysvec_call_function_single+0x43/0xa0
[   20.054887]  asm_sysvec_call_function_single+0x12/0x20
[   20.056545] RIP: 0010:default_idle+0x1d/0x20
[   20.057961] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 e8 26 4d 9e ff 8b 05 f0 0e 03 01 85 c0 7e 07 0f 00 2d 67 1d 46 00 fb f4 <c3> 66 90 0f 1f 44 00 00 65 48 8b 04 25 00 8e 01 00 f0 80 48 02 20
[   20.063377] RSP: 0018:ffff93fc80073ed0 EFLAGS: 00000246
[   20.064915] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   20.066743] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9b7a5c6a
[   20.068656] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
[   20.070530] R10: 0000000000000000 R11: 0000000000000046 R12: ffff8ecebd330040
[   20.072439] R13: ffff8ecebd330040 R14: 0000000000000000 R15: 0000000000000000
[   20.074408]  ? default_idle+0xa/0x20
[   20.075564]  default_idle_call+0x52/0x230
[   20.076849]  do_idle+0x201/0x210
[   20.077993]  cpu_startup_entry+0x19/0x1b
[   20.079301]  secondary_startup_64+0xa4/0xb0
[   20.080685] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
[   20.085376] ---[ end trace fb99f1646d553ef6 ]---
[   20.087627] RIP: 0010:end_bio_extent_writepage+0x51/0x180 [btrfs]
[   20.090290] Code: 8b 54 24 2c 41 8b 74 24 30 41 89 c5 48 c1 e2 04 49 03 54 24 60 03 72 0c 89 f0 81 e6 ff 0f 00 00 c1 e8 0c 48 c1 e0 06 48 03 02 <48> 8b 50 20 48 8b 40 18 48 c1 e2 0c 48 8b 38 48 01 d6 4c 89 e2 e8
[   20.096561] RSP: 0018:ffff93fc800b8e20 EFLAGS: 00010206
[   20.098468] RAX: 6b6b6b6b6d1918eb RBX: ffff8eceaf630378 RCX: 0000000000000020
[   20.100883] RDX: ffff8eceb297da00 RSI: 0000000000000b6b RDI: 0000000000000000
[   20.103124] RBP: ffff93fc800b8e80 R08: 00000004a84169d3 R09: 0000000000000000
[   20.105312] R10: 0000000000000000 R11: 0000000000000246 R12: ffff8eceaf630378
[   20.107553] R13: 0000000000000000 R14: 0000000000010000 R15: ffff8ecebaf62280
[   20.109464] FS:  0000000000000000(0000) GS:ffff8ecebd800000(0000) knlGS:0000000000000000
[   20.112286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.114048] CR2: 00007f9bc3b5a990 CR3: 000000005fa3a005 CR4: 0000000000170ea0
[   20.115946] Kernel panic - not syncing: Fatal exception in interrupt
[   20.117851] Kernel Offset: 0x1a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   20.120788] Rebooting in 90 seconds..

Johannes Thumshirn Sept. 16, 2020, 5:42 p.m. UTC | #2

On 15/09/2020 10:25, David Sterba wrote:
> On Fri, Sep 11, 2020 at 09:32:20PM +0900, Naohiro Aota wrote:
>> Changelog
>> v6:
>>  - Use bitmap helpers (Johannes)
>>  - Code cleanup (Johannes)
>>  - Rebased on kdave/for-5.5
>>  - Enable the tree-log feature.
>>  - Treat conventional zones as sequential zones, so we can now allow
>>    mixed allocation of conventional zone and sequential write required
>>    zone to construct a block group.
>>  - Implement log-structured superblock
>>    - No need for one conventional zone at the beginning of a device.
>>  - Fix deadlock of direct IO writing
>>  - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
>>  - Fix leak of zone_info (Johannes)
> 
> I did a quick check to see if the patchset passes the default VM tests
> and there's use after free short after the fstests start. No zoned
> devices or such. I had to fix some conflicts when rebasing on misc-next
> but I tried to base it on the last iomap-dio patch ("btrfs: switch to
> iomap for direct IO"), same result so it's something in the zoned
> patches.
> 
> The reported pointer 0x6b6b6b6b6d1918eb contains the use-after-free
> poison (0x6b) (CONFIG_PAGE_POISONING=y).
> 
> MKFS_OPTIONS  -- -f -K --csum xxhash /dev/vdb
> MOUNT_OPTIONS -- -o discard /dev/vdb /tmp/scratch

Hi David,

Can you check if this on top of the series fixes the issue? According
to Keith we can't call bio_iovec() from endio() as the iterator is already
advanced (see req_bio_endio()).


diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bda4e02b5eab..311956697682 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2753,10 +2753,6 @@ static void end_bio_extent_writepage(struct bio *bio)
        u64 end;
        struct bvec_iter_all iter_all;
 
-       btrfs_record_physical_zoned(bio_iovec(bio).bv_page->mapping->host,
-                                   page_offset(bio_iovec(bio).bv_page) + bio_iovec(bio).bv_offset,
-                                   bio);
-
        ASSERT(!bio_flagged(bio, BIO_CLONED));
        bio_for_each_segment_all(bvec, bio, iter_all) {
                struct page *page = bvec->bv_page;
@@ -2782,6 +2778,7 @@ static void end_bio_extent_writepage(struct bio *bio)
                start = page_offset(page);
                end = start + bvec->bv_offset + bvec->bv_len - 1;
 
+               btrfs_record_physical_zoned(inode, start, bio);
                end_extent_writepage(page, error, start, end);
                end_page_writeback(page);
        }
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 576f8e333f16..6fdb21029ea9 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1086,8 +1086,7 @@ void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
 {
        struct btrfs_ordered_extent *ordered;
        struct bio_vec bvec = bio_iovec(bio);
-       u64 physical = ((u64)bio->bi_iter.bi_sector << SECTOR_SHIFT) +
-               bvec.bv_offset;
+       u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
 
        if (bio_op(bio) != REQ_OP_ZONE_APPEND)
                return;

David Sterba Sept. 16, 2020, 7:46 p.m. UTC | #3

On Wed, Sep 16, 2020 at 05:42:50PM +0000, Johannes Thumshirn wrote:
> On 15/09/2020 10:25, David Sterba wrote:
> > On Fri, Sep 11, 2020 at 09:32:20PM +0900, Naohiro Aota wrote:
> >> Changelog
> >> v6:
> >>  - Use bitmap helpers (Johannes)
> >>  - Code cleanup (Johannes)
> >>  - Rebased on kdave/for-5.5
> >>  - Enable the tree-log feature.
> >>  - Treat conventional zones as sequential zones, so we can now allow
> >>    mixed allocation of conventional zone and sequential write required
> >>    zone to construct a block group.
> >>  - Implement log-structured superblock
> >>    - No need for one conventional zone at the beginning of a device.
> >>  - Fix deadlock of direct IO writing
> >>  - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
> >>  - Fix leak of zone_info (Johannes)
> > 
> > I did a quick check to see if the patchset passes the default VM tests
> > and there's use after free short after the fstests start. No zoned
> > devices or such. I had to fix some conflicts when rebasing on misc-next
> > but I tried to base it on the last iomap-dio patch ("btrfs: switch to
> > iomap for direct IO"), same result so it's something in the zoned
> > patches.
> > 
> > The reported pointer 0x6b6b6b6b6d1918eb contains the use-after-free
> > poison (0x6b) (CONFIG_PAGE_POISONING=y).
> > 
> > MKFS_OPTIONS  -- -f -K --csum xxhash /dev/vdb
> > MOUNT_OPTIONS -- -o discard /dev/vdb /tmp/scratch
> 
> Hi David,
> 
> Can you check if this on top of the series fixes the issue? According
> to Keith we can't call bio_iovec() from endio() as the iterator is already
> advanced (see req_bio_endio()).

It booted and is past the point it crashed before.

Johannes Thumshirn Sept. 16, 2020, 7:50 p.m. UTC | #4

On 16/09/2020 21:48, David Sterba wrote:
>> Can you check if this on top of the series fixes the issue? According
>> to Keith we can't call bio_iovec() from endio() as the iterator is already
>> advanced (see req_bio_endio()).
> It booted and is past the point it crashed before.
> 

Thanks a lot for confirming

Naohiro Aota Sept. 17, 2020, 5:40 a.m. UTC | #5

On Wed, Sep 16, 2020 at 05:42:50PM +0000, Johannes Thumshirn wrote:
>On 15/09/2020 10:25, David Sterba wrote:
>> On Fri, Sep 11, 2020 at 09:32:20PM +0900, Naohiro Aota wrote:
>>> Changelog
>>> v6:
>>>  - Use bitmap helpers (Johannes)
>>>  - Code cleanup (Johannes)
>>>  - Rebased on kdave/for-5.5
>>>  - Enable the tree-log feature.
>>>  - Treat conventional zones as sequential zones, so we can now allow
>>>    mixed allocation of conventional zone and sequential write required
>>>    zone to construct a block group.
>>>  - Implement log-structured superblock
>>>    - No need for one conventional zone at the beginning of a device.
>>>  - Fix deadlock of direct IO writing
>>>  - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
>>>  - Fix leak of zone_info (Johannes)
>>
>> I did a quick check to see if the patchset passes the default VM tests
>> and there's use after free short after the fstests start. No zoned
>> devices or such. I had to fix some conflicts when rebasing on misc-next
>> but I tried to base it on the last iomap-dio patch ("btrfs: switch to
>> iomap for direct IO"), same result so it's something in the zoned
>> patches.
>>
>> The reported pointer 0x6b6b6b6b6d1918eb contains the use-after-free
>> poison (0x6b) (CONFIG_PAGE_POISONING=y).
>>
>> MKFS_OPTIONS  -- -f -K --csum xxhash /dev/vdb
>> MOUNT_OPTIONS -- -o discard /dev/vdb /tmp/scratch
>
>Hi David,
>
>Can you check if this on top of the series fixes the issue? According
>to Keith we can't call bio_iovec() from endio() as the iterator is already
>advanced (see req_bio_endio()).
>
>

Thank you for fixing this.

>diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>index bda4e02b5eab..311956697682 100644
>--- a/fs/btrfs/extent_io.c
>+++ b/fs/btrfs/extent_io.c
>@@ -2753,10 +2753,6 @@ static void end_bio_extent_writepage(struct bio *bio)
>        u64 end;
>        struct bvec_iter_all iter_all;
>
>-       btrfs_record_physical_zoned(bio_iovec(bio).bv_page->mapping->host,
>-                                   page_offset(bio_iovec(bio).bv_page) + bio_iovec(bio).bv_offset,
>-                                   bio);
>-
>        ASSERT(!bio_flagged(bio, BIO_CLONED));
>        bio_for_each_segment_all(bvec, bio, iter_all) {
>                struct page *page = bvec->bv_page;
>@@ -2782,6 +2778,7 @@ static void end_bio_extent_writepage(struct bio *bio)
>                start = page_offset(page);
>                end = start + bvec->bv_offset + bvec->bv_len - 1;
>
>+               btrfs_record_physical_zoned(inode, start, bio);

We need to record the physical address only once per an ordered extent.
So, this should be like:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c21d1dbe314e..0bbe6e52ea0d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2748,6 +2748,7 @@ static void end_bio_extent_writepage(struct bio *bio)
         u64 start;
         u64 end;
         struct bvec_iter_all iter_all;
+       bool first_bvec = true;

         ASSERT(!bio_flagged(bio, BIO_CLONED));
         bio_for_each_segment_all(bvec, bio, iter_all) {
@@ -2774,6 +2775,11 @@ static void end_bio_extent_writepage(struct bio *bio)
                 start = page_offset(page);
                 end = start + bvec->bv_offset + bvec->bv_len - 1;

+               if (first_bvec) {
+                       btrfs_record_physical_zoned(inode, start, bio);
+                       first_bvec = false;
+               }
+
                 end_extent_writepage(page, error, start, end);
                 end_page_writeback(page);
         }


>                end_extent_writepage(page, error, start, end);
>                end_page_writeback(page);
>        }
>diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>index 576f8e333f16..6fdb21029ea9 100644
>--- a/fs/btrfs/zoned.c
>+++ b/fs/btrfs/zoned.c
>@@ -1086,8 +1086,7 @@ void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
> {
>        struct btrfs_ordered_extent *ordered;
>        struct bio_vec bvec = bio_iovec(bio);
>-       u64 physical = ((u64)bio->bi_iter.bi_sector << SECTOR_SHIFT) +
>-               bvec.bv_offset;
>+       u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>
>        if (bio_op(bio) != REQ_OP_ZONE_APPEND)
>                return;
>

Johannes Thumshirn Sept. 17, 2020, 7:14 a.m. UTC | #6

On 17/09/2020 07:40, Naohiro Aota wrote:
> Thank you for fixing this.

Well it was you who had the idea, I just sent it.

>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index bda4e02b5eab..311956697682 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2753,10 +2753,6 @@ static void end_bio_extent_writepage(struct bio *bio)
>>        u64 end;
>>        struct bvec_iter_all iter_all;
>>
>> -       btrfs_record_physical_zoned(bio_iovec(bio).bv_page->mapping->host,
>> -                                   page_offset(bio_iovec(bio).bv_page) + bio_iovec(bio).bv_offset,
>> -                                   bio);
>> -
>>        ASSERT(!bio_flagged(bio, BIO_CLONED));
>>        bio_for_each_segment_all(bvec, bio, iter_all) {
>>                struct page *page = bvec->bv_page;
>> @@ -2782,6 +2778,7 @@ static void end_bio_extent_writepage(struct bio *bio)
>>                start = page_offset(page);
>>                end = start + bvec->bv_offset + bvec->bv_len - 1;
>>
>> +               btrfs_record_physical_zoned(inode, start, bio);
> We need to record the physical address only once per an ordered extent.
> So, this should be like:
> 

Right, this would save us a lot of unneeded function calls for the non-zoned
version of btrfs as well.

> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c21d1dbe314e..0bbe6e52ea0d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2748,6 +2748,7 @@ static void end_bio_extent_writepage(struct bio *bio)
>          u64 start;
>          u64 end;
>          struct bvec_iter_all iter_all;
> +       bool first_bvec = true;
> 
>          ASSERT(!bio_flagged(bio, BIO_CLONED));
>          bio_for_each_segment_all(bvec, bio, iter_all) {
> @@ -2774,6 +2775,11 @@ static void end_bio_extent_writepage(struct bio *bio)
>                  start = page_offset(page);
>                  end = start + bvec->bv_offset + bvec->bv_len - 1;
> 
> +               if (first_bvec) {
> +                       btrfs_record_physical_zoned(inode, start, bio);
> +                       first_bvec = false;
> +               }
> +
>                  end_extent_writepage(page, error, start, end);
>                  end_page_writeback(page);
>          }
> 
> 
>>                end_extent_writepage(page, error, start, end);
>>                end_page_writeback(page);
>>        }
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index 576f8e333f16..6fdb21029ea9 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -1086,8 +1086,7 @@ void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
>> {
>>        struct btrfs_ordered_extent *ordered;
>>        struct bio_vec bvec = bio_iovec(bio);
>> -       u64 physical = ((u64)bio->bi_iter.bi_sector << SECTOR_SHIFT) +
>> -               bvec.bv_offset;
>> +       u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>>
>>        if (bio_op(bio) != REQ_OP_ZONE_APPEND)
>>                return;
>>

[v7,00/39] btrfs: zoned block device support

Message

Comments