
[GIT,PULL] Block updates for 6.9-rc1

Message ID eaeec3b6-75c2-4b65-8c50-2d37450ccdd9@kernel.dk

Pull-request

git://git.kernel.dk/linux.git tags/for-6.9/block-20240310

Message

Jens Axboe March 10, 2024, 8:30 p.m. UTC
Hi Linus,

Here are the core block changes queued for the 6.9-rc1 kernel. This pull
request contains:

- MD pull requests via Song:
	- Cleanup redundant checks, by Yu Kuai.
	- Remove deprecated headers, by Marc Zyngier and Song Liu.
	- Concurrency fixes, by Li Lingfeng.
	- Memory leak fix, by Li Nan.
	- Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
	- Clean up and fix for md_ioctl, by Li Nan.
	- Other small fixes, by Gui-Dong Han and Heming Zhao.
	- MD atomic limits (Christoph)

- NVMe pull request via Keith:
	- RDMA target enhancements (Max)
	- Fabrics fixes (Max, Guixin, Hannes)
	- Atomic queue_limits usage (Christoph)
	- Const use for class_register (Ricardo)
	- Identification error handling fixes (Shin'ichiro, Keith)

- Improvement and cleanup for cached request handling (Christoph)

- Moving towards atomic queue limits. Core changes and driver bits so
  far (Christoph). A rough sketch of the new API follows this list.

- Fix UAF issues in aoeblk (Chun-Yi)

- Zoned fix and cleanups (Damien)

- s390 dasd cleanups and fixes (Jan, Miroslav)

- Block issue timestamp caching (me)

- noio scope guarding for zoned IO (Johannes)

- block/nvme PI improvements (Kanchan)

- Ability to terminate long running discard loop (Keith)

- bdev revalidation fix (Li)

- Get rid of old nr_queues hack for kdump kernels (Ming)

- Support for async deletion of ublk (Ming)

- Improve IRQ bio recycling (Pavel)

- Factor in CPU capacity for remote vs local completion (Qais)

- Add shared_tags configfs entry for null_blk (Shin'ichiro)

- Fix for a regression in page refcounts introduced by the folio
  unification (Tony)

- Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo,
  Roman, Tang, Uwe)
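
As a rough sketch of the atomic queue limits pattern mentioned above
(queue_limits_commit_update() is named in the commit list below; the
queue_limits_start_update() helper name and the field values here are
assumptions for illustration only):

	#include <linux/blkdev.h>

	/* Sketch only: snapshot the limits, edit the copy, then validate
	 * and publish the whole set in one go instead of poking fields
	 * on the live queue one at a time. */
	static int example_update_limits(struct request_queue *q)
	{
		struct queue_limits lim;

		lim = queue_limits_start_update(q);
		lim.max_hw_sectors = 256;	/* illustrative values */
		lim.max_segments = 32;
		return queue_limits_commit_update(q, &lim);
	}

In the same spirit, the converted drivers now pass a pre-filled
queue_limits to blk_mq_alloc_disk()/blk_alloc_disk() at allocation time
instead of adjusting the queue afterwards.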

Please pull!


The following changes since commit 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478:

  Linux 6.8-rc3 (2024-02-04 12:20:36 +0000)

are available in the Git repository at:

  git://git.kernel.dk/linux.git tags/for-6.9/block-20240310

for you to fetch changes up to 5205a4aa8fc9454853b705b69611c80e9c644283:

  block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC (2024-03-09 07:31:42 -0700)

----------------------------------------------------------------
for-6.9/block-20240310

----------------------------------------------------------------
Arnd Bergmann (2):
      floppy: fix function pointer cast warnings
      drbd: fix function cast warnings in state machine

Chengming Zhou (1):
      bdev: remove SLAB_MEM_SPREAD flag usage

Christoph Hellwig (108):
      blk-mq: move blk_mq_attempt_bio_merge out blk_mq_get_new_requests
      blk-mq: introduce a blk_mq_peek_cached_request helper
      blk-mq: special case cached requests less
      block: move max_{open,active}_zones to struct queue_limits
      block: refactor disk_update_readahead
      block: decouple blk_set_stacking_limits from blk_set_default_limits
      block: add an API to atomically update queue limits
      block: use queue_limits_commit_update in queue_max_sectors_store
      block: add a max_user_discard_sectors queue limit
      block: use queue_limits_commit_update in queue_discard_max_store
      block: pass a queue_limits argument to blk_alloc_queue
      block: pass a queue_limits argument to blk_mq_init_queue
      block: pass a queue_limits argument to blk_mq_alloc_disk
      virtio_blk: split virtblk_probe
      virtio_blk: pass queue_limits to blk_mq_alloc_disk
      loop: cleanup loop_config_discard
      loop: pass queue_limits to blk_mq_alloc_disk
      loop: use the atomic queue limits update API
      block: pass a queue_limits argument to blk_alloc_disk
      nfblock: pass queue_limits to blk_mq_alloc_disk
      brd: pass queue_limits to blk_mq_alloc_disk
      n64cart: pass queue_limits to blk_mq_alloc_disk
      zram: pass queue_limits to blk_mq_alloc_disk
      bcache: pass queue_limits to blk_mq_alloc_disk
      btt: pass queue_limits to blk_mq_alloc_disk
      pmem: pass queue_limits to blk_mq_alloc_disk
      dcssblk: pass queue_limits to blk_mq_alloc_disk
      ubd: pass queue_limits to blk_mq_alloc_disk
      aoe: pass queue_limits to blk_mq_alloc_disk
      floppy: pass queue_limits to blk_mq_alloc_disk
      mtip: pass queue_limits to blk_mq_alloc_disk
      nbd: pass queue_limits to blk_mq_alloc_disk
      ps3disk: pass queue_limits to blk_mq_alloc_disk
      rbd: pass queue_limits to blk_mq_alloc_disk
      rnbd-clt: pass queue_limits to blk_mq_alloc_disk
      sunvdc: pass queue_limits to blk_mq_alloc_disk
      gdrom: pass queue_limits to blk_mq_alloc_disk
      ms_block: pass queue_limits to blk_mq_alloc_disk
      mspro_block: pass queue_limits to blk_mq_alloc_disk
      mtd_blkdevs: pass queue_limits to blk_mq_alloc_disk
      ubiblock: pass queue_limits to blk_mq_alloc_disk
      scm_blk: pass queue_limits to blk_mq_alloc_disk
      ublk: pass queue_limits to blk_mq_alloc_disk
      mmc: pass queue_limits to blk_mq_alloc_disk
      null_blk: remove the bio based I/O path
      null_blk: initialize the tag_set timeout in null_init_tag_set
      null_blk: refactor tag_set setup
      null_blk: remove null_gendisk_register
      null_blk: pass queue_limits to blk_mq_alloc_disk
      block: fix virt_boundary handling in blk_validate_limits
      pktcdvd: stop setting q->queuedata
      pktcdvd: set queue limits at disk allocation time
      xen-blkfront: set max_discard/secure erase limits to UINT_MAX
      xen-blkfront: rely on the default discard granularity
      xen-blkfront: don't redundantly set max_sements in blkif_recover
      xen-blkfront: atomically update queue limits
      ubd: remove the ubd_gendisk array
      ubd: remove ubd_disk_register
      ubd: move setting the nonrot flag to ubd_add
      ubd: move setting the variable queue limits to ubd_add
      ubd: move set_disk_ro to ubd_add
      ubd: remove the queue pointer in struct ubd
      ubd: open the backing files in ubd_add
      block: add a queue_limits_set helper
      block: add a queue_limits_stack_bdev helper
      dm: use queue_limits_set
      pktcdvd: don't set max_hw_sectors on the underlying device
      nbd: don't clear discard_sectors in nbd_config_put
      nbd: freeze the queue for queue limits updates
      nbd: use the atomic queue limits API in nbd_set_size
      nvme: set max_hw_sectors unconditionally
      nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard
      nvme: remove nvme_revalidate_zones
      nvme: move max_integrity_segments handling out of nvme_init_integrity
      nvme: cleanup the nvme_init_integrity calling conventions
      nvme: move blk_integrity_unregister into nvme_init_integrity
      nvme: don't use nvme_update_disk_info for the multipath disk
      nvme: move a few things out of nvme_update_disk_info
      nvme: move setting the write cache flags out of nvme_set_queue_limits
      nvme: move common logic into nvme_update_ns_info
      nvme: split out a nvme_identify_ns_nvm helper
      nvme: don't query identify data in configure_metadata
      nvme: cleanup nvme_configure_metadata
      nvme: use the atomic queue limits update API
      nvme-multipath: pass queue_limits to blk_alloc_disk
      nvme-multipath: use atomic queue limits API for stacking limits
      dasd: cleamup dasd_state_basic_to_ready
      dasd: move queue setup to common code
      dasd: use the atomic queue limits API
      drbd: pass the max_hw_sectors limit to blk_alloc_disk
      drbd: refactor drbd_reconsider_queue_parameters
      drbd: refactor the backing dev max_segments calculation
      drbd: merge drbd_setup_queue_param into drbd_reconsider_queue_parameters
      drbd: don't set max_write_zeroes_sectors in decide_on_discard_support
      drbd: split out a drbd_discard_supported helper
      drbd: atomically update queue limits in drbd_reconsider_queue_parameters
      bcache: move calculation of stripe_size and io_opt into bcache_device_init
      md: add a mddev_trace_remap helper
      md: add a mddev_add_trace_msg helper
      md: add a mddev_is_dm helper
      md: add queue limit helpers
      md/raid0: use the atomic queue limit update APIs
      md/raid1: use the atomic queue limit update APIs
      md/raid5: use the atomic queue limit update APIs
      md/raid10: use the atomic queue limit update APIs
      md: don't initialize queue limits
      md: remove mddev->queue
      block: remove disk_stack_limits

Chun-Yi Lee (1):
      aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts

Colin Ian King (1):
      block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC

Damien Le Moal (3):
      block: Clear zone limits for a non-zoned stacked queue
      block: Do not include rbtree.h in blk-zoned.c
      virtio_blk: Do not use disk_set_max_open/active_zones()

Gui-Dong Han (1):
      md/raid5: fix atomicity violation in raid5_cache_count

Guixin Liu (1):
      nvme-fabrics: check max outstanding commands

Hannes Reinecke (1):
      nvme-fabrics: typo in nvmf_parse_key()

Heming Zhao (1):
      md/md-bitmap: fix incorrect usage for sb_index

Jan Höppner (10):
      s390/dasd: Simplify uid string generation
      s390/dasd: Use sysfs_emit() over sprintf()
      s390/dasd: Remove unnecessary errorstring generation
      s390/dasd: Move allocation error message to DBF
      s390/dasd: Remove unused message logging macros
      s390/dasd: Use dev_err() over printk()
      s390/dasd: Remove %p format specifier from error messages
      s390/dasd: Remove PRINTK_HEADER and KMSG_COMPONENT definitions
      s390/dasd: Use dev_*() for device log messages
      s390/dasd: Improve ERP error messages

Jens Axboe (9):
      block: move cgroup time handling code into blk.h
      block: add blk_time_get_ns() and blk_time_get() helpers
      block: cache current nsec time in struct blk_plug
      block: update cached timestamp post schedule/preemption
      Merge tag 'md-6.9-20240216' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
      Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
      Merge tag 'md-6.9-20240305' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
      Merge tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
      Merge tag 'nvme-6.9-2024-03-07' of git://git.infradead.org/nvme into for-6.9/block

Johannes Thumshirn (5):
      zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call
      dm: dm-zoned: guard blkdev_zone_mgmt with noio scope
      btrfs: zoned: call blkdev_zone_mgmt in nofs scope
      f2fs: guard blkdev_zone_mgmt with nofs scope
      block: remove gfp_flags from blkdev_zone_mgmt

John Garry (1):
      null_blk: Delete nullb.{queue_depth, nr_queues}

Kanchan Joshi (3):
      block: refactor guard helpers
      block: support PI at non-zero offset within metadata
      nvme: allow integrity when PI is not in first bytes

Keith Busch (5):
      block: blkdev_issue_secure_erase loop style
      block: cleanup __blkdev_issue_write_zeroes
      block: io wait hang check helper
      blk-lib: check for kill signal
      nvme: clear caller pointer on identify failure

Kunwu Chan (1):
      block: Simplify the allocation of slab caches

Li Lingfeng (3):
      md: get rdev->mddev with READ_ONCE()
      md: use RCU lock to protect traversal in md_spares_need_change()
      block: move capacity validation to blkpg_do_ioctl()

Li Nan (11):
      md: fix kmemleak of rdev->serial
      block: fix deadlock between bd_link_disk_holder and partition scan
      md: merge the check of capabilities into md_ioctl_valid()
      md: changed the switch of RAID_VERSION to if
      md: clean up invalid BUG_ON in md_ioctl
      md: return directly before setting did_set_md_closing
      md: Don't clear MD_CLOSING when the raid is about to stop
      md: factor out a helper to sync mddev
      md: sync blockdev before stopping raid or setting readonly
      md: clean up openers check in do_md_stop() and md_set_readonly()
      md: check mddev->pers before calling md_set_readonly()

Li kunyu (2):
      sed-opal: Remove unnecessary ‘0’ values from ret
      sed-opal: Remove the ret variable from the function

Li zeming (2):
      sed-opal: Remove unnecessary ‘0’ values from error
      sed-opal: Remove unnecessary ‘0’ values from err

Marc Zyngier (1):
      md/linear: Get rid of md-linear.h

Max Gurtovoy (8):
      nvme-rdma: move NVME_RDMA_IP_PORT from common file
      nvmet: compare mqes and sqsize only for IO SQ
      nvmet: set maxcmd to be per controller
      nvmet: set ctrl pi_support cap before initializing cap reg
      nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definition
      nvme-rdma: clamp queue size according to ctrl cap
      nvmet: introduce new max queue size configuration entry
      nvmet-rdma: set max_queue_size for RDMA transport

Ming Lei (3):
      blk-mq: don't change nr_hw_queues and nr_maps for kdump kernel
      ublk: improve getting & putting ublk device
      ublk: add UBLK_CMD_DEL_DEV_ASYNC

Miroslav Franc (1):
      s390/dasd: fix double module refcount decrement

Navid Emamdoost (1):
      nbd: null check for nla_nest_start

Pavel Begunkov (2):
      block: extend bio caching to task context
      block: optimise in irq bio put caching

Qais Yousef (2):
      sched: Add a new function to compare if two cpus have the same capacity
      block/blk-mq: Don't complete locally if capacities are different

Ricardo B. Marliere (5):
      block: rbd: make rbd_bus_type const
      nvme: core: constify struct class usage
      nvme: fabrics: make nvmf_class constant
      nvme: fcloop: make fcloop_class constant
      block: make block_class constant

Roman Smirnov (1):
      block: prevent division by zero in blk_rq_stat_sum()

Shin'ichiro Kawasaki (2):
      null_blk: add configfs variable shared_tags
      nvme: host: fix double-free of struct nvme_id_ns in ns_update_nuse()

Song Liu (4):
      md/multipath: Remove md-multipath.h
      Merge branch 'raid1-read_balance' into md-6.9
      Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""
      Merge branch 'dmraid-fix-6.9' into md-6.9

Tang Yizhou (1):
      blk-throttle: Eliminate redundant checks for data direction

Tony Battersby (1):
      block: Fix page refcounts for unaligned buffers in __bio_release_pages()

Uwe Kleine-König (2):
      cdrom: gdrom: Convert to platform remove callback returning void
      block/swim: Convert to platform remove callback returning void

Yu Kuai (22):
      md: remove redundant check of 'mddev->sync_thread'
      md: remove redundant md_wakeup_thread()
      md: add a new helper rdev_has_badblock()
      md/raid1: factor out helpers to add rdev to conf
      md/raid1: record nonrot rdevs while adding/removing rdevs to conf
      md/raid1: fix choose next idle in read_balance()
      md/raid1-10: add a helper raid1_check_read_range()
      md/raid1-10: factor out a new helper raid1_should_read_first()
      md/raid1: factor out read_first_rdev() from read_balance()
      md/raid1: factor out choose_slow_rdev() from read_balance()
      md/raid1: factor out choose_bb_rdev() from read_balance()
      md/raid1: factor out the code to manage sequential IO
      md/raid1: factor out helpers to choose the best rdev from read_balance()
      md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
      md: export helpers to stop sync_thread
      md: export helper md_is_rdwr()
      md: add a new helper reshape_interrupted()
      dm-raid: really frozen sync_thread during suspend
      md/dm-raid: don't call md_reap_sync_thread() directly
      dm-raid: add a new helper prepare_suspend() in md_personality
      dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
      dm-raid: fix lockdep waring in "pers->hot_add_disk"

 arch/m68k/emu/nfblock.c                |  10 +-
 arch/um/drivers/ubd_kern.c             | 135 +++-----
 arch/xtensa/platforms/iss/simdisk.c    |   8 +-
 block/bdev.c                           |   2 +-
 block/bfq-cgroup.c                     |  14 +-
 block/bfq-iosched.c                    |  28 +-
 block/bio-integrity.c                  |   1 +
 block/bio.c                            |  45 ++-
 block/blk-cgroup.c                     |   2 +-
 block/blk-cgroup.h                     |   1 +
 block/blk-core.c                       |  33 +-
 block/blk-flush.c                      |   2 +-
 block/blk-integrity.c                  |   1 +
 block/blk-iocost.c                     |   8 +-
 block/blk-iolatency.c                  |   6 +-
 block/blk-lib.c                        |  70 +++-
 block/blk-mq.c                         | 186 +++++-----
 block/blk-settings.c                   | 329 ++++++++++++++----
 block/blk-stat.c                       |   2 +-
 block/blk-sysfs.c                      |  59 ++--
 block/blk-throttle.c                   |  10 +-
 block/blk-wbt.c                        |   6 +-
 block/blk-zoned.c                      |  20 +-
 block/blk.h                            |  84 ++++-
 block/bsg-lib.c                        |   2 +-
 block/genhd.c                          |  14 +-
 block/holder.c                         |  12 +-
 block/ioctl.c                          |   9 +-
 block/partitions/core.c                |  11 -
 block/partitions/mac.c                 |   2 +
 block/sed-opal.c                       |  16 +-
 block/t10-pi.c                         |  72 ++--
 drivers/base/base.h                    |   2 +-
 drivers/block/amiflop.c                |   2 +-
 drivers/block/aoe/aoeblk.c             |  15 +-
 drivers/block/aoe/aoecmd.c             |  12 +-
 drivers/block/aoe/aoenet.c             |   1 +
 drivers/block/ataflop.c                |   2 +-
 drivers/block/brd.c                    |  26 +-
 drivers/block/drbd/drbd_main.c         |  17 +-
 drivers/block/drbd/drbd_nl.c           | 210 ++++++------
 drivers/block/drbd/drbd_state.c        |  24 +-
 drivers/block/drbd/drbd_state_change.h |   8 +-
 drivers/block/floppy.c                 |  17 +-
 drivers/block/loop.c                   |  75 ++--
 drivers/block/mtip32xx/mtip32xx.c      |  13 +-
 drivers/block/n64cart.c                |  12 +-
 drivers/block/nbd.c                    |  49 ++-
 drivers/block/null_blk/main.c          | 535 ++++++++---------------------
 drivers/block/null_blk/null_blk.h      |  24 +-
 drivers/block/null_blk/trace.h         |   5 +-
 drivers/block/null_blk/zoned.c         |  25 +-
 drivers/block/pktcdvd.c                |  41 +--
 drivers/block/ps3disk.c                |  17 +-
 drivers/block/ps3vram.c                |   6 +-
 drivers/block/rbd.c                    |  31 +-
 drivers/block/rnbd/rnbd-clt.c          |  64 ++--
 drivers/block/sunvdc.c                 |  18 +-
 drivers/block/swim.c                   |   8 +-
 drivers/block/swim3.c                  |   2 +-
 drivers/block/ublk_drv.c               | 111 +++---
 drivers/block/virtio_blk.c             | 303 +++++++++--------
 drivers/block/xen-blkfront.c           |  53 +--
 drivers/block/z2ram.c                  |   2 +-
 drivers/block/zram/zram_drv.c          |  51 ++-
 drivers/cdrom/gdrom.c                  |  20 +-
 drivers/md/bcache/super.c              |  59 ++--
 drivers/md/dm-raid.c                   |  93 +++--
 drivers/md/dm-table.c                  |  27 +-
 drivers/md/dm-zoned-metadata.c         |   5 +-
 drivers/md/dm.c                        |   4 +-
 drivers/md/md-bitmap.c                 |  18 +-
 drivers/md/md-linear.h                 |  17 -
 drivers/md/md-multipath.h              |  32 --
 drivers/md/md.c                        | 400 +++++++++++++---------
 drivers/md/md.h                        |  77 ++++-
 drivers/md/raid0.c                     |  42 +--
 drivers/md/raid1-10.c                  |  69 ++++
 drivers/md/raid1.c                     | 601 ++++++++++++++++++++-------------
 drivers/md/raid1.h                     |   1 +
 drivers/md/raid10.c                    | 143 ++++----
 drivers/md/raid5-ppl.c                 |   3 +-
 drivers/md/raid5.c                     | 273 ++++++++-------
 drivers/memstick/core/ms_block.c       |  14 +-
 drivers/memstick/core/mspro_block.c    |  15 +-
 drivers/mmc/core/queue.c               |  97 +++---
 drivers/mtd/mtd_blkdevs.c              |  12 +-
 drivers/mtd/ubi/block.c                |   6 +-
 drivers/nvdimm/btt.c                   |  14 +-
 drivers/nvdimm/pmem.c                  |  14 +-
 drivers/nvme/host/apple.c              |   2 +-
 drivers/nvme/host/core.c               | 458 +++++++++++++------------
 drivers/nvme/host/fabrics.c            |  22 +-
 drivers/nvme/host/multipath.c          |  17 +-
 drivers/nvme/host/nvme.h               |  12 +-
 drivers/nvme/host/rdma.c               |  14 +-
 drivers/nvme/host/sysfs.c              |   7 +-
 drivers/nvme/host/zns.c                |  24 +-
 drivers/nvme/target/admin-cmd.c        |   2 +-
 drivers/nvme/target/configfs.c         |  28 ++
 drivers/nvme/target/core.c             |  18 +-
 drivers/nvme/target/discovery.c        |   2 +-
 drivers/nvme/target/fabrics-cmd.c      |   5 +-
 drivers/nvme/target/fcloop.c           |  17 +-
 drivers/nvme/target/nvmet.h            |   6 +-
 drivers/nvme/target/passthru.c         |   2 +-
 drivers/nvme/target/rdma.c             |  10 +
 drivers/nvme/target/zns.c              |   5 +-
 drivers/s390/block/dasd.c              | 180 +++++-----
 drivers/s390/block/dasd_3990_erp.c     |  80 ++---
 drivers/s390/block/dasd_alias.c        |   8 -
 drivers/s390/block/dasd_devmap.c       |  34 +-
 drivers/s390/block/dasd_diag.c         |  26 +-
 drivers/s390/block/dasd_eckd.c         | 186 ++++------
 drivers/s390/block/dasd_eer.c          |   7 -
 drivers/s390/block/dasd_erp.c          |   9 +-
 drivers/s390/block/dasd_fba.c          |  88 ++---
 drivers/s390/block/dasd_genhd.c        |  18 +-
 drivers/s390/block/dasd_int.h          |  35 +-
 drivers/s390/block/dasd_ioctl.c        |   6 -
 drivers/s390/block/dasd_proc.c         |   5 -
 drivers/s390/block/dcssblk.c           |  10 +-
 drivers/s390/block/scm_blk.c           |  17 +-
 drivers/scsi/scsi_scan.c               |   2 +-
 drivers/ufs/core/ufshcd.c              |   2 +-
 fs/btrfs/zoned.c                       |  35 +-
 fs/f2fs/segment.c                      |  15 +-
 fs/zonefs/super.c                      |   2 +-
 include/linux/blk-integrity.h          |   1 +
 include/linux/blk-mq.h                 |  10 +-
 include/linux/blk_types.h              |  42 ---
 include/linux/blkdev.h                 |  73 +++-
 include/linux/nvme-rdma.h              |   6 +-
 include/linux/nvme.h                   |   2 -
 include/linux/sched.h                  |   2 +-
 include/linux/sched/topology.h         |   6 +
 include/uapi/linux/ublk_cmd.h          |   2 +
 kernel/sched/core.c                    |  17 +-
 138 files changed, 3443 insertions(+), 3171 deletions(-)
 delete mode 100644 drivers/md/md-linear.h
 delete mode 100644 drivers/md/md-multipath.h

Comments

pr-tracker-bot@kernel.org March 11, 2024, 7:43 p.m. UTC | #1
The pull request you sent on Sun, 10 Mar 2024 14:30:57 -0600:

> git://git.kernel.dk/linux.git tags/for-6.9/block-20240310

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1ddeeb2a058d7b2a58ed9e820396b4ceb715d529

Thank you!
Johannes Weiner March 11, 2024, 11:50 p.m. UTC | #2
On Sun, Mar 10, 2024 at 02:30:57PM -0600, Jens Axboe wrote:
> Hi Linus,
> 
> Here are the core block changes queued for the 6.9-rc1 kernel. This pull
> request contains:
> 
> - MD pull requests via Song:
> 	- Cleanup redundant checks, by Yu Kuai.
> 	- Remove deprecated headers, by Marc Zyngier and Song Liu.
> 	- Concurrency fixes, by Li Lingfeng.
> 	- Memory leak fix, by Li Nan.
> 	- Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
> 	- Clean up and fix for md_ioctl, by Li Nan.
> 	- Other small fixes, by Gui-Dong Han and Heming Zhao.
> 	- MD atomic limits (Christoph)

My desktop fails to decrypt /home on boot with this:

[   12.152489] WARNING: CPU: 0 PID: 626 at block/blk-settings.c:192 blk_validate_limits+0x1da/0x1f0
[   12.152493] Modules linked in: amdgpu drm_ttm_helper ttm drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper btusb btintel
[   12.152498] CPU: 0 PID: 626 Comm: systemd-cryptse Not tainted 6.8.0-00855-gd08c407f715f #25 c6b9e287c2730f07982c9e0e4ed9225e8333a29f
[   12.152499] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[   12.152500] RIP: 0010:blk_validate_limits+0x1da/0x1f0
[   12.152502] Code: ff 0f 00 00 0f 87 2d ff ff ff 0f 0b eb 02 0f 0b ba ea ff ff ff e9 7a ff ff ff 0f 0b eb f2 0f 0b eb ee 0f 0b eb ea 0f 0b eb e6 <0f> 0b eb e2 0f 0b eb de 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
[   12.152503] RSP: 0018:ffff9c41065b3b68 EFLAGS: 00010203
[   12.152503] RAX: ffff9c41065b3bc0 RBX: ffff9c41065b3bc0 RCX: 00000000ffffffff
[   12.152504] RDX: 0000000000000fff RSI: 0000000000000200 RDI: 0000000000000100
[   12.152504] RBP: ffff8a11c0d28350 R08: 0000000000000100 R09: 0000000000000001
[   12.152505] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9c41065b3bc0
[   12.152505] R13: ffff8a11c0d285c8 R14: ffff9c41065b3bc0 R15: ffff8a122eedc138
[   12.152505] FS:  00007faa969214c0(0000) GS:ffff8a18dde00000(0000) knlGS:0000000000000000
[   12.152506] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.152506] CR2: 00007f11d8a2a910 CR3: 00000001059d0000 CR4: 0000000000350ef0
[   12.152507] Call Trace:
[   12.152508]  <TASK>
[   12.152508]  ? __warn+0x6f/0xd0
[   12.152511]  ? blk_validate_limits+0x1da/0x1f0
[   12.152512]  ? report_bug+0x147/0x190
[   12.152514]  ? handle_bug+0x36/0x70
[   12.152516]  ? exc_invalid_op+0x17/0x60
[   12.152516]  ? asm_exc_invalid_op+0x1a/0x20
[   12.152519]  ? blk_validate_limits+0x1da/0x1f0
[   12.152520]  queue_limits_set+0x27/0x130
[   12.152521]  dm_table_set_restrictions+0x1bb/0x440
[   12.152525]  dm_setup_md_queue+0x9a/0x1e0
[   12.152527]  table_load+0x251/0x400
[   12.152528]  ? dev_suspend+0x2d0/0x2d0
[   12.152529]  ctl_ioctl+0x305/0x5e0
[   12.152531]  dm_ctl_ioctl+0x9/0x10
[   12.152532]  __x64_sys_ioctl+0x89/0xb0
[   12.152534]  do_syscall_64+0x7f/0x160
[   12.152536]  ? syscall_exit_to_user_mode+0x6b/0x1a0
[   12.152537]  ? do_syscall_64+0x8b/0x160
[   12.152538]  ? do_syscall_64+0x8b/0x160
[   12.152538]  ? do_syscall_64+0x8b/0x160
[   12.152539]  ? do_syscall_64+0x8b/0x160
[   12.152540]  ? irq_exit_rcu+0x4a/0xb0
[   12.152541]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
[   12.152542] RIP: 0033:0x7faa9632319b
[   12.152543] Code: 00 48 89 44 24 18 31 c0 c7 04 24 10 00 00 00 48 8d 44 24 60 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[   12.152543] RSP: 002b:00007ffd8ac496d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   12.152544] RAX: ffffffffffffffda RBX: 0000564061a630c0 RCX: 00007faa9632319b
[   12.152544] RDX: 0000564061a630c0 RSI: 00000000c138fd09 RDI: 0000000000000004
[   12.152545] RBP: 00007ffd8ac498d0 R08: 0000000000000007 R09: 0000000000000006
[   12.152545] R10: 0000000000000007 R11: 0000000000000246 R12: 00005640619fcbd0
[   12.152545] R13: 0000000000000003 R14: 0000564061a63170 R15: 00007faa95ea4b2f
[   12.152546]  </TASK>
[   12.152546] ---[ end trace 0000000000000000 ]---
[   12.152547] device-mapper: ioctl: unable to set up device queue for new table.

Reverting 8e0ef4128694 ("dm: use queue_limits_set") makes it work.

Happy to provide more debugging info and/or test patches!
Jens Axboe March 11, 2024, 11:53 p.m. UTC | #3
On 3/11/24 5:50 PM, Johannes Weiner wrote:
> On Sun, Mar 10, 2024 at 02:30:57PM -0600, Jens Axboe wrote:
>> Hi Linus,
>>
>> Here are the core block changes queued for the 6.9-rc1 kernel. This pull
>> request contains:
>>
>> - MD pull requests via Song:
>> 	- Cleanup redundant checks, by Yu Kuai.
>> 	- Remove deprecated headers, by Marc Zyngier and Song Liu.
>> 	- Concurrency fixes, by Li Lingfeng.
>> 	- Memory leak fix, by Li Nan.
>> 	- Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
>> 	- Clean up and fix for md_ioctl, by Li Nan.
>> 	- Other small fixes, by Gui-Dong Han and Heming Zhao.
>> 	- MD atomic limits (Christoph)
> 
> My desktop fails to decrypt /home on boot with this:
> 
> [   12.152489] WARNING: CPU: 0 PID: 626 at block/blk-settings.c:192 blk_validate_limits+0x1da/0x1f0
> [   12.152493] Modules linked in: amdgpu drm_ttm_helper ttm drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper btusb btintel
> [   12.152498] CPU: 0 PID: 626 Comm: systemd-cryptse Not tainted 6.8.0-00855-gd08c407f715f #25 c6b9e287c2730f07982c9e0e4ed9225e8333a29f
> [   12.152499] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
> [   12.152500] RIP: 0010:blk_validate_limits+0x1da/0x1f0
> [   12.152502] Code: ff 0f 00 00 0f 87 2d ff ff ff 0f 0b eb 02 0f 0b ba ea ff ff ff e9 7a ff ff ff 0f 0b eb f2 0f 0b eb ee 0f 0b eb ea 0f 0b eb e6 <0f> 0b eb e2 0f 0b eb de 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
> [   12.152503] RSP: 0018:ffff9c41065b3b68 EFLAGS: 00010203
> [   12.152503] RAX: ffff9c41065b3bc0 RBX: ffff9c41065b3bc0 RCX: 00000000ffffffff
> [   12.152504] RDX: 0000000000000fff RSI: 0000000000000200 RDI: 0000000000000100
> [   12.152504] RBP: ffff8a11c0d28350 R08: 0000000000000100 R09: 0000000000000001
> [   12.152505] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9c41065b3bc0
> [   12.152505] R13: ffff8a11c0d285c8 R14: ffff9c41065b3bc0 R15: ffff8a122eedc138
> [   12.152505] FS:  00007faa969214c0(0000) GS:ffff8a18dde00000(0000) knlGS:0000000000000000
> [   12.152506] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.152506] CR2: 00007f11d8a2a910 CR3: 00000001059d0000 CR4: 0000000000350ef0
> [   12.152507] Call Trace:
> [   12.152508]  <TASK>
> [   12.152508]  ? __warn+0x6f/0xd0
> [   12.152511]  ? blk_validate_limits+0x1da/0x1f0
> [   12.152512]  ? report_bug+0x147/0x190
> [   12.152514]  ? handle_bug+0x36/0x70
> [   12.152516]  ? exc_invalid_op+0x17/0x60
> [   12.152516]  ? asm_exc_invalid_op+0x1a/0x20
> [   12.152519]  ? blk_validate_limits+0x1da/0x1f0
> [   12.152520]  queue_limits_set+0x27/0x130
> [   12.152521]  dm_table_set_restrictions+0x1bb/0x440
> [   12.152525]  dm_setup_md_queue+0x9a/0x1e0
> [   12.152527]  table_load+0x251/0x400
> [   12.152528]  ? dev_suspend+0x2d0/0x2d0
> [   12.152529]  ctl_ioctl+0x305/0x5e0
> [   12.152531]  dm_ctl_ioctl+0x9/0x10
> [   12.152532]  __x64_sys_ioctl+0x89/0xb0
> [   12.152534]  do_syscall_64+0x7f/0x160
> [   12.152536]  ? syscall_exit_to_user_mode+0x6b/0x1a0
> [   12.152537]  ? do_syscall_64+0x8b/0x160
> [   12.152538]  ? do_syscall_64+0x8b/0x160
> [   12.152538]  ? do_syscall_64+0x8b/0x160
> [   12.152539]  ? do_syscall_64+0x8b/0x160
> [   12.152540]  ? irq_exit_rcu+0x4a/0xb0
> [   12.152541]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
> [   12.152542] RIP: 0033:0x7faa9632319b
> [   12.152543] Code: 00 48 89 44 24 18 31 c0 c7 04 24 10 00 00 00 48 8d 44 24 60 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> [   12.152543] RSP: 002b:00007ffd8ac496d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [   12.152544] RAX: ffffffffffffffda RBX: 0000564061a630c0 RCX: 00007faa9632319b
> [   12.152544] RDX: 0000564061a630c0 RSI: 00000000c138fd09 RDI: 0000000000000004
> [   12.152545] RBP: 00007ffd8ac498d0 R08: 0000000000000007 R09: 0000000000000006
> [   12.152545] R10: 0000000000000007 R11: 0000000000000246 R12: 00005640619fcbd0
> [   12.152545] R13: 0000000000000003 R14: 0000564061a63170 R15: 00007faa95ea4b2f
> [   12.152546]  </TASK>
> [   12.152546] ---[ end trace 0000000000000000 ]---
> [   12.152547] device-mapper: ioctl: unable to set up device queue for new table.
> 
> Reverting 8e0ef4128694 ("dm: use queue_limits_set") makes it work.

Gah! Sorry about that. Does:

https://lore.kernel.org/linux-block/20240309164140.719752-1-hch@lst.de/

help?
Linus Torvalds March 11, 2024, 11:58 p.m. UTC | #4
On Mon, 11 Mar 2024 at 16:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> My desktop fails to decrypt /home on boot with this:

Yup. Same here. I'm actually in the middle of bisect, just got to the
block pull, and was going to report that the pull was broken.

I don't have a full bisect done yet.

              Linus
Jens Axboe March 12, 2024, 12:02 a.m. UTC | #5
On 3/11/24 5:58 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 16:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> My desktop fails to decrypt /home on boot with this:
> 
> Yup. Same here. I'm actually in the middle of bisect, just got to the
> block pull, and was going to report that the pull was broken.
> 
> I don't have a full bisect done yet.

Just revert that commit for now. Christoph has a pending fix, but it
wasn't reported against this pretty standard use case. Very odd that
we haven't seen that yet.

Sorry about that!
Linus Torvalds March 12, 2024, 12:21 a.m. UTC | #6
On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
>
> Just revert that commit it for now.

Done.

I have to say, this is *not* some odd config here. It's literally a
default Fedora setup with encrypted volumes.

So the fact that this got reported after I merged this shows a
complete lack of testing.

It also makes me suspect that you do all your performance-testing on
something that may show great performance, but isn't perhaps the best
thing to actually use.

May I suggest you start looking at encrypted volumes, and do your
performance work on those for a while?

Yes, it will suck to see the crypto overhead, but hey, it's also real
life for real loads, so...

             Linus
Mike Snitzer March 12, 2024, 12:25 a.m. UTC | #7
On Mon, Mar 11 2024 at  8:02P -0400,
Jens Axboe <axboe@kernel.dk> wrote:

> On 3/11/24 5:58 PM, Linus Torvalds wrote:
> > On Mon, 11 Mar 2024 at 16:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>
> >> My desktop fails to decrypt /home on boot with this:
> > 
> > Yup. Same here. I'm actually in the middle of bisect, just got to the
> > block pull, and was going to report that the pull was broken.
> > 
> > I don't have a full bisect done yet.
> 
> Just revert that commit it for now. Christoph has a pending fix, but it
> wasn't reported against this pretty standard use case.

That fix is specific to discards being larger than supported (and FYI,
I did include it in the dm-6.9 pull request).

But Hannes' backtrace points to block/blk-settings.c:192 which is:

                if (WARN_ON_ONCE(lim->max_segment_size &&
                                 lim->max_segment_size != UINT_MAX))
                        return -EINVAL;
                lim->max_segment_size = UINT_MAX;

> Very odd that we haven't seen that yet.

It is odd.  dm-6.9 is based on the block tree for 6.9, which included
8e0ef412869 ("dm: use queue_limits_set").  And I ran the full
cryptsetup testsuite against my for-next branch to validate dm-crypt
and dm-verity working with Tejun's BH workqueue changes.

I agree with reverting commit 8e0ef412869 -- but again hch's fix was
for something else and really can stand on its own even independent of
commit 8e0ef412869.

Mike
Mike Snitzer March 12, 2024, 12:28 a.m. UTC | #8
On Mon, Mar 11 2024 at  8:21P -0400,
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > Just revert that commit it for now.
> 
> Done.
> 
> I have to say, this is *not* some odd config here. It's literally a
> default Fedora setup with encrypted volumes.
> 
> So the fact that this got reported after I merged this shows a
> complete lack of testing.

Please see my other reply just now.  This breakage is new.  Obviously
cryptsetup's testsuite is lacking too because there wasn't any issue
for weeks now.

> It also makes me suspect that you do all your performance-testing on
> something that may show great performance, but isn't perhaps the best
> thing to actually use.
> 
> May I suggest you start looking at encrypted volumes, and do your
> performance work on those for a while?
> 
> Yes, it will suck to see the crypto overhead, but hey, it's also real
> life for real loads, so...

All for Jens being made to suffer with dm-crypt but I think we need a
proper root cause of what is happening for you and Johannes ;)

Mike
Jens Axboe March 12, 2024, 1:01 a.m. UTC | #9
On 3/11/24 6:21 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Just revert that commit it for now.
> 
> Done.

Thanks!

> I have to say, this is *not* some odd config here. It's literally a
> default Fedora setup with encrypted volumes.

Oh I realize that, which is why I'm so puzzled why it was broken. It's
probably THE most common laptop setup.

> So the fact that this got reported after I merged this shows a
> complete lack of testing.

Mike reviewed AND tested the whole thing, so you are wrong. I see he's
also responded with that. Why we had this fallout is not yet known, but
we'll certainly figure it out.

> It also makes me suspect that you do all your performance-testing on
> something that may show great performance, but isn't perhaps the best
> thing to actually use.

I do that on things that I use, and what's being used in production.
This obviously includes the block core and bits that use it, and on the
storage front mostly nvme these days. I tested dm scalability and
performance with Mike some months ago, and md is a regular thing too. In
fact some of the little tweaks in this current series benefit the distro
configurations quite a bit, which is obviously what customers/users tend
to run. It's all being worked up through the stack.

crypt is fine and all for laptop usage, but I haven't otherwise seen it
used.

> May I suggest you start looking at encrypted volumes, and do your
> performance work on those for a while?
> 
> Yes, it will suck to see the crypto overhead, but hey, it's also real
> life for real loads, so...

Honestly, my knee jerk reaction was "pfft I don't think so" as it's not
an interesting use case to me. I'd be very surprised if it wasn't all
lower hanging DM related fruits here, and maybe it's things like a
single decrypt/encrypt pipeline. Maybe that's out of necessity, maybe
it's an implementation detail that could get fixed.

That said, it certainly would be interesting to look at. But also
probably something that requires rewriting it from scratch, probably as a
dm-crypt-v2 or something. Maybe? Pure handwaving.

What would make me do that is if I had to use it myself. Without that
motivation, there's not a lot of time leftover to throw at an area where
I suspect performance is perhaps Good Enough that people don't complain
about it, particularly because the use case is what it is.
Jens Axboe March 12, 2024, 1:03 a.m. UTC | #10
On 3/11/24 6:28 PM, Mike Snitzer wrote:
> On Mon, Mar 11 2024 at  8:21P -0400,
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> Just revert that commit it for now.
>>
>> Done.
>>
>> I have to say, this is *not* some odd config here. It's literally a
>> default Fedora setup with encrypted volumes.
>>
>> So the fact that this got reported after I merged this shows a
>> complete lack of testing.
> 
> Please see my other reply just now.  This breakage is new.  Obviously
> cryptsetup's testsuite is lacking too because there wasn't any issue
> for weeks now.

Yep, agree on that, if it breaks for maybe the first few people booting
it with dm-crypt, then the testing certainly should've caught it.

>> It also makes me suspect that you do all your performance-testing on
>> something that may show great performance, but isn't perhaps the best
>> thing to actually use.
>>
>> May I suggest you start looking at encrypted volumes, and do your
>> performance work on those for a while?
>>
>> Yes, it will suck to see the crypto overhead, but hey, it's also real
>> life for real loads, so...
> 
> All for Jens being made to suffer with dm-crypt but I think we need a
> proper root cause of what is happening for you and Johannes ;)

Hah, yes. Does current -git without that revert boot for you? I'm
assuming you have a dm-crypt setup on your laptop :-)
Christoph Hellwig March 12, 2024, 1:09 a.m. UTC | #11
On Mon, Mar 11, 2024 at 08:28:50PM -0400, Mike Snitzer wrote:
> All for Jens being made to suffer with dm-crypt but I think we need a
> proper root cause of what is happening for you and Johannes ;)

I'm going to try to stay out of the cranking, but I think the reason is
that the limits stacking inherits the max_segment_size, nvme has weird
rules for them due to their odd PRPs, and dm-crypt sets its own
max_segment_size to split out each page.  The regression here is
that we now actually verify that conflict.
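
To make the conflict concrete, here is a simplified userspace
illustration (not the kernel code; the struct below only mimics the two
relevant queue_limits fields, and the stacking/validation steps are
condensed from blk_stack_limits() and blk_validate_limits()):

	#include <limits.h>
	#include <stdio.h>

	#define PAGE_SIZE 4096u

	struct limits {
		unsigned int max_segment_size;
		unsigned long virt_boundary_mask;
	};

	#define min_not_zero(a, b) \
		((a) == 0 ? (b) : ((b) == 0 ? (a) : ((a) < (b) ? (a) : (b))))

	int main(void)
	{
		/* dm-crypt: caps each segment at one page */
		struct limits t = { .max_segment_size = PAGE_SIZE };
		/* nvme-pci: PRP lists need a virt boundary, segment size
		 * left at UINT_MAX */
		struct limits b = {
			.max_segment_size = UINT_MAX,
			.virt_boundary_mask = PAGE_SIZE - 1,
		};

		/* blk_stack_limits()-style combination of top and bottom */
		t.max_segment_size = min_not_zero(t.max_segment_size,
						  b.max_segment_size);
		t.virt_boundary_mask = min_not_zero(t.virt_boundary_mask,
						    b.virt_boundary_mask);

		/* blk_validate_limits()-style check that now fires */
		if (t.virt_boundary_mask && t.max_segment_size != UINT_MAX)
			printf("EINVAL: virt_boundary set, max_segment_size %u\n",
			       t.max_segment_size);
		return 0;
	}

Run as is, this prints the rejection case, which is the WARN at
block/blk-settings.c:192 in the boot trace above.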

So this happens only for dm-crypt on nvme.  The fix is probably
to not inherit low-level limits like max_segment_size, but I need
to think about it a bit more and come up with an automated test case
using say nvme-loop.

So for now the revert is the right thing.
Jens Axboe March 12, 2024, 1:17 a.m. UTC | #12
On 3/11/24 7:09 PM, Christoph Hellwig wrote:
> On Mon, Mar 11, 2024 at 08:28:50PM -0400, Mike Snitzer wrote:
>> All for Jens being made to suffer with dm-crypt but I think we need a
>> proper root cause of what is happening for you and Johannes ;)
> 
> I'm going to try to stay out of the cranking, but I think the reason is
> that the limits stacking inherits the max_segment_size, nvme has weird
> rules for them due their odd PRPs, and dm-crypt set it's own
> max_segment_size to split out each page.  The regression here is
> that we now actually verify that conflict.
> 
> So this happens only for dm-crypt on nvme.  The fix is probably
> to not inherit low-level limits like max_segment_size, but I need
> to think about it a bit more and come up with an automated test case
> using say nvme-loop.

That does seem like the most plausible explanation, I'm just puzzled why
nobody hit it before it landed in Linus's tree. I know linux-next isn't
THAT well runtime tested, but still. That aside, obviously the usual
test cases should've hit it. Unless that was all on non-nvme storage,
which is of course possible.

> So for now the revert is the right thing.

Yup
Linus Torvalds March 12, 2024, 1:20 a.m. UTC | #13
On Mon, 11 Mar 2024 at 18:17, Jens Axboe <axboe@kernel.dk> wrote:
>
> That does seem like the most plausible explanation, I'm just puzzled why
> nobody hit it before it landed in Linus's tree.

Yeah, who _doesn't_ have nvme drives in their system today?

What odd hardware are people running?

             Linus
Jens Axboe March 12, 2024, 1:23 a.m. UTC | #14
On 3/11/24 7:20 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 18:17, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> That does seem like the most plausible explanation, I'm just puzzled why
>> nobody hit it before it landed in Linus's tree.
> 
> Yeah, who _doesn't_ have nvme drives in their system today?
> 
> What odd hardware are people running?

Maybe older SATA based flash? But I haven't seen any of those in years.
Or, god forbid, rotational storage?

Various NVMe devices do have different limits for things like max
transfer size etc, so if it's related to that, then it is possible that
nvme was used but just didn't trigger on that test case. Out of
curiosity, on your box where it broke, what does:

grep . /sys/block/nvme0n1/queue/*

say?
Linus Torvalds March 12, 2024, 1:28 a.m. UTC | #15
On Mon, 11 Mar 2024 at 18:23, Jens Axboe <axboe@kernel.dk> wrote:
>
> > What odd hardware are people running?
>
> Maybe older SATA based flash? But I haven't seen any of those in years.
> Or, god forbid, rotational storage?

Christ. I haven't touched rotating rust in like twenty years by now.

I feel dirty just thinking about it.

> Out of curiosity, on your box where it broken, what does:
>
> grep . /sys/block/nvme0n1/queue/*
>
> say?

Appended.

FWIW, it's a 4TB Samsung 990 PRO (and not in a laptop, this is my Threadripper).

                     Linus

---
/sys/block/nvme0n1/queue/add_random:0
/sys/block/nvme0n1/queue/chunk_sectors:0
/sys/block/nvme0n1/queue/dax:0
/sys/block/nvme0n1/queue/discard_granularity:512
/sys/block/nvme0n1/queue/discard_max_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_max_hw_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_zeroes_data:0
/sys/block/nvme0n1/queue/dma_alignment:3
/sys/block/nvme0n1/queue/fua:1
/sys/block/nvme0n1/queue/hw_sector_size:512
/sys/block/nvme0n1/queue/io_poll:0
/sys/block/nvme0n1/queue/io_poll_delay:-1
/sys/block/nvme0n1/queue/iostats:1
/sys/block/nvme0n1/queue/io_timeout:30000
/sys/block/nvme0n1/queue/logical_block_size:512
/sys/block/nvme0n1/queue/max_discard_segments:256
/sys/block/nvme0n1/queue/max_hw_sectors_kb:128
/sys/block/nvme0n1/queue/max_integrity_segments:1
/sys/block/nvme0n1/queue/max_sectors_kb:128
/sys/block/nvme0n1/queue/max_segments:33
/sys/block/nvme0n1/queue/max_segment_size:4294967295
/sys/block/nvme0n1/queue/minimum_io_size:512
/sys/block/nvme0n1/queue/nomerges:0
/sys/block/nvme0n1/queue/nr_requests:1023
/sys/block/nvme0n1/queue/nr_zones:0
/sys/block/nvme0n1/queue/optimal_io_size:0
/sys/block/nvme0n1/queue/physical_block_size:512
/sys/block/nvme0n1/queue/read_ahead_kb:128
/sys/block/nvme0n1/queue/rotational:0
/sys/block/nvme0n1/queue/rq_affinity:1
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq
/sys/block/nvme0n1/queue/stable_writes:0
/sys/block/nvme0n1/queue/virt_boundary_mask:4095
/sys/block/nvme0n1/queue/wbt_lat_usec:2000
/sys/block/nvme0n1/queue/write_cache:write back
/sys/block/nvme0n1/queue/write_same_max_bytes:0
/sys/block/nvme0n1/queue/write_zeroes_max_bytes:0
/sys/block/nvme0n1/queue/zone_append_max_bytes:0
/sys/block/nvme0n1/queue/zoned:none
/sys/block/nvme0n1/queue/zone_write_granularity:0
Jens Axboe March 12, 2024, 1:37 a.m. UTC | #16
On 3/11/24 7:28 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 18:23, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> What odd hardware are people running?
>>
>> Maybe older SATA based flash? But I haven't seen any of those in years.
>> Or, god forbid, rotational storage?
> 
> Christ. I haven't touched rotating rust in like twenty years by now.
> 
> I feel dirty just thinking about it.
> 
>> Out of curiosity, on your box where it broken, what does:
>>
>> grep . /sys/block/nvme0n1/queue/*
>>
>> say?
> 
> Appended.
> 
> FWIW, it's a 4TB Samsung 990 PRO (and not in a laptop, this is my
> Threadripper).

Summary is that this is obviously a pretty normal drive, and has the
128K transfer limit that's common there. So doesn't really explain
anything in that regard. The segment size is also a bit odd at 33. The
only samsung I have here is a 980 pro, which has a normal 512K limit and
128 segments.

Oh well, we'll figure out what the hell went wrong, side channels are
ongoing.
Christoph Hellwig March 12, 2024, 11:52 a.m. UTC | #17
On Mon, Mar 11, 2024 at 06:20:01PM -0700, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 18:17, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > That does seem like the most plausible explanation, I'm just puzzled why
> > nobody hit it before it landed in Linus's tree.
> 
> Yeah, who _doesn't_ have nvme drives in their system today?
> 
> What odd hardware are people running?

Whatever shows up in the test VMs, or what is set up by the automated
tests.
Christoph Hellwig March 12, 2024, 11:53 a.m. UTC | #18
On Mon, Mar 11, 2024 at 07:23:41PM -0600, Jens Axboe wrote:
> Various NVMe devices do have different limits for things like max
> transfer size etc, so if it's related to that, then it is possible that
> nvme was used but just didn't trigger on that test case. Out of
> curiosity, on your box where it broken, what does:

All nvme-pci setups with dm-crypt would trigger this.
Mike Snitzer March 12, 2024, 3:22 p.m. UTC | #19
On Mon, Mar 11 2024 at  9:09P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Mar 11, 2024 at 08:28:50PM -0400, Mike Snitzer wrote:
> > All for Jens being made to suffer with dm-crypt but I think we need a
> > proper root cause of what is happening for you and Johannes ;)
> 
> I'm going to try to stay out of the cranking, but I think the reason is
> that the limits stacking inherits the max_segment_size, nvme has weird
> rules for them due their odd PRPs, and dm-crypt set it's own
> max_segment_size to split out each page.  The regression here is
> that we now actually verify that conflict.
> 
> So this happens only for dm-crypt on nvme.  The fix is probably
> to not inherit low-level limits like max_segment_size, but I need
> to think about it a bit more and come up with an automated test case
> using say nvme-loop.

Yeah, I generally agree.

I looked at the latest code to more fully understand why this failed.

1) dm-crypt.c:crypt_io_hints() sets limits->max_segment_size = PAGE_SIZE;

2) drivers/nvme/host/core.c:nvme_set_ctrl_limits() sets:
   lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
   lim->max_segment_size = UINT_MAX;

3) blk_stack_limits(t=dm-crypt, b=nvme-pci) will combine limits:
        t->virt_boundary_mask = min_not_zero(t->virt_boundary_mask,
                                            b->virt_boundary_mask);
        t->max_segment_size = min_not_zero(t->max_segment_size,
                                           b->max_segment_size);

4) blk_validate_limits() will reject the limits that
   blk_stack_limits() created:
        /*
         * Devices that require a virtual boundary do not support scatter/gather
         * I/O natively, but instead require a descriptor list entry for each
         * page (which might not be identical to the Linux PAGE_SIZE).  Because
         * of that they are not limited by our notion of "segment size".
         */
	if (lim->virt_boundary_mask) {
                if (WARN_ON_ONCE(lim->max_segment_size &&
                                 lim->max_segment_size != UINT_MAX))
                        return -EINVAL;
                lim->max_segment_size = UINT_MAX;
	} else {
                /*
                 * The maximum segment size has an odd historic 64k default that
                 * drivers probably should override.  Just like the I/O size we
                 * require drivers to at least handle a full page per segment.
                 */
		if (!lim->max_segment_size)
                        lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
                if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
                	return -EINVAL;
        }

blk_validate_limits() is currently very pedantic. I discussed with Jens
briefly and we're thinking it might make sense for blk_validate_limits()
to be more forgiving by _not_ imposing hard -EINVAL failure.  That in
the interim, during this transition to more curated and atomic limits, a
WARN_ON_ONCE() splat should serve as enough notice to developers (be it
lower level nvme or higher-level virtual devices like DM).

BUT for this specific max_segment_size case, the constraints of dm-crypt
are actually more conservative due to crypto requirements. Yet nvme's
more general "don't care, but will care if non-nvme driver does" for
this particular max_segment_size limit is being imposed when validating
the combined limits that dm-crypt will impose at the top-level.

All said, the above "if (lim->virt_boundary_mask)" check in
blk_validate_limits() looks bogus for stacked device limits.

Mike
Jens Axboe March 12, 2024, 3:25 p.m. UTC | #20
On 3/12/24 5:53 AM, Christoph Hellwig wrote:
> On Mon, Mar 11, 2024 at 07:23:41PM -0600, Jens Axboe wrote:
>> Various NVMe devices do have different limits for things like max
>> transfer size etc, so if it's related to that, then it is possible that
>> nvme was used but just didn't trigger on that test case. Out of
>> curiosity, on your box where it broken, what does:
> 
> All nvme-pci setups with dm-crypt would trigger this.

This is most likely the key, basically all test suites are run in a vm
these days, and not on raw nvme devices...
Keith Busch March 12, 2024, 4:28 p.m. UTC | #21
On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> 4) blk_validate_limits() will reject the limits that
>    blk_stack_limits() created:
>         /*
>          * Devices that require a virtual boundary do not support scatter/gather
>          * I/O natively, but instead require a descriptor list entry for each
>          * page (which might not be identical to the Linux PAGE_SIZE).  Because
>          * of that they are not limited by our notion of "segment size".
>          */
> 	if (lim->virt_boundary_mask) {
>                 if (WARN_ON_ONCE(lim->max_segment_size &&
>                                  lim->max_segment_size != UINT_MAX))
>                         return -EINVAL;
>                 lim->max_segment_size = UINT_MAX;
> 	} else {
>                 /*
>                  * The maximum segment size has an odd historic 64k default that
>                  * drivers probably should override.  Just like the I/O size we
>                  * require drivers to at least handle a full page per segment.
>                  */
> 		if (!lim->max_segment_size)
>                         lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
>                 if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
>                 	return -EINVAL;
>         }
> 
> blk_validate_limits() is currently very pedantic. I discussed with Jens
> briefly and we're thinking it might make sense for blk_validate_limits()
> to be more forgiving by _not_ imposing hard -EINVAL failure.  That in
> the interim, during this transition to more curated and atomic limits, a
> WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> lower level nvme or higher-level virtual devices like DM).
> 
> BUT for this specific max_segment_size case, the constraints of dm-crypt
> are actually more conservative due to crypto requirements. Yet nvme's
> more general "don't care, but will care if non-nvme driver does" for
> this particular max_segment_size limit is being imposed when validating
> the combined limits that dm-crypt will impose at the top-level.
> 
> All said, the above "if (lim->virt_boundary_mask)" check in
> blk_validate_limits() looks bogus for stacked device limits.

Yes, I think you're right. I can't tell why this check makes sense for
any device, not just stacked ones. It could verify lim->max_segment_size
is >= virt_boundary_mask, but to require it be UINT_MAX doesn't look
necessary.
Keith Busch March 12, 2024, 4:39 p.m. UTC | #22
On Mon, Mar 11, 2024 at 07:37:06PM -0600, Jens Axboe wrote:
> Summary is that this is obviously a pretty normal drive, and has the
> 128K transfer limit that's common there. So doesn't really explain
> anything in that regard. The segment size is also a bit odd at 33.

That's "max_segments" at 33, not segment size. Max payload is 128k,
divide by 4k nvme page size = 32 nvme pages. +1 to allow a first segment
offset, so 33 max segments for this device.
Christoph Hellwig March 12, 2024, 9:10 p.m. UTC | #23
On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> blk_validate_limits() is currently very pedantic. I discussed with Jens
> briefly and we're thinking it might make sense for blk_validate_limits()
> to be more forgiving by _not_ imposing hard -EINVAL failure.  That in
> the interim, during this transition to more curated and atomic limits, a
> WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> lower level nvme or higher-level virtual devices like DM).

I guess.  And it more closely matches the status quo.  That being said
I want to move to hard rejection eventually to catch all the issues.

> BUT for this specific max_segment_size case, the constraints of dm-crypt
> are actually more conservative due to crypto requirements.

Honestly, to me the dm-crypt requirement actually doesn't make much
sense: max_segment_size is for hardware drivers that have requirements
for SGLs or equivalent hardware interfaces.  If dm-crypt never wants to
see more than a single page per bio_vec it should just always iterate
them using bio_for_each_segment.
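
For reference, the iteration that suggestion amounts to (the helper name
below is made up; the point is only the bio_for_each_segment() usage,
which hands out single-page bvecs so bv.bv_len never exceeds PAGE_SIZE):

	#include <linux/bio.h>

	/* Hypothetical helper: walk a bio one single-page segment at a
	 * time, independent of the queue's max_segment_size. */
	static unsigned int count_single_page_segments(struct bio *bio)
	{
		struct bio_vec bv;
		struct bvec_iter iter;
		unsigned int nr = 0;

		bio_for_each_segment(bv, bio, iter)
			nr++;

		return nr;
	}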

> Yet nvme's
> more general "don't care, but will care if non-nvme driver does" for
> this particular max_segment_size limit is being imposed when validating
> the combined limits that dm-crypt will impose at the top-level.

The real problem is that we combine the limits while we shouldn't.
Ever since we've supported immutable biovecs and do the splitting
down in blk-mq there is no point to even inherit such limits in the
upper drivers.
Mike Snitzer March 12, 2024, 10:22 p.m. UTC | #24
On Tue, Mar 12 2024 at  5:10P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> > blk_validate_limits() is currently very pedantic. I discussed with Jens
> > briefly and we're thinking it might make sense for blk_validate_limits()
> > to be more forgiving by _not_ imposing hard -EINVAL failure.  That in
> > the interim, during this transition to more curated and atomic limits, a
> > WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> > lower level nvme or higher-level virtual devices like DM).
> 
> I guess.  And it more closely matches the status quo.  That being said
> I want to move to hard rejection eventually to catch all the issues.
> 
> > BUT for this specific max_segment_size case, the constraints of dm-crypt
> > are actually more conservative due to crypto requirements.
> 
> Honestly, to me the dm-crypt requirement actually doesn't make much
> sense: max_segment_size is for hardware drivers that have requirements
> for SGLs or equivalent hardware interfaces.  If dm-crypt never wants to
> see more than a single page per bio_vec it should just always iterate
> them using bio_for_each_segment.
> 
> > Yet nvme's
> > more general "don't care, but will care if non-nvme driver does" for
> > this particular max_segment_size limit is being imposed when validating
> > the combined limits that dm-crypt will impose at the top-level.
> 
> The real problem is that we combine the limits when we shouldn't.
> Ever since we've supported immutable biovecs and done the splitting
> down in blk-mq, there has been no point in even inheriting such limits
> in the upper drivers.

immutable biovecs, late splitting and blk-mq aren't a factor.

dm-crypt has to contend with the crypto subsystem and HW crypto
engines that have their own constraints.
Christoph Hellwig March 12, 2024, 10:30 p.m. UTC | #25
On Tue, Mar 12, 2024 at 06:22:21PM -0400, Mike Snitzer wrote:
> > The real problem is that we combine the limits when we shouldn't.
> > Ever since we've supported immutable biovecs and done the splitting
> > down in blk-mq, there has been no point in even inheriting such limits
> > in the upper drivers.
> 
> immutable biovecs, late splitting and blk-mq aren't a factor.
> 
> dm-crypt has to contend with the crypto subsystem and HW crypto
> engines that have their own constraints.

Yes, they are.  The limit for the underlying device does not matter
for an upper device as it will split later.  And that's not just my
opinion, you also clearly stated that in the commit adding the
limits (586b286b110e94e).  We should have stopped inheriting all
these limits that are only relevant for splitting when we switched to
immutable bvecs.  I don't know why we didn't, but a big part of
that might be that we never made it clear which limits these are.
Mike Snitzer March 12, 2024, 10:50 p.m. UTC | #26
On Tue, Mar 12 2024 at  6:30P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Mar 12, 2024 at 06:22:21PM -0400, Mike Snitzer wrote:
> > > The real problem is that we combine the limits when we shouldn't.
> > > Ever since we've supported immutable biovecs and done the splitting
> > > down in blk-mq, there has been no point in even inheriting such limits
> > > in the upper drivers.
> > 
> > immutable biovecs, late splitting and blk-mq aren't a factor.
> > 
> > dm-crypt has to contend with the crypto subsystem and HW crypto
> > engines that have their own constraints.
> 
> Yes, they are.  The limit for the underlying device does not matter
> for an upper device as it will split later.  And that's not just my
> opinion, you also clearly stated that in the commit adding the
> limits (586b286b110e94e).  We should have stopped inheriting all
> these limits that are only relevant for splitting when we switched to
> immutable bvecs.  I don't know why we didn't, but a big part of
> that might be that we never made it clear which limits these are.

Wow, using my 8+ year old commit message against me ;)

I've honestly paged most of this out, but I'll revisit, likely with
Mikulas, to pin this down better and then see what's possible.
Christoph Hellwig March 12, 2024, 10:58 p.m. UTC | #27
On Tue, Mar 12, 2024 at 06:50:51PM -0400, Mike Snitzer wrote:
> Wow, using my 8+ year old commit message against me ;)

Or for you :)

> I've honestly paged most of this out, but I'll revisit, likely with
> Mikulas, to pin this down better and then see what's possible.

FYI, I don't think this is really a dm issue, but one of block
infrastructure.  But looping in the dm and md maintainers, as well
as Martin, who wrote the stacking code originally, is definitely
a good idea.
Ming Lei March 13, 2024, 1:11 p.m. UTC | #28
On Tue, Mar 12, 2024 at 02:10:13PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> > blk_validate_limits() is currently very pedantic. I discussed with Jens
> > briefly and we're thinking it might make sense for blk_validate_limits()
> > to be more forgiving by _not_ imposing hard -EINVAL failure.  That in
> > the interim, during this transition to more curated and atomic limits, a
> > WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> > lower level nvme or higher-level virtual devices like DM).
> 
> I guess.  And it more closely matches the status quo.  That being said
> I want to move to hard rejection eventually to catch all the issues.
> 
> > BUT for this specific max_segment_size case, the constraints of dm-crypt
> > are actually more conservative due to crypto requirements.
> 
> Honestly, to me the dm-crypt requirement actually doesn't make much
> sense: max_segment_size is for hardware drivers that have requirements
> for SGLs or equivalent hardware interfaces.  If dm-crypt never wants to
> see more than a single page per bio_vec it should just always iterate
> them using bio_for_each_segment.
> 
> > Yet nvme's
> > more general "don't care, but will care if non-nvme driver does" for
> > this particular max_segment_size limit is being imposed when validating
> > the combined limits that dm-crypt will impose at the top-level.
> 
> The real problem is that we combine the limits when we shouldn't.
> Ever since we've supported immutable biovecs and done the splitting
> down in blk-mq, there has been no point in even inheriting such limits
> in the upper drivers.

In theory, yes: DM doesn't even use the combined segment size &
virt boundary, but MD does (maybe unnecessarily), and the two are
often stacked.

There may be corner cases, and removing the combination of the two
limits would be a big change for DM/MD since things have worked this
way for a long time.

The warning & failure in blk_validate_limits() can break any MD/DM
stacked over scsi & nvme, so I'd suggest removing the 'warning &
-EINVAL' first, otherwise more complaints may follow.
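
Something as simple as this would do for now (sketch only, not a
posted patch): default the limit quietly instead of warning and
returning -EINVAL, so existing MD/DM stacks over scsi & nvme keep
working:

	if (!lim->max_segment_size || lim->max_segment_size < PAGE_SIZE)
		lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;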


Thanks,
Ming
Mike Snitzer April 11, 2024, 8:15 p.m. UTC | #29
Hi,

I'd like to get extra review and testing for these changes given how
DM's use of queue_limits_set broke Linus's dm-crypt on NVMe setup
during the 6.9 merge window.

These changes have been staged in linux-next via linux-dm.git and,
while they should apply cleanly on 6.9-rcX, they have been applied
on top of dm-6.10, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.10

Thanks,
Mike

Christoph Hellwig (1):
  dm: use queue_limits_set

Mike Snitzer (1):
  dm-crypt: stop constraining max_segment_size to PAGE_SIZE

 drivers/md/dm-crypt.c | 12 ++----------
 drivers/md/dm-table.c | 27 ++++++++++++---------------
 2 files changed, 14 insertions(+), 25 deletions(-)
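
For reference, the rough shape of the queue_limits_set conversion in
dm-table.c (sketch from memory, not the actual hunk):

	int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
				      struct queue_limits *limits)
	{
		int r;

		/* validate and commit the stacked limits in one atomic update */
		r = queue_limits_set(q, limits);
		if (r)
			return r;

		/* remaining non-limit restrictions unchanged */
		return 0;
	}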