[GIT,PULL,for,v6.7] vfs misc updates

Message ID 20231027-vfs-misc-7ebef2b5a462@brauner (mailing list archive)
State New, archived

Pull-request

git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7.misc

Message

Christian Brauner Oct. 27, 2023, 2:30 p.m. UTC
Hey Linus,

/* Summary */
This contains the usual miscellaneous features, cleanups, and fixes for
vfs and individual fses.

Features
========

* Rename and export helpers that get write access to a mount. They are
  used in overlayfs to get write access to the upper mount.
* Print the pretty name of the root device on boot failure. This helps
  in scenarios where we would usually only print "unknown-block(1,2)".
* Add an internal SB_I_NOUMASK flag. This is another part in the endless
  POSIX ACL saga in a way.

  When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the
  umask because if the relevant inode has POSIX ACLs set it might take
  the umask from there. But if the inode doesn't have any POSIX ACLs
  set then we apply the umask in the filesystem itself. So we end up with:

  (1) no SB_POSIXACL -> strip umask in vfs
  (2) SB_POSIXACL    -> strip umask in filesystem

  The umask semantics associated with SB_POSIXACL allowed filesystems
  that don't even support POSIX ACLs at all to raise SB_POSIXACL purely
  to avoid umask stripping. That specifically means NFS v4 and
  Overlayfs. NFS v4 does it because it delegates this to the server and
  Overlayfs because it needs to delegate umask stripping to the upper
  filesystem, i.e., the filesystem used as the writable layer.

  This went so far that SB_POSIXACL is raised even on kernels that don't
  even have POSIX ACL support at all.

  Stop this blatant abuse and add SB_I_NOUMASK which is an internal
  superblock flag that filesystems can raise to opt out of umask
  handling. That should really only be the two mentioned above. It's not
  that we want any filesystems to do this. Ideally we have all umask
  handling always in the vfs.
* Make overlayfs use SB_I_NOUMASK too.
* Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
  IS_POSIXACL() if the kernel doesn't have support for it. This is a
  very old patch but it's only possible to do this now with the wider
  cleanup that was done.
* Follow-up work on fake path handling from last cycle. Citing mostly
  from Amir:

  When overlayfs was first merged, overlayfs files of regular files and
  directories, the ones that are installed in the file table, had a "fake"
  path, namely, f_path is the overlayfs path and f_inode is the "real"
  inode on the underlying filesystem.

  In v6.5, we took another small step by introducing the backing_file
  container and the file_real_path() helper.  This change allowed vfs
  and filesystem code to get the "real" path of an overlayfs backing
  file. With this change, we were able to make fsnotify work correctly
  and report events on the "real" filesystem objects that were accessed
  via overlayfs.

  This method works fine, but it still leaves the vfs vulnerable to new
  code that is not aware of files with a fake path.  A recent example is
  commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the
  i_version"). This commit references f_path directly in IMA code
  that otherwise uses file_inode() and file_dentry() to reference the
  filesystem objects that it is measuring.

  This pull request contains work to switch things around: instead of
  having filesystem code opt-in to get the "real" path, have generic
  code opt-in for the "fake" path in the few places that it is needed.

  It is far more likely that new filesystem code that does not use the
  file_dentry() and file_real_path() helpers will end up causing crashes
  or averting LSM/audit rules if we keep the "fake" path exposed by
  default.

  This change already makes file_dentry() moot, but for now we did not
  change this helper. We just added a WARN_ON() in ovl_d_real() to catch
  any wrong assumptions we may have made.

  After the dust settles on this change, we can make file_dentry() a
  plain accessor and we can drop the inode argument to ->d_real().
* Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks
  like a small change but it really isn't and I would like to see
  everyone on their tippie toes for any possible bugs from this work.

  Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU does for
  files for a very long time because of the nasty interactions with
  SCM_RIGHTS file descriptor garbage collection. So extending it
  makes a lot of sense but it is a subtle change. There are almost no
  places that fiddle with file rcu semantics directly and the ones that
  did mess around with struct file internals under rcu have been made to
  stop doing that because it really was always dodgy.

  I forgot to put in the link tag for this change and the
  discussion in the commit so adding it into the merge message:
  https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
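
To illustrate the SB_I_NOUMASK item above, here is a minimal userspace
sketch of the resulting umask decision. It is a model under stated
assumptions, not the kernel code: struct model_sb, the MODEL_* flag
values and model_strip_umask() are hypothetical names invented for this
example, and the acl_support parameter stands in for whether the kernel
was built with POSIX ACL support.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; these are not the kernel's definitions. */
#define MODEL_SB_POSIXACL  (1u << 0)  /* stands in for SB_POSIXACL */
#define MODEL_SB_I_NOUMASK (1u << 1)  /* stands in for SB_I_NOUMASK */

/* Hypothetical, stripped-down superblock. */
struct model_sb {
	unsigned int s_flags;   /* user-visible superblock flags */
	unsigned int s_iflags;  /* internal superblock flags */
};

/*
 * Returns the mode after the generic layer has had its chance to apply
 * the umask. acl_support models whether POSIX ACL support is built in.
 */
static unsigned int model_strip_umask(const struct model_sb *sb,
				      bool acl_support,
				      unsigned int mode, unsigned int umask)
{
	bool is_posixacl;

	assert(sb != NULL);

	/* Without ACL support the POSIX ACL flag is ignored entirely. */
	is_posixacl = acl_support && (sb->s_flags & MODEL_SB_POSIXACL);

	/* (1) No opt-out applies: strip the umask in the generic layer. */
	if (!is_posixacl && !(sb->s_iflags & MODEL_SB_I_NOUMASK))
		mode &= ~umask;

	/* (2) Otherwise the filesystem, server or upper layer handles it. */
	return mode;
}
```

In this model, a filesystem that sets only the POSIX ACL flag on a build
without ACL support gets the umask stripped in the generic layer, while
one that raises the internal no-umask flag keeps the mode untouched
regardless of ACL support.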

Cleanups
========

* Various smaller pipe cleanups including the removal of a spin lock
  that was only used to protect against writes without pipe_lock() from
  O_NOTIFICATION_PIPE aka watch queues. As that was never implemented,
  remove the additional locking from pipe_write().
* Annotate struct watch_filter with the new __counted_by attribute.
* Clarify do_unlinkat() cleanup so that it doesn't look like an extra
  iput() is done that would cause issues.
* Simplify file cleanup when the file has never been opened.
* Use module helper instead of open-coding it.
* Predict the error as unlikely for stale retry.
* Use WRITE_ONCE() for mount expiry field instead of just commenting
  that one hopes the compiler doesn't get smart.
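
For the WRITE_ONCE() item, a rough userspace sketch of the idea: a store
through a volatile-qualified lvalue has to be emitted as written, so the
compiler cannot turn a conditional store into an unconditional one. The
MODEL_* macros, struct model_mount and model_clear_expiry() are
hypothetical names for this example, and the conditional-clear shape is
an assumption about how such a field is typically used. The macros rely
on the __typeof__ extension available in gcc and clang.

```c
/*
 * Userspace approximations of once-only accessors: the access goes
 * through a volatile-qualified lvalue so it is performed exactly as
 * written in the source.
 */
#define MODEL_WRITE_ONCE(x, val) \
	(*(volatile __typeof__(x) *)&(x) = (val))
#define MODEL_READ_ONCE(x) \
	(*(const volatile __typeof__(x) *)&(x))

/* Hypothetical mount structure with only the field of interest. */
struct model_mount {
	int expiry_mark;
};

/*
 * Clear the mark only if it is set. Skipping the store in the common
 * case avoids dirtying the cache line; the volatile store keeps the
 * compiler from rewriting this as an unconditional write.
 */
static void model_clear_expiry(struct model_mount *m)
{
	if (MODEL_READ_ONCE(m->expiry_mark))
		MODEL_WRITE_ONCE(m->expiry_mark, 0);
}
```

The point of the sketch is that the guarantee is expressed in the code
itself rather than in a comment about what the compiler will hopefully
not do.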

Fixes
=====

* Fix readahead on block devices.
* Fix writeback when lazytime is enabled and inodes whose timestamp is
  the only thing that changed reside on wb->b_dirty_time. This caused
  an excessively large number of zombie memory cgroups when lazytime was
  enabled as such inodes weren't handled fast enough.
* Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups().
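
For the BUG_ON() conversion, a small userspace sketch of the general
pattern: instead of halting on an unexpected state, warn once and return
an error to the caller. MODEL_WARN_ON_ONCE(), model_warn_count and
model_last_lookup() are hypothetical names for this example, and the
error value is chosen purely for illustration. The macro uses the GNU
statement-expression extension (gcc and clang).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings so the once-only behaviour can be observed. */
static int model_warn_count;

/*
 * Warn the first time the condition is true at this call site, stay
 * silent afterwards, and always evaluate to the condition so the caller
 * can bail out.
 */
#define MODEL_WARN_ON_ONCE(cond) ({                              \
	static bool model_warned;                                \
	bool model_cond = !!(cond);                              \
	if (model_cond && !model_warned) {                       \
		model_warned = true;                             \
		model_warn_count++;                              \
		fprintf(stderr, "WARNING: %s at %s:%d\n",        \
			#cond, __FILE__, __LINE__);              \
	}                                                        \
	model_cond;                                              \
})

/*
 * Hypothetical lookup step: an unexpected state is reported once and
 * turned into an error return instead of stopping the whole system.
 */
static int model_last_lookup(bool unexpected_state)
{
	if (MODEL_WARN_ON_ONCE(unexpected_state))
		return -EINVAL;
	return 0;
}
```

Calling model_last_lookup(true) repeatedly returns the error every time
but only logs the first occurrence, which keeps a recurring bad state
from flooding the log.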

/* Testing */
clang: Debian clang version 16.0.6 (16)
gcc: gcc (Debian 13.2.0-5) 13.2.0

All patches are based on v6.6-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.

/* Conflicts */

## Merge Conflicts with other trees

[1] linux-next: manual merge of the integrity tree with the vfs-brauner tree
    https://lore.kernel.org/r/20231027131137.3051da98@canb.auug.org.au

At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.

The following changes since commit 0bb80ecc33a8fb5a682236443c1e740d5c917d1d:

  Linux 6.6-rc1 (2023-09-10 16:28:41 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7.misc

for you to fetch changes up to 61d4fb0b349ec1b33119913c3b0bd109de30142c:

  file, i915: fix file reference for mmap_singleton() (2023-10-25 22:17:04 +0200)

Please consider pulling these changes from the signed vfs-6.7.misc tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.7.misc

----------------------------------------------------------------
Amir Goldstein (5):
      fs: rename __mnt_{want,drop}_write*() helpers
      fs: export mnt_{get,put}_write_access() to modules
      fs: get mnt_writers count for an open backing file's real path
      fs: create helper file_user_path() for user displayed mapped file path
      fs: store real path instead of fake path in backing file f_path

Bernd Schubert (1):
      vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups

Christian Brauner (5):
      file: convert to SLAB_TYPESAFE_BY_RCU
      io_uring: use files_lookup_fd_locked()
      backing file: free directly
      ovl: rely on SB_I_NOUMASK
      file, i915: fix file reference for mmap_singleton()

Jeff Layton (1):
      fs: add a new SB_I_NOUMASK flag

Jianyong Wu (1):
      init/mount: print pretty name of root device when panics

Jingbo Xu (1):
      writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs

Kees Cook (1):
      watch_queue: Annotate struct watch_filter with __counted_by

Luís Henriques (1):
      fs: simplify misleading code to remove ambiguity regarding ihold()/iput()

Mateusz Guzik (3):
      vfs: shave work on failed file open
      vfs: predict the error in retry_estale as unlikely
      vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked

Max Kellermann (5):
      pipe: reduce padding in struct pipe_inode_info
      fs/pipe: move check to pipe_has_watch_queue()
      fs/pipe: remove unnecessary spinlock from pipe_write()
      fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
      fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n

Reuben Hawkins (1):
      vfs: fix readahead(2) on block devices

Uwe Kleine-König (1):
      chardev: Simplify usage of try_module_get()

 Documentation/filesystems/files.rst          |  53 +++++-----
 arch/arc/kernel/troubleshoot.c               |   6 +-
 arch/powerpc/platforms/cell/spufs/coredump.c |  11 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c     |   6 +-
 fs/char_dev.c                                |   2 +-
 fs/file.c                                    | 153 +++++++++++++++++++++++----
 fs/file_table.c                              |  49 +++++----
 fs/fs-writeback.c                            |  41 ++++---
 fs/gfs2/glock.c                              |  11 +-
 fs/init.c                                    |   6 +-
 fs/inode.c                                   |   8 +-
 fs/internal.h                                |  22 ++--
 fs/namei.c                                   |  31 ++----
 fs/namespace.c                               |  40 +++----
 fs/nfs/super.c                               |   2 +-
 fs/notify/dnotify/dnotify.c                  |   6 +-
 fs/open.c                                    |  52 ++++++---
 fs/overlayfs/super.c                         |  24 ++++-
 fs/pipe.c                                    |  64 ++++++-----
 fs/proc/base.c                               |   2 +-
 fs/proc/fd.c                                 |  11 +-
 fs/proc/nommu.c                              |   2 +-
 fs/proc/task_mmu.c                           |   4 +-
 fs/proc/task_nommu.c                         |   2 +-
 include/linux/fdtable.h                      |  17 +--
 include/linux/fs.h                           |  35 +++---
 include/linux/fsnotify.h                     |   3 +-
 include/linux/mount.h                        |   4 +-
 include/linux/namei.h                        |  26 ++++-
 include/linux/pipe_fs_i.h                    |  22 +++-
 include/linux/watch_queue.h                  |   2 +-
 init/do_mounts.c                             |   2 +-
 io_uring/openclose.c                         |   9 +-
 kernel/acct.c                                |   4 +-
 kernel/bpf/task_iter.c                       |   4 +-
 kernel/fork.c                                |   4 +-
 kernel/kcmp.c                                |   4 +-
 kernel/trace/trace_output.c                  |   2 +-
 mm/readahead.c                               |   3 +-
 39 files changed, 479 insertions(+), 270 deletions(-)

Comments

pr-tracker-bot@kernel.org Oct. 30, 2023, 8:05 p.m. UTC | #1
The pull request you sent on Fri, 27 Oct 2023 16:30:46 +0200:

> https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/3b3f874cc1d074bdcffc224d683925fd11808fe7

Thank you!