[059/163] tmpfs: per-superblock i_ino support

Message ID	20200807062020.G315bvcEC%akpm@linux-foundation.org (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=U9ty=BR=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 425AA22C9F
Date: Thu, 06 Aug 2020 23:20:20 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, amir73il@gmail.com, chris@chrisdown.name,
 hannes@cmpxchg.org, hughd@google.com, jlayton@kernel.org,
 linux-mm@kvack.org, mm-commits@vger.kernel.org, tj@kernel.org,
 torvalds@linux-foundation.org, viro@zeniv.linux.org.uk,
 willy@infradead.org
Subject: [patch 059/163] tmpfs: per-superblock i_ino support
Message-ID: <20200807062020.G315bvcEC%akpm@linux-foundation.org>
In-Reply-To: <20200806231643.a2711a608dd0f18bff2caf2b@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton Aug. 7, 2020, 6:20 a.m. UTC

From: Chris Down <chris@chrisdown.name>
Subject: tmpfs: per-superblock i_ino support

Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.  On
affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even having
as many as 150 duplicated inode numbers with differing file content.

This causes actual, tangible problems in production.  For example, we have
complaints from those working on remote caches that their application is
reporting cache corruptions because it uses (device, inodenum) to
establish the identity of a particular cache object, but because it's not
unique any more, the application refuses to continue and reports cache
corruption.  Even worse, sometimes applications may not even detect the
corruption but may continue anyway, causing phantom and hard-to-debug
behaviour.

In general, userspace applications expect that (device, inodenum) should
be enough to uniquely point to one inode, which seems fair enough.  One
might also need to check the generation, but in this case:

1. That's not currently exposed to userspace
   (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
   same inode number on one device.
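
To illustrate the assumption applications make, here is a minimal
userspace sketch of the (device, inodenum) identity check.  The helper
names same_identity() and paths_share_identity() are hypothetical and are
not taken from any of the affected applications:

```c
#include <stdbool.h>
#include <sys/stat.h>

/* True if both stat results carry the same (st_dev, st_ino) pair. */
static bool same_identity(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/*
 * Hypothetical helper: returns 1 if paths a and b resolve to the same
 * (st_dev, st_ino) pair, 0 if they do not, and -1 if either stat()
 * call fails.
 */
static int paths_share_identity(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;
	return same_identity(&sa, &sb) ? 1 : 0;
}
```

With duplicated inode numbers on one tmpfs mount, a check like this would
report two unrelated files as identical, which is the failure mode
described above.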

In order to mitigate this, we take a two-pronged approach:

1. Moving inum generation from being global to per-sb for tmpfs. This
   itself allows some reduction in i_ino churn. This works on both 64-
   and 32-bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
   64-bit ino_t only: we allow users to mount tmpfs with a new inode64
   option that uses the full width of ino_t, or to make that the default
   with CONFIG_TMPFS_INODE64.

You can see how this compares to previous related patches which didn't
implement this per-superblock:

- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/

This patch (of 2):

get_next_ino has a number of problems:

- It uses and returns a uint, which is susceptible to overflow if a lot
  of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
  means it's not that difficult to cause inode number wraparounds on a
  single device, which can result in having multiple distinct inodes
  with the same inode number.

This patch adds a per-superblock counter that mitigates the second case. 
This design also allows us to later have a specific i_ino size per-device,
for example, allowing users to choose whether to use 32- or 64-bit inodes
for each tmpfs mount.  This is implemented in the next commit.
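
As a rough userspace model of the difference, the sketch below contrasts
one shared 32-bit counter with a counter kept per superblock.  It is a
simplified illustration, not the kernel code: struct model_sb,
model_global_ino() and model_sb_ino() are hypothetical names, and
restarting the per-sb counter at 1 is only one way to emulate the 32-bit
wraparound.

```c
#include <stdint.h>

/* Hypothetical stand-in for one mounted tmpfs instance. */
struct model_sb {
	uint64_t next_ino;	/* next per-sb inode number to hand out */
};

/* One shared 32-bit counter, loosely modelled on get_next_ino(). */
static uint32_t global_next_ino;

static uint32_t model_global_ino(void)
{
	uint32_t ino = ++global_next_ino;

	if (ino == 0)		/* skip zero after wraparound */
		ino = ++global_next_ino;
	return ino;
}

/*
 * Per-sb allocation: skip values whose low 32 bits are zero and
 * restart the sequence at 1 once it passes UINT32_MAX.
 */
static uint64_t model_sb_ino(struct model_sb *sb)
{
	uint64_t ino = sb->next_ino++;

	if ((uint32_t)ino == 0)
		ino = sb->next_ino++;
	if (ino > UINT32_MAX) {
		sb->next_ino = 1;
		ino = sb->next_ino++;
	}
	return ino;
}
```

In this model, inodes created on one mount no longer consume numbers from
another mount's sequence, so a busy mount cannot push an idle one towards
wraparound.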

For internal shmem mounts which may be less tolerant to spinlock delays,
we implement a percpu batching scheme which only takes the stat_lock at
each batch boundary.
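
The batching idea can be modelled in userspace roughly as follows.  This
is a single-threaded sketch under stated assumptions: the struct and
function names are hypothetical, the CPU index is passed in by the caller
instead of coming from get_cpu(), and a counter marks where the kernel
would take stat_lock.

```c
#include <stdint.h>

#define MODEL_INO_BATCH	1024	/* same value as SHMEM_INO_BATCH */
#define MODEL_NR_CPUS	4	/* hypothetical fixed CPU count */

struct model_sbinfo {
	uint64_t next_ino;			/* start of the next free batch */
	uint64_t ino_batch[MODEL_NR_CPUS];	/* next number for each CPU */
	unsigned long slow_path;		/* times the lock would be taken */
};

/*
 * Hand out the next inode number for the given CPU slot.  The shared
 * counter is only touched when the slot sits on a batch boundary.
 * The caller must pass cpu < MODEL_NR_CPUS.
 */
static uint64_t model_batched_ino(struct model_sbinfo *sbi, unsigned int cpu)
{
	uint64_t *next = &sbi->ino_batch[cpu];
	uint64_t ino = *next;

	if (ino % MODEL_INO_BATCH == 0) {
		/* The kernel takes stat_lock around these two lines. */
		ino = sbi->next_ino;
		sbi->next_ino += MODEL_INO_BATCH;
		sbi->slow_path++;
		if ((uint32_t)ino == 0)
			ino++;		/* never hand out a zero inode number */
	}
	*next = ino + 1;
	return ino;
}
```

Each CPU slot thus enters the locked section roughly once per
MODEL_INO_BATCH allocations rather than on every allocation.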

Link: http://lkml.kernel.org/r/cover.1594661218.git.chris@chrisdown.name
Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/fs.h       |   15 ++++++++
 include/linux/shmem_fs.h |    2 +
 mm/shmem.c               |   66 ++++++++++++++++++++++++++++++++++---
 3 files changed, 78 insertions(+), 5 deletions(-)

--- a/include/linux/fs.h~tmpfs-per-superblock-i_ino-support
+++ a/include/linux/fs.h
@@ -2946,6 +2946,21 @@  extern void discard_new_inode(struct ino
 extern unsigned int get_next_ino(void);
 extern void evict_inodes(struct super_block *sb);
 
+/*
+ * Userspace may rely on the inode number being non-zero. For example, glibc
+ * simply ignores files with zero i_ino in unlink() and other places.
+ *
+ * As an additional complication, if userspace was compiled with
+ * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the
+ * lower 32 bits, so we need to check that those aren't zero explicitly. With
+ * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but
+ * better safe than sorry.
+ */
+static inline bool is_zero_ino(ino_t ino)
+{
+	return (u32)ino == 0;
+}
+
 extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
--- a/include/linux/shmem_fs.h~tmpfs-per-superblock-i_ino-support
+++ a/include/linux/shmem_fs.h
@@ -36,6 +36,8 @@  struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	ino_t next_ino;		    /* The next per-sb inode number to use */
+	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
--- a/mm/shmem.c~tmpfs-per-superblock-i_ino-support
+++ a/mm/shmem.c
@@ -260,18 +260,67 @@  bool vma_is_shmem(struct vm_area_struct
 static LIST_HEAD(shmem_swaplist);
 static DEFINE_MUTEX(shmem_swaplist_mutex);
 
-static int shmem_reserve_inode(struct super_block *sb)
+/*
+ * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and
+ * produces a novel ino for the newly allocated inode.
+ *
+ * It may also be called when making a hard link to permit the space needed by
+ * each dentry. However, in that case, no new inode number is needed since that
+ * internally draws from another pool of inode numbers (currently global
+ * get_next_ino()). This case is indicated by passing NULL as inop.
+ */
+#define SHMEM_INO_BATCH 1024
+static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
-	if (sbinfo->max_inodes) {
+	ino_t ino;
+
+	if (!(sb->s_flags & SB_KERNMOUNT)) {
 		spin_lock(&sbinfo->stat_lock);
 		if (!sbinfo->free_inodes) {
 			spin_unlock(&sbinfo->stat_lock);
 			return -ENOSPC;
 		}
 		sbinfo->free_inodes--;
+		if (inop) {
+			ino = sbinfo->next_ino++;
+			if (unlikely(is_zero_ino(ino)))
+				ino = sbinfo->next_ino++;
+			if (unlikely(ino > UINT_MAX)) {
+				/*
+				 * Emulate get_next_ino uint wraparound
+				 */
+				sbinfo->next_ino = 1;
+				ino = sbinfo->next_ino++;
+			}
+			*inop = ino;
+		}
 		spin_unlock(&sbinfo->stat_lock);
+	} else if (inop) {
+		/*
+		 * __shmem_file_setup, one of our callers, is lock-free: it
+		 * doesn't hold stat_lock in shmem_reserve_inode since
+		 * max_inodes is always 0, and is called from potentially
+		 * unknown contexts. As such, use a per-cpu batched allocator
+		 * which doesn't require the per-sb stat_lock unless we are at
+		 * the batch boundary.
+		 */
+		ino_t *next_ino;
+		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
+		ino = *next_ino;
+		if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
+			spin_lock(&sbinfo->stat_lock);
+			ino = sbinfo->next_ino;
+			sbinfo->next_ino += SHMEM_INO_BATCH;
+			spin_unlock(&sbinfo->stat_lock);
+			if (unlikely(is_zero_ino(ino)))
+				ino++;
+		}
+		*inop = ino;
+		*next_ino = ++ino;
+		put_cpu();
 	}
+
 	return 0;
 }
 
@@ -2222,13 +2271,14 @@  static struct inode *shmem_get_inode(str
 	struct inode *inode;
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+	ino_t ino;
 
-	if (shmem_reserve_inode(sb))
+	if (shmem_reserve_inode(sb, &ino))
 		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		inode->i_ino = ino;
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -2932,7 +2982,7 @@  static int shmem_link(struct dentry *old
 	 * first link must skip that, to get the accounting right.
 	 */
 	if (inode->i_nlink) {
-		ret = shmem_reserve_inode(inode->i_sb);
+		ret = shmem_reserve_inode(inode->i_sb, NULL);
 		if (ret)
 			goto out;
 	}
@@ -3584,6 +3634,7 @@  static void shmem_put_super(struct super
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 
+	free_percpu(sbinfo->ino_batch);
 	percpu_counter_destroy(&sbinfo->used_blocks);
 	mpol_put(sbinfo->mpol);
 	kfree(sbinfo);
@@ -3626,6 +3677,11 @@  static int shmem_fill_super(struct super
 #endif
 	sbinfo->max_blocks = ctx->blocks;
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
+	if (sb->s_flags & SB_KERNMOUNT) {
+		sbinfo->ino_batch = alloc_percpu(ino_t);
+		if (!sbinfo->ino_batch)
+			goto failed;
+	}
 	sbinfo->uid = ctx->uid;
 	sbinfo->gid = ctx->gid;
 	sbinfo->mode = ctx->mode;