Message ID | 20230309230545.2930737-7-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New |
Series | tmpfs: add the option to disable swap |
On Thu, 9 Mar 2023, Luis Chamberlain wrote:

> In doing experimentations with shmem having the option to avoid swap
> becomes a useful mechanism. One of the *raves* about brd over shmem is
> you can avoid swap, but that's not really a good reason to use brd if
> we can instead use shmem. Using brd has its own good reasons to exist,
> but just because "tmpfs" doesn't let you do that is not a great reason
> to avoid it if we can easily add support for it.
>
> I don't add support for reconfiguring incompatible options, but if
> we really wanted to we can add support for that.
>
> To avoid swap we use mapping_set_unevictable() upon inode creation,
> and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.

I have one big question here, which betrays my ignorance:
I hope that you or Christian can reassure me on this.

tmpfs has fs_flags FS_USERNS_MOUNT. I know nothing about namespaces,
nothing; but from overhearings, wonder if an ordinary user in a namespace
might be able to mount their own tmpfs with "noswap", and thereby evade
all accounting of the locked memory.

That would be an absolute no-no for this patch; but I assume that even
if so, it can be easily remedied by inserting an appropriate (unknown
to me!) privilege check where the "noswap" option is validated.

I did idly wonder what happens with "noswap" when CONFIG_SWAP is not
enabled, or no swap is enabled; but I think it would be a waste of time
and code to worry over doing anything different from whatever behaviour
falls out trivially.

You'll be sending a manpage update to Alejandro in due course, I think.

Thanks,
Hugh

>
> Acked-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  Documentation/filesystems/tmpfs.rst  |  9 ++++++---
>  Documentation/mm/unevictable-lru.rst |  2 ++
>  include/linux/shmem_fs.h             |  1 +
>  mm/shmem.c                           | 28 +++++++++++++++++++++++++++-
>  4 files changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 1ec9a9f8196b..f18f46be5c0c 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -13,7 +13,8 @@ everything stored therein is lost.
>
>  tmpfs puts everything into the kernel internal caches and grows and
>  shrinks to accommodate the files it contains and is able to swap
> -unneeded pages out to swap space, and supports THP.
> +unneeded pages out to swap space, if swap was enabled for the tmpfs
> +mount. tmpfs also supports THP.
>
>  tmpfs extends ramfs with a few userspace configurable options listed and
>  explained further below, some of which can be reconfigured dynamically on the
> @@ -33,8 +34,8 @@ configured in size at initialization and you cannot dynamically resize them.
>  Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
>  block layer at all.
>
> -Since tmpfs lives completely in the page cache and on swap, all tmpfs
> -pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
> +Since tmpfs lives completely in the page cache and optionally on swap,
> +all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
>  free(1). Notice that these counters also include shared memory
>  (shmem, see ipcs(1)). The most reliable way to get the count is
>  using df(1) and du(1).
> @@ -83,6 +84,8 @@ nr_inodes  The maximum number of inodes for this instance. The default
>             is half of the number of your physical RAM pages, or (on a
>             machine with highmem) the number of lowmem RAM pages,
>             whichever is the lower.
> +noswap     Disables swap. Remounts must respect the original settings.
> +           By default swap is enabled.
>  =========  ============================================================
>
>  These parameters accept a suffix k, m or g for kilo, mega and giga and
> diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
> index 92ac5dca420c..d5ac8511eb67 100644
> --- a/Documentation/mm/unevictable-lru.rst
> +++ b/Documentation/mm/unevictable-lru.rst
> @@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
>
>   * Those owned by ramfs.
>
> + * Those owned by tmpfs with the noswap mount option.
> +
>   * Those mapped into SHM_LOCK'd shared memory regions.
>
>   * Those mapped into VM_LOCKED [mlock()ed] VMAs.
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 103d1000a5a2..50bf82b36995 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -45,6 +45,7 @@ struct shmem_sb_info {
>  	kuid_t uid;		    /* Mount uid for root directory */
>  	kgid_t gid;		    /* Mount gid for root directory */
>  	bool full_inums;	    /* If i_ino should be uint or ino_t */
> +	bool noswap;		    /* ignores VM reclaim / swap requests */
>  	ino_t next_ino;		    /* The next per-sb inode number to use */
>  	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
>  	struct mempolicy *mpol;     /* default memory policy for mappings */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index dfd995da77b4..2e122c72b375 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -119,10 +119,12 @@ struct shmem_options {
>  	bool full_inums;
>  	int huge;
>  	int seen;
> +	bool noswap;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
>  #define SHMEM_SEEN_INUMS 8
> +#define SHMEM_SEEN_NOSWAP 16
>  };
>
>  #ifdef CONFIG_TMPFS
> @@ -1337,6 +1339,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	struct address_space *mapping = folio->mapping;
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  	swp_entry_t swap;
>  	pgoff_t index;
>
> @@ -1350,7 +1353,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	if (WARN_ON_ONCE(!wbc->for_reclaim))
>  		goto redirty;
>
> -	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
> +	if (WARN_ON_ONCE((info->flags & VM_LOCKED) || sbinfo->noswap))
>  		goto redirty;
>
>  	if (!total_swap_pages)
> @@ -2487,6 +2490,8 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block
>  		shmem_set_inode_flags(inode, info->fsflags);
>  		INIT_LIST_HEAD(&info->shrinklist);
>  		INIT_LIST_HEAD(&info->swaplist);
> +		if (sbinfo->noswap)
> +			mapping_set_unevictable(inode->i_mapping);
>  		simple_xattrs_init(&info->xattrs);
>  		cache_no_acl(inode);
>  		mapping_set_large_folios(inode->i_mapping);
> @@ -3574,6 +3579,7 @@ enum shmem_param {
>  	Opt_uid,
>  	Opt_inode32,
>  	Opt_inode64,
> +	Opt_noswap,
>  };
>
>  static const struct constant_table shmem_param_enums_huge[] = {
> @@ -3595,6 +3601,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
>  	fsparam_u32   ("uid",		Opt_uid),
>  	fsparam_flag  ("inode32",	Opt_inode32),
>  	fsparam_flag  ("inode64",	Opt_inode64),
> +	fsparam_flag  ("noswap",	Opt_noswap),
>  	{}
>  };
>
> @@ -3678,6 +3685,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  		ctx->full_inums = true;
>  		ctx->seen |= SHMEM_SEEN_INUMS;
>  		break;
> +	case Opt_noswap:
> +		ctx->noswap = true;
> +		ctx->seen |= SHMEM_SEEN_NOSWAP;
> +		break;
>  	}
>  	return 0;
>
> @@ -3776,6 +3787,14 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		err = "Current inum too high to switch to 32-bit inums";
>  		goto out;
>  	}
> +	if ((ctx->seen & SHMEM_SEEN_NOSWAP) && ctx->noswap && !sbinfo->noswap) {
> +		err = "Cannot disable swap on remount";
> +		goto out;
> +	}
> +	if (!(ctx->seen & SHMEM_SEEN_NOSWAP) && !ctx->noswap && sbinfo->noswap) {
> +		err = "Cannot enable swap on remount if it was disabled on first mount";
> +		goto out;
> +	}
>
>  	if (ctx->seen & SHMEM_SEEN_HUGE)
>  		sbinfo->huge = ctx->huge;
> @@ -3796,6 +3815,10 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		sbinfo->mpol = ctx->mpol;	/* transfers initial ref */
>  		ctx->mpol = NULL;
>  	}
> +
> +	if (ctx->noswap)
> +		sbinfo->noswap = true;
> +
>  	raw_spin_unlock(&sbinfo->stat_lock);
>  	mpol_put(mpol);
>  	return 0;
> @@ -3850,6 +3873,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>  		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
>  #endif
>  	shmem_show_mpol(seq, sbinfo->mpol);
> +	if (sbinfo->noswap)
> +		seq_printf(seq, ",noswap");
>  	return 0;
>  }
>
> @@ -3893,6 +3918,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  			ctx->inodes = shmem_default_max_inodes();
>  		if (!(ctx->seen & SHMEM_SEEN_INUMS))
>  			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
> +		sbinfo->noswap = ctx->noswap;
>  	} else {
>  		sb->s_flags |= SB_NOUSER;
>  	}
> --
> 2.39.1
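For readers who want to see the effect of the shmem_show_options() hunk from userspace: the patch appends ",noswap" to the mount's option string, which should then be visible in the options field of /proc/mounts. The following is only a sketch of matching that token exactly in a comma-separated option string. The helper name has_mount_option() is hypothetical and not part of the patch or of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the comma-separated option string
 * "opts" contains the token "name" as a whole option (so "noswapx" does
 * not count as "noswap").
 */
bool has_mount_option(const char *opts, const char *name)
{
	size_t len;

	assert(opts != NULL && name != NULL);
	len = strlen(name);
	if (len == 0)
		return false;

	while (*opts) {
		const char *end = strchr(opts, ',');
		size_t tok = end ? (size_t)(end - opts) : strlen(opts);

		/* Compare only when the token has exactly the wanted length. */
		if (tok == len && strncmp(opts, name, len) == 0)
			return true;
		if (!end)
			break;
		opts = end + 1;
	}
	return false;
}
```

A caller would pass the fourth field of a /proc/mounts line, for example has_mount_option("rw,relatime,noswap", "noswap").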
On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
> On Thu, 9 Mar 2023, Luis Chamberlain wrote:
>
> > In doing experimentations with shmem having the option to avoid swap
> > becomes a useful mechanism. One of the *raves* about brd over shmem is
> > you can avoid swap, but that's not really a good reason to use brd if
> > we can instead use shmem. Using brd has its own good reasons to exist,
> > but just because "tmpfs" doesn't let you do that is not a great reason
> > to avoid it if we can easily add support for it.
> >
> > I don't add support for reconfiguring incompatible options, but if
> > we really wanted to we can add support for that.
> >
> > To avoid swap we use mapping_set_unevictable() upon inode creation,
> > and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.
>
> I have one big question here, which betrays my ignorance:
> I hope that you or Christian can reassure me on this.
>
> tmpfs has fs_flags FS_USERNS_MOUNT. I know nothing about namespaces,
> nothing; but from overhearings, wonder if an ordinary user in a namespace
> might be able to mount their own tmpfs with "noswap", and thereby evade
> all accounting of the locked memory.
>
> That would be an absolute no-no for this patch; but I assume that even
> if so, it can be easily remedied by inserting an appropriate (unknown
> to me!) privilege check where the "noswap" option is validated.

Oh, good catch. Thanks! So you would just need something like:

diff --git a/mm/shmem.c b/mm/shmem.c
index 787e83791eb5..21ce9b26bb4d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3571,6 +3571,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
 	case Opt_noswap:
+		if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) {
+			return invalfc(fc,
+				       "Turning off swap in unprivileged tmpfs mounts unsupported");
+		}
 		ctx->noswap = true;
 		ctx->seen |= SHMEM_SEEN_NOSWAP;
 		break;

The fc->user_ns is the userns that the tmpfs mount will be mounted in, i.e.,
fc->user_ns will become sb->s_user_ns if FS_USERNS_MOUNT is raised. So with the
check above we require that the tmpfs instance must ultimately belong to the
initial userns and that the caller has CAP_SYS_ADMIN in the initial userns
(CAP_SYS_ADMIN guards swapon and swapoff) according to capabilities(7).
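As a rough userspace illustration of the check proposed above (this is not kernel code): the two facts the kernel would read from fc->user_ns and capable(CAP_SYS_ADMIN) are modelled here as plain booleans, so the decision logic can be read and exercised in isolation. The names struct mount_request and noswap_check() are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two facts the proposed check looks at. */
struct mount_request {
	bool in_init_userns;    /* models fc->user_ns == &init_user_ns */
	bool has_cap_sys_admin; /* models capable(CAP_SYS_ADMIN) */
};

/*
 * Returns NULL when "noswap" may be honoured, otherwise the error string
 * used in the proposed hunk. Both conditions must hold, mirroring the
 * "(not init userns) || (no CAP_SYS_ADMIN)" rejection in the diff.
 */
const char *noswap_check(const struct mount_request *req)
{
	assert(req != NULL);

	if (!req->in_init_userns || !req->has_cap_sys_admin)
		return "Turning off swap in unprivileged tmpfs mounts unsupported";
	return NULL;
}
```

Under this model only a request that is both in the initial user namespace and holds CAP_SYS_ADMIN passes; a tmpfs mounted from inside another user namespace is refused "noswap" regardless of the capability flag.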
On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
> You'll be sending a manpage update to Alejandro in due course, I think.
Sure thing! Just need a git tree. I can send the updates as we reach
a consensus on where to store / share huge page shmem updates.
Luis
On 4/18/23 14:22, Luis Chamberlain wrote:
> On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
>> You'll be sending a manpage update to Alejandro in due course, I think.
>
> Sure thing! Just need a git tree. I can send the updates as we reach
> a consensus on where to store / share huge page shmem updates.
>
>   Luis

From the latest man-page announcement:

man-pages-6.04 - manual pages for GNU/Linux

The release tarball is already available at <kernel.org>.

Tarball download:
<https://mirrors.edge.kernel.org/pub/linux/docs/man-pages/>
Git repository:
<https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>
On Tue, Apr 18, 2023 at 09:38:10AM +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 10:50:59PM -0700, Hugh Dickins wrote:
> > On Thu, 9 Mar 2023, Luis Chamberlain wrote:
> >
> > > In doing experimentations with shmem having the option to avoid swap
> > > becomes a useful mechanism. One of the *raves* about brd over shmem is
> > > you can avoid swap, but that's not really a good reason to use brd if
> > > we can instead use shmem. Using brd has its own good reasons to exist,
> > > but just because "tmpfs" doesn't let you do that is not a great reason
> > > to avoid it if we can easily add support for it.
> > >
> > > I don't add support for reconfiguring incompatible options, but if
> > > we really wanted to we can add support for that.
> > >
> > > To avoid swap we use mapping_set_unevictable() upon inode creation,
> > > and put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.
> >
> > I have one big question here, which betrays my ignorance:
> > I hope that you or Christian can reassure me on this.
> >
> > tmpfs has fs_flags FS_USERNS_MOUNT. I know nothing about namespaces,
> > nothing; but from overhearings, wonder if an ordinary user in a namespace
> > might be able to mount their own tmpfs with "noswap", and thereby evade
> > all accounting of the locked memory.
> >
> > That would be an absolute no-no for this patch; but I assume that even
> > if so, it can be easily remedied by inserting an appropriate (unknown
> > to me!) privilege check where the "noswap" option is validated.
>
> Oh, good catch. Thanks! So you would just need something like:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 787e83791eb5..21ce9b26bb4d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3571,6 +3571,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  		ctx->seen |= SHMEM_SEEN_INUMS;
>  		break;
>  	case Opt_noswap:
> +		if ((fc->user_ns != &init_user_ns) || !capable(CAP_SYS_ADMIN)) {
> +			return invalfc(fc,
> +				       "Turning off swap in unprivileged tmpfs mounts unsupported");
> +		}
>  		ctx->noswap = true;
>  		ctx->seen |= SHMEM_SEEN_NOSWAP;
>  		break;
>
> The fc->user_ns is the userns that the tmpfs mount will be mounted in, i.e.,
> fc->user_ns will become sb->s_user_ns if FS_USERNS_MOUNT is raised. So with the
> check above we require that the tmpfs instance must ultimately belong to the
> initial userns and that the caller has CAP_SYS_ADMIN in the initial userns
> (CAP_SYS_ADMIN guards swapon and swapoff) according to capabilities(7).

Christian, mind sending this as a fix?

  Luis
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 1ec9a9f8196b..f18f46be5c0c 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -13,7 +13,8 @@ everything stored therein is lost.
 
 tmpfs puts everything into the kernel internal caches and grows and
 shrinks to accommodate the files it contains and is able to swap
-unneeded pages out to swap space, and supports THP.
+unneeded pages out to swap space, if swap was enabled for the tmpfs
+mount. tmpfs also supports THP.
 
 tmpfs extends ramfs with a few userspace configurable options listed and
 explained further below, some of which can be reconfigured dynamically on the
@@ -33,8 +34,8 @@ configured in size at initialization and you cannot dynamically resize them.
 Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
 block layer at all.
 
-Since tmpfs lives completely in the page cache and on swap, all tmpfs
-pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
+Since tmpfs lives completely in the page cache and optionally on swap,
+all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
 free(1). Notice that these counters also include shared memory
 (shmem, see ipcs(1)). The most reliable way to get the count is
 using df(1) and du(1).
@@ -83,6 +84,8 @@ nr_inodes  The maximum number of inodes for this instance. The default
            is half of the number of your physical RAM pages, or (on a
            machine with highmem) the number of lowmem RAM pages,
            whichever is the lower.
+noswap     Disables swap. Remounts must respect the original settings.
+           By default swap is enabled.
 =========  ============================================================
 
 These parameters accept a suffix k, m or g for kilo, mega and giga and
diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
index 92ac5dca420c..d5ac8511eb67 100644
--- a/Documentation/mm/unevictable-lru.rst
+++ b/Documentation/mm/unevictable-lru.rst
@@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
 
  * Those owned by ramfs.
 
+ * Those owned by tmpfs with the noswap mount option.
+
  * Those mapped into SHM_LOCK'd shared memory regions.
 
  * Those mapped into VM_LOCKED [mlock()ed] VMAs.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 103d1000a5a2..50bf82b36995 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -45,6 +45,7 @@ struct shmem_sb_info {
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
 	bool full_inums;	    /* If i_ino should be uint or ino_t */
+	bool noswap;		    /* ignores VM reclaim / swap requests */
 	ino_t next_ino;		    /* The next per-sb inode number to use */
 	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
diff --git a/mm/shmem.c b/mm/shmem.c
index dfd995da77b4..2e122c72b375 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -119,10 +119,12 @@ struct shmem_options {
 	bool full_inums;
 	int huge;
 	int seen;
+	bool noswap;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
+#define SHMEM_SEEN_NOSWAP 16
 };
 
 #ifdef CONFIG_TMPFS
@@ -1337,6 +1339,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	struct address_space *mapping = folio->mapping;
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	swp_entry_t swap;
 	pgoff_t index;
 
@@ -1350,7 +1353,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (WARN_ON_ONCE(!wbc->for_reclaim))
 		goto redirty;
 
-	if (WARN_ON_ONCE(info->flags & VM_LOCKED))
+	if (WARN_ON_ONCE((info->flags & VM_LOCKED) || sbinfo->noswap))
 		goto redirty;
 
 	if (!total_swap_pages)
@@ -2487,6 +2490,8 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block
 		shmem_set_inode_flags(inode, info->fsflags);
 		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
+		if (sbinfo->noswap)
+			mapping_set_unevictable(inode->i_mapping);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
 		mapping_set_large_folios(inode->i_mapping);
@@ -3574,6 +3579,7 @@ enum shmem_param {
 	Opt_uid,
 	Opt_inode32,
 	Opt_inode64,
+	Opt_noswap,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3595,6 +3601,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_u32   ("uid",		Opt_uid),
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
+	fsparam_flag  ("noswap",	Opt_noswap),
 	{}
 };
 
@@ -3678,6 +3685,10 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->full_inums = true;
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
+	case Opt_noswap:
+		ctx->noswap = true;
+		ctx->seen |= SHMEM_SEEN_NOSWAP;
+		break;
 	}
 	return 0;
 
@@ -3776,6 +3787,14 @@ static int shmem_reconfigure(struct fs_context *fc)
 		err = "Current inum too high to switch to 32-bit inums";
 		goto out;
 	}
+	if ((ctx->seen & SHMEM_SEEN_NOSWAP) && ctx->noswap && !sbinfo->noswap) {
+		err = "Cannot disable swap on remount";
+		goto out;
+	}
+	if (!(ctx->seen & SHMEM_SEEN_NOSWAP) && !ctx->noswap && sbinfo->noswap) {
+		err = "Cannot enable swap on remount if it was disabled on first mount";
+		goto out;
+	}
 
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
@@ -3796,6 +3815,10 @@ static int shmem_reconfigure(struct fs_context *fc)
 		sbinfo->mpol = ctx->mpol;	/* transfers initial ref */
 		ctx->mpol = NULL;
 	}
+
+	if (ctx->noswap)
+		sbinfo->noswap = true;
+
 	raw_spin_unlock(&sbinfo->stat_lock);
 	mpol_put(mpol);
 	return 0;
@@ -3850,6 +3873,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
+	if (sbinfo->noswap)
+		seq_printf(seq, ",noswap");
 	return 0;
 }
 
@@ -3893,6 +3918,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 			ctx->inodes = shmem_default_max_inodes();
 		if (!(ctx->seen & SHMEM_SEEN_INUMS))
 			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
+		sbinfo->noswap = ctx->noswap;
 	} else {
 		sb->s_flags |= SB_NOUSER;
 	}
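To make the remount rule in the shmem_reconfigure() hunk easier to follow, here is a small userspace model of just those two checks. It is a sketch, not kernel code: struct opts_model and struct sb_model are hypothetical, trimmed-down stand-ins for shmem_options and shmem_sb_info, and SEEN_NOSWAP mirrors the value of SHMEM_SEEN_NOSWAP from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SEEN_NOSWAP 16 /* mirrors SHMEM_SEEN_NOSWAP from the patch */

/* Hypothetical, trimmed-down models of shmem_options and shmem_sb_info. */
struct opts_model {
	int seen;    /* bitmask of options present on this (re)mount */
	bool noswap; /* set when "noswap" was parsed */
};

struct sb_model {
	bool noswap; /* setting recorded at first mount */
};

/*
 * Applies the two remount checks from the shmem_reconfigure() hunk.
 * Returns NULL if the remount is accepted, else the patch's error string.
 */
const char *noswap_remount_check(const struct opts_model *ctx,
				 const struct sb_model *sb)
{
	assert(ctx != NULL && sb != NULL);

	/* "noswap" requested now, but the first mount allowed swap. */
	if ((ctx->seen & SEEN_NOSWAP) && ctx->noswap && !sb->noswap)
		return "Cannot disable swap on remount";

	/* "noswap" omitted now, but the first mount disabled swap. */
	if (!(ctx->seen & SEEN_NOSWAP) && !ctx->noswap && sb->noswap)
		return "Cannot enable swap on remount if it was disabled on first mount";

	return NULL;
}
```

In this model a remount is accepted only when its "noswap" setting matches the one given at first mount, which is what the tmpfs.rst hunk means by "Remounts must respect the original settings."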