shmem: add support for user extended attributes

Message ID	20230720065430.2178136-1-ovt@google.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Date: Thu, 20 Jul 2023 06:54:27 +0000 Mime-Version: 1.0 Message-ID: <20230720065430.2178136-1-ovt@google.com> Subject: [PATCH] shmem: add support for user extended attributes From: Oleksandr Tymoshenko <ovt@google.com> To: Jonathan Corbet <corbet@lwn.net>, Hugh Dickins <hughd@google.com>, Andrew Morton <akpm@linux-foundation.org> Cc: ovt@google.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	shmem: add support for user extended attributes \| expand shmem: add support for user extended attributes

Oleksandr Tymoshenko July 20, 2023, 6:54 a.m. UTC

User extended attributes are not enabled in tmpfs because
the size of the value is not limited and the memory allocated
for it is not counted against any limit. Malicious
non-privileged user can exhaust kernel memory by creating
user.* extended attribute with very large value.

There are still situations when enabling suport for extended
user attributes on tmpfs is required and the attack vector
is not applicable, for instance batch jobs with trusted binaries.

This patch introduces two mount options to enable/disable
support for user.* extended attributes on tmpfs:

user_xattr    enable support for user extended aatributes
nouser_xattr  disable support for user extended attributes

The default behavior of the filesystem is not changed.

Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>
---
 Documentation/filesystems/tmpfs.rst | 12 ++++++++
 include/linux/shmem_fs.h            |  1 +
 mm/shmem.c                          | 45 +++++++++++++++++++++++++++++
 3 files changed, 58 insertions(+)

Hugh Dickins July 20, 2023, 4:56 p.m. UTC | #1

On Thu, 20 Jul 2023, Oleksandr Tymoshenko wrote:

> User extended attributes are not enabled in tmpfs because
> the size of the value is not limited and the memory allocated
> for it is not counted against any limit. Malicious
> non-privileged user can exhaust kernel memory by creating
> user.* extended attribute with very large value.
> 
> There are still situations when enabling suport for extended
> user attributes on tmpfs is required and the attack vector
> is not applicable, for instance batch jobs with trusted binaries.
> 
> This patch introduces two mount options to enable/disable
> support for user.* extended attributes on tmpfs:
> 
> user_xattr    enable support for user extended aatributes
> nouser_xattr  disable support for user extended attributes
> 
> The default behavior of the filesystem is not changed.
> 
> Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>

Thanks, but no.

This is not something we want mount options for:
we just want to limit the memory usage of tmpfs user xattrs.

I've had the patch to do that limiting (taking it out of the inode
space already limited by nr_inodes) in my test tree for 2.5 years now:
waiting to reach the top of the heap to pull together and submit.

Your sending this patch does help to raise the priority for my
sending that patch: thank you; but I cannot promise when that will be.

(And the way mm/shmem.c is conflicted between vfs and mm trees
is rather discouraging development there at the moment: I'm hoping
it can be mostly wrested back into the mm tree in the next cycle.)

Hugh

> ---
>  Documentation/filesystems/tmpfs.rst | 12 ++++++++
>  include/linux/shmem_fs.h            |  1 +
>  mm/shmem.c                          | 45 +++++++++++++++++++++++++++++
>  3 files changed, 58 insertions(+)
> 
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index f18f46be5c0c..5700ba72d095 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -215,6 +215,16 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
>  RAM/SWAP in 10240 inodes and it is only accessible by root.
>  
>  
> +tmpfs, when compiled with CONFIG_TMPFS_XATTR, does not support
> +Extended User Attributes for security reasons. The support can be
> +enabled/disabled by two mount options:
> +
> +============  ===========================================
> +user_xattr    Enable support for Extended User Attributes
> +nouser_xattr  Disable upport for Extended User Attributes
> +============  ===========================================
> +
> +
>  :Author:
>     Christoph Rohland <cr@sap.com>, 1.12.01
>  :Updated:
> @@ -223,3 +233,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
>     KOSAKI Motohiro, 16 Mar 2010
>  :Updated:
>     Chris Down, 13 July 2020
> +:Updated:
> +   Oleksandr Tymoshenko, 19 July 2023
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 9029abd29b1c..f06d18b9041c 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -53,6 +53,7 @@ struct shmem_sb_info {
>  	spinlock_t shrinklist_lock;   /* Protects shrinklist */
>  	struct list_head shrinklist;  /* List of shinkable inodes */
>  	unsigned long shrinklist_len; /* Length of shrinklist */
> +	bool user_xattr;	      /* user.* xattrs are allowed */
>  };
>  
>  static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2f2e0e618072..4f7d46d65494 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -85,6 +85,7 @@ static struct vfsmount *shm_mnt;
>  
>  #define BLOCKS_PER_PAGE  (PAGE_SIZE/512)
>  #define VM_ACCT(size)    (PAGE_ALIGN(size) >> PAGE_SHIFT)
> +#define TMPFS_USER_XATTR_INDEX 1
>  
>  /* Pretend that each entry is of this size in directory's i_size */
>  #define BOGO_DIRENT_SIZE 20
> @@ -116,11 +117,13 @@ struct shmem_options {
>  	int huge;
>  	int seen;
>  	bool noswap;
> +	bool user_xattr;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
>  #define SHMEM_SEEN_INUMS 8
>  #define SHMEM_SEEN_NOSWAP 16
> +#define SHMEM_SEEN_USER_XATTR 32
>  };
>  
>  #ifdef CONFIG_TMPFS
> @@ -3447,6 +3450,16 @@ static int shmem_xattr_handler_get(const struct xattr_handler *handler,
>  				   const char *name, void *buffer, size_t size)
>  {
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> +
> +	switch (handler->flags) {
> +	case TMPFS_USER_XATTR_INDEX:
> +		if (!sbinfo->user_xattr)
> +			return -EOPNOTSUPP;
> +		break;
> +	default:
> +		break;
> +	}
>  
>  	name = xattr_full_name(handler, name);
>  	return simple_xattr_get(&info->xattrs, name, buffer, size);
> @@ -3459,8 +3472,18 @@ static int shmem_xattr_handler_set(const struct xattr_handler *handler,
>  				   size_t size, int flags)
>  {
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  	int err;
>  
> +	switch (handler->flags) {
> +	case TMPFS_USER_XATTR_INDEX:
> +		if (!sbinfo->user_xattr)
> +			return -EOPNOTSUPP;
> +		break;
> +	default:
> +		break;
> +	}
> +
>  	name = xattr_full_name(handler, name);
>  	err = simple_xattr_set(&info->xattrs, name, value, size, flags, NULL);
>  	if (!err) {
> @@ -3482,9 +3505,17 @@ static const struct xattr_handler shmem_trusted_xattr_handler = {
>  	.set = shmem_xattr_handler_set,
>  };
>  
> +static const struct xattr_handler shmem_user_xattr_handler = {
> +	.prefix = XATTR_USER_PREFIX,
> +	.flags = TMPFS_USER_XATTR_INDEX,
> +	.get = shmem_xattr_handler_get,
> +	.set = shmem_xattr_handler_set,
> +};
> +
>  static const struct xattr_handler *shmem_xattr_handlers[] = {
>  	&shmem_security_xattr_handler,
>  	&shmem_trusted_xattr_handler,
> +	&shmem_user_xattr_handler,
>  	NULL
>  };
>  
> @@ -3604,6 +3635,8 @@ enum shmem_param {
>  	Opt_inode32,
>  	Opt_inode64,
>  	Opt_noswap,
> +	Opt_user_xattr,
> +	Opt_nouser_xattr,
>  };
>  
>  static const struct constant_table shmem_param_enums_huge[] = {
> @@ -3626,6 +3659,8 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
>  	fsparam_flag  ("inode32",	Opt_inode32),
>  	fsparam_flag  ("inode64",	Opt_inode64),
>  	fsparam_flag  ("noswap",	Opt_noswap),
> +	fsparam_flag  ("user_xattr",	Opt_user_xattr),
> +	fsparam_flag  ("nouser_xattr",	Opt_nouser_xattr),
>  	{}
>  };
>  
> @@ -3717,6 +3752,14 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  		ctx->noswap = true;
>  		ctx->seen |= SHMEM_SEEN_NOSWAP;
>  		break;
> +	case Opt_user_xattr:
> +		ctx->user_xattr = true;
> +		ctx->seen |= SHMEM_SEEN_USER_XATTR;
> +		break;
> +	case Opt_nouser_xattr:
> +		ctx->user_xattr = false;
> +		ctx->seen |= SHMEM_SEEN_USER_XATTR;
> +		break;
>  	}
>  	return 0;
>  
> @@ -3834,6 +3877,8 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		sbinfo->max_inodes  = ctx->inodes;
>  		sbinfo->free_inodes = ctx->inodes - inodes;
>  	}
> +	if (ctx->seen & SHMEM_SEEN_USER_XATTR)
> +		sbinfo->user_xattr = ctx->user_xattr;
>  
>  	/*
>  	 * Preserve previous mempolicy unless mpol remount option was specified.
> -- 
> 2.41.0.255.g8b1d071c50-goog
> 
>

Oleksandr Tymoshenko July 20, 2023, 5:09 p.m. UTC | #2

Hi Hugh,

Could you share that patch?

On Thu, Jul 20, 2023 at 9:57 AM Hugh Dickins <hughd@google.com> wrote:

> On Thu, 20 Jul 2023, Oleksandr Tymoshenko wrote:
>
> > User extended attributes are not enabled in tmpfs because
> > the size of the value is not limited and the memory allocated
> > for it is not counted against any limit. Malicious
> > non-privileged user can exhaust kernel memory by creating
> > user.* extended attribute with very large value.
> >
> > There are still situations when enabling suport for extended
> > user attributes on tmpfs is required and the attack vector
> > is not applicable, for instance batch jobs with trusted binaries.
> >
> > This patch introduces two mount options to enable/disable
> > support for user.* extended attributes on tmpfs:
> >
> > user_xattr    enable support for user extended aatributes
> > nouser_xattr  disable support for user extended attributes
> >
> > The default behavior of the filesystem is not changed.
> >
> > Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>
>
> Thanks, but no.
>
> This is not something we want mount options for:
> we just want to limit the memory usage of tmpfs user xattrs.
>
> I've had the patch to do that limiting (taking it out of the inode
> space already limited by nr_inodes) in my test tree for 2.5 years now:
> waiting to reach the top of the heap to pull together and submit.
>
> Your sending this patch does help to raise the priority for my
> sending that patch: thank you; but I cannot promise when that will be.
>
> (And the way mm/shmem.c is conflicted between vfs and mm trees
> is rather discouraging development there at the moment: I'm hoping
> it can be mostly wrested back into the mm tree in the next cycle.)
>
> Hugh
>
> > ---
> >  Documentation/filesystems/tmpfs.rst | 12 ++++++++
> >  include/linux/shmem_fs.h            |  1 +
> >  mm/shmem.c                          | 45 +++++++++++++++++++++++++++++
> >  3 files changed, 58 insertions(+)
> >
> > diff --git a/Documentation/filesystems/tmpfs.rst
> b/Documentation/filesystems/tmpfs.rst
> > index f18f46be5c0c..5700ba72d095 100644
> > --- a/Documentation/filesystems/tmpfs.rst
> > +++ b/Documentation/filesystems/tmpfs.rst
> > @@ -215,6 +215,16 @@ will give you tmpfs instance on /mytmpfs which can
> allocate 10GB
> >  RAM/SWAP in 10240 inodes and it is only accessible by root.
> >
> >
> > +tmpfs, when compiled with CONFIG_TMPFS_XATTR, does not support
> > +Extended User Attributes for security reasons. The support can be
> > +enabled/disabled by two mount options:
> > +
> > +============  ===========================================
> > +user_xattr    Enable support for Extended User Attributes
> > +nouser_xattr  Disable upport for Extended User Attributes
> > +============  ===========================================
> > +
> > +
> >  :Author:
> >     Christoph Rohland <cr@sap.com>, 1.12.01
> >  :Updated:
> > @@ -223,3 +233,5 @@ RAM/SWAP in 10240 inodes and it is only accessible
> by root.
> >     KOSAKI Motohiro, 16 Mar 2010
> >  :Updated:
> >     Chris Down, 13 July 2020
> > +:Updated:
> > +   Oleksandr Tymoshenko, 19 July 2023
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index 9029abd29b1c..f06d18b9041c 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -53,6 +53,7 @@ struct shmem_sb_info {
> >       spinlock_t shrinklist_lock;   /* Protects shrinklist */
> >       struct list_head shrinklist;  /* List of shinkable inodes */
> >       unsigned long shrinklist_len; /* Length of shrinklist */
> > +     bool user_xattr;              /* user.* xattrs are allowed */
> >  };
> >
> >  static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 2f2e0e618072..4f7d46d65494 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -85,6 +85,7 @@ static struct vfsmount *shm_mnt;
> >
> >  #define BLOCKS_PER_PAGE  (PAGE_SIZE/512)
> >  #define VM_ACCT(size)    (PAGE_ALIGN(size) >> PAGE_SHIFT)
> > +#define TMPFS_USER_XATTR_INDEX 1
> >
> >  /* Pretend that each entry is of this size in directory's i_size */
> >  #define BOGO_DIRENT_SIZE 20
> > @@ -116,11 +117,13 @@ struct shmem_options {
> >       int huge;
> >       int seen;
> >       bool noswap;
> > +     bool user_xattr;
> >  #define SHMEM_SEEN_BLOCKS 1
> >  #define SHMEM_SEEN_INODES 2
> >  #define SHMEM_SEEN_HUGE 4
> >  #define SHMEM_SEEN_INUMS 8
> >  #define SHMEM_SEEN_NOSWAP 16
> > +#define SHMEM_SEEN_USER_XATTR 32
> >  };
> >
> >  #ifdef CONFIG_TMPFS
> > @@ -3447,6 +3450,16 @@ static int shmem_xattr_handler_get(const struct
> xattr_handler *handler,
> >                                  const char *name, void *buffer, size_t
> size)
> >  {
> >       struct shmem_inode_info *info = SHMEM_I(inode);
> > +     struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > +
> > +     switch (handler->flags) {
> > +     case TMPFS_USER_XATTR_INDEX:
> > +             if (!sbinfo->user_xattr)
> > +                     return -EOPNOTSUPP;
> > +             break;
> > +     default:
> > +             break;
> > +     }
> >
> >       name = xattr_full_name(handler, name);
> >       return simple_xattr_get(&info->xattrs, name, buffer, size);
> > @@ -3459,8 +3472,18 @@ static int shmem_xattr_handler_set(const struct
> xattr_handler *handler,
> >                                  size_t size, int flags)
> >  {
> >       struct shmem_inode_info *info = SHMEM_I(inode);
> > +     struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> >       int err;
> >
> > +     switch (handler->flags) {
> > +     case TMPFS_USER_XATTR_INDEX:
> > +             if (!sbinfo->user_xattr)
> > +                     return -EOPNOTSUPP;
> > +             break;
> > +     default:
> > +             break;
> > +     }
> > +
> >       name = xattr_full_name(handler, name);
> >       err = simple_xattr_set(&info->xattrs, name, value, size, flags,
> NULL);
> >       if (!err) {
> > @@ -3482,9 +3505,17 @@ static const struct xattr_handler
> shmem_trusted_xattr_handler = {
> >       .set = shmem_xattr_handler_set,
> >  };
> >
> > +static const struct xattr_handler shmem_user_xattr_handler = {
> > +     .prefix = XATTR_USER_PREFIX,
> > +     .flags = TMPFS_USER_XATTR_INDEX,
> > +     .get = shmem_xattr_handler_get,
> > +     .set = shmem_xattr_handler_set,
> > +};
> > +
> >  static const struct xattr_handler *shmem_xattr_handlers[] = {
> >       &shmem_security_xattr_handler,
> >       &shmem_trusted_xattr_handler,
> > +     &shmem_user_xattr_handler,
> >       NULL
> >  };
> >
> > @@ -3604,6 +3635,8 @@ enum shmem_param {
> >       Opt_inode32,
> >       Opt_inode64,
> >       Opt_noswap,
> > +     Opt_user_xattr,
> > +     Opt_nouser_xattr,
> >  };
> >
> >  static const struct constant_table shmem_param_enums_huge[] = {
> > @@ -3626,6 +3659,8 @@ const struct fs_parameter_spec
> shmem_fs_parameters[] = {
> >       fsparam_flag  ("inode32",       Opt_inode32),
> >       fsparam_flag  ("inode64",       Opt_inode64),
> >       fsparam_flag  ("noswap",        Opt_noswap),
> > +     fsparam_flag  ("user_xattr",    Opt_user_xattr),
> > +     fsparam_flag  ("nouser_xattr",  Opt_nouser_xattr),
> >       {}
> >  };
> >
> > @@ -3717,6 +3752,14 @@ static int shmem_parse_one(struct fs_context *fc,
> struct fs_parameter *param)
> >               ctx->noswap = true;
> >               ctx->seen |= SHMEM_SEEN_NOSWAP;
> >               break;
> > +     case Opt_user_xattr:
> > +             ctx->user_xattr = true;
> > +             ctx->seen |= SHMEM_SEEN_USER_XATTR;
> > +             break;
> > +     case Opt_nouser_xattr:
> > +             ctx->user_xattr = false;
> > +             ctx->seen |= SHMEM_SEEN_USER_XATTR;
> > +             break;
> >       }
> >       return 0;
> >
> > @@ -3834,6 +3877,8 @@ static int shmem_reconfigure(struct fs_context *fc)
> >               sbinfo->max_inodes  = ctx->inodes;
> >               sbinfo->free_inodes = ctx->inodes - inodes;
> >       }
> > +     if (ctx->seen & SHMEM_SEEN_USER_XATTR)
> > +             sbinfo->user_xattr = ctx->user_xattr;
> >
> >       /*
> >        * Preserve previous mempolicy unless mpol remount option was
> specified.
> > --
> > 2.41.0.255.g8b1d071c50-goog
> >
> >
>

Hugh Dickins July 20, 2023, 5:42 p.m. UTC | #3

On Thu, 20 Jul 2023, Oleksandr Tymoshenko wrote:

> Hi Hugh,
> 
> Could you share that patch?

When I'm ready to, of course.

Hugh

Franklin “Snaipe” Mathieu Aug. 14, 2023, 8:23 a.m. UTC | #4

> Your sending this patch does help to raise the priority for my
> sending that patch: thank you; but I cannot promise when that will be.

Hey Hugh,

Just as an additional data point, if it helps with priority :)

The lack of support of user xattrs on tmpfs our last remaining blocker for
using unprivileged overlayfs mounts that use a tmpfs for the upper dir & work
dir. Not that it isn't possible to use unprivileged overlayfs mounts in general,
but not having this option means that use-cases for discardable upper layer
changes are harder to clean up correctly when not on an tmpfs mount whose
lifetime is bound to a mount namespace.

I don't think there's any rush; we can live with rmdir failing with EIO for now,
but it'd be great to see this fixed rather than having to implement expensive
cleanup routines that have to remove the upper+work dirs recursively with
CAP_DAC_OVERRIDE.

Cheers,

Hugh Dickins Aug. 15, 2023, 3:51 a.m. UTC | #5

On Mon, 14 Aug 2023, Snaipe wrote:

> > Your sending this patch does help to raise the priority for my
> > sending that patch: thank you; but I cannot promise when that will be.
> 
> Hey Hugh,
> 
> Just as an additional data point, if it helps with priority :)
> 
> The lack of support of user xattrs on tmpfs our last remaining blocker for
> using unprivileged overlayfs mounts that use a tmpfs for the upper dir & work
> dir. Not that it isn't possible to use unprivileged overlayfs mounts in general,
> but not having this option means that use-cases for discardable upper layer
> changes are harder to clean up correctly when not on an tmpfs mount whose
> lifetime is bound to a mount namespace.
> 
> I don't think there's any rush; we can live with rmdir failing with EIO for now,
> but it'd be great to see this fixed rather than having to implement expensive
> cleanup routines that have to remove the upper+work dirs recursively with
> CAP_DAC_OVERRIDE.

Thanks for the encouragement.  At the time that I wrote that (20 July)
I did not expect to get around to it for months.  But there happens to
have been various VFS-involving works going on in mm/shmem.c recently,
targeting v6.6: which caused me to rearrange priorities, and join the
party with tmpfs user xattrs, see

https://lore.kernel.org/linux-fsdevel/e92a4d33-f97-7c84-95ad-4fed8e84608c@google.com/

Which Christian Brauner quickly put into his vfs.git (vfs.tmpfs branch):
so unless something goes horribly wrong, you can expect them in v6.6.

There's a lot that you wrote above which I have no understanding of
whatsoever (why would user xattrs stop rmdir failing?? it's okay, don't
try to educate me, I don't need to know, I'm just glad if they help you).

Though your mention of "unprivileged" does make me shiver a little:
Christian will understand the implications when I do not, but I wonder
if my effort to limit the memory usage of user xattrs to "inode space"
can be be undermined by unprivileged mounts with unlimited (or default,
that's bad enough) nr_inodes.

If so, that won't endanger the tmpfs user xattrs implementation, since
the problem would already go beyond those: can an unprivileged mount of
tmpfs allow its user to gobble up much more memory than is good for the
rest of the system?

Hugh

Franklin “Snaipe” Mathieu Aug. 15, 2023, 7:46 a.m. UTC | #6

On Tue, Aug 15, 2023 at 5:52 AM Hugh Dickins <hughd@google.com> wrote:
>
> Thanks for the encouragement.  At the time that I wrote that (20 July)
> I did not expect to get around to it for months.  But there happens to
> have been various VFS-involving works going on in mm/shmem.c recently,
> targeting v6.6: which caused me to rearrange priorities, and join the
> party with tmpfs user xattrs, see
>
> https://lore.kernel.org/linux-fsdevel/e92a4d33-f97-7c84-95ad-4fed8e84608c@google.com/
>
> Which Christian Brauner quickly put into his vfs.git (vfs.tmpfs branch):
> so unless something goes horribly wrong, you can expect them in v6.6.

That's great to hear, thanks!

> There's a lot that you wrote above which I have no understanding of
> whatsoever (why would user xattrs stop rmdir failing?? it's okay, don't
> try to educate me, I don't need to know, I'm just glad if they help you).
>
> Though your mention of "unprivileged" does make me shiver a little:
> Christian will understand the implications when I do not, but I wonder
> if my effort to limit the memory usage of user xattrs to "inode space"
> can be be undermined by unprivileged mounts with unlimited (or default,
> that's bad enough) nr_inodes.
>
> If so, that won't endanger the tmpfs user xattrs implementation, since
> the problem would already go beyond those: can an unprivileged mount of
> tmpfs allow its user to gobble up much more memory than is good for the
> rest of the system?

I don't actually know; I'm no expert in that area. That said, these
tmpfses are themselves attached to an unprivileged mount namespace, so
it would certainly be my assumption that in the case of an OOM
condition, the OOM killer would keep trying to kill processes in that
mount namespace until nothing else references it and all tmpfs mounts
can be reclaimed, but then again, that's only my assumption and not
necessarily reality.

That said, I got curious and decided to experiment; I booted a kernel
in a VM with 1GiB of memory and ran the following commands:

    $ unshare -Umfr bash
    # mount -t tmpfs tmp /mnt -o size=1g
    # dd if=/dev/urandom of=/mnt/oversize bs=1M count=1000

After about a second, the OOM killer woke up and killed bash then dd,
causing the mount namespace to be collected (and with it the tmpfs).
So far, so good.

I got suspicious that what I was seeing was that these were the only
reasonable candidates for the OOM killer, because there were no other
processes in that VM besides them & init, so I modified slightly the
experiment:

    $ dd if=/dev/zero of=/dev/null bs=10M count=10000000000 &
    $ unshare -Umfr bash
    # mount -t tmpfs tmp /mnt -o size=1g
    # dd if=/dev/urandom of=/mnt/oversize bs=1M count=1000

The intent being that the first dd would have a larger footprint than
the second because of the large block size, yet it shouldn't be killed
if the tmpfs usage was accounted for in processes in the mount
namespace. What happened however is that both the outer dd and the
outer shell got terminated, causing init to exit and with it the VM.

So, it's likely that there's some more work to do in that area; I'd
certainly expect the OOM killer to take the overall memory footprint
of mount namespaces into account when selecting which processes to
kill. It's also possible my experiment was flawed and not
representative of a real-life scenario, as I clearly have interacted
with misbehaving containers before, which got killed when they wrote
too much to tmpfs. But then again, my experiment also didn't take
memory cgroups into account.

--
Snaipe

Christian Brauner Aug. 15, 2023, 8:23 a.m. UTC | #7

On Tue, Aug 15, 2023 at 09:46:22AM +0200, Franklin “Snaipe” Mathieu wrote:
> On Tue, Aug 15, 2023 at 5:52 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Thanks for the encouragement.  At the time that I wrote that (20 July)
> > I did not expect to get around to it for months.  But there happens to
> > have been various VFS-involving works going on in mm/shmem.c recently,
> > targeting v6.6: which caused me to rearrange priorities, and join the
> > party with tmpfs user xattrs, see
> >
> > https://lore.kernel.org/linux-fsdevel/e92a4d33-f97-7c84-95ad-4fed8e84608c@google.com/
> >
> > Which Christian Brauner quickly put into his vfs.git (vfs.tmpfs branch):
> > so unless something goes horribly wrong, you can expect them in v6.6.
> 
> That's great to hear, thanks!
> 
> > There's a lot that you wrote above which I have no understanding of
> > whatsoever (why would user xattrs stop rmdir failing?? it's okay, don't
> > try to educate me, I don't need to know, I'm just glad if they help you).
> >
> > Though your mention of "unprivileged" does make me shiver a little:
> > Christian will understand the implications when I do not, but I wonder
> > if my effort to limit the memory usage of user xattrs to "inode space"
> > can be be undermined by unprivileged mounts with unlimited (or default,
> > that's bad enough) nr_inodes.
> >
> > If so, that won't endanger the tmpfs user xattrs implementation, since
> > the problem would already go beyond those: can an unprivileged mount of
> > tmpfs allow its user to gobble up much more memory than is good for the
> > rest of the system?
> 
> I don't actually know; I'm no expert in that area. That said, these
> tmpfses are themselves attached to an unprivileged mount namespace, so
> it would certainly be my assumption that in the case of an OOM
> condition, the OOM killer would keep trying to kill processes in that
> mount namespace until nothing else references it and all tmpfs mounts
> can be reclaimed, but then again, that's only my assumption and not
> necessarily reality.
> 
> That said, I got curious and decided to experiment; I booted a kernel
> in a VM with 1GiB of memory and ran the following commands:
> 
>     $ unshare -Umfr bash
>     # mount -t tmpfs tmp /mnt -o size=1g
>     # dd if=/dev/urandom of=/mnt/oversize bs=1M count=1000
> 
> After about a second, the OOM killer woke up and killed bash then dd,
> causing the mount namespace to be collected (and with it the tmpfs).
> So far, so good.
> 
> I got suspicious that what I was seeing was that these were the only
> reasonable candidates for the OOM killer, because there were no other
> processes in that VM besides them & init, so I modified slightly the
> experiment:
> 
>     $ dd if=/dev/zero of=/dev/null bs=10M count=10000000000 &
>     $ unshare -Umfr bash
>     # mount -t tmpfs tmp /mnt -o size=1g
>     # dd if=/dev/urandom of=/mnt/oversize bs=1M count=1000
> 
> The intent being that the first dd would have a larger footprint than
> the second because of the large block size, yet it shouldn't be killed
> if the tmpfs usage was accounted for in processes in the mount
> namespace. What happened however is that both the outer dd and the
> outer shell got terminated, causing init to exit and with it the VM.
> 
> So, it's likely that there's some more work to do in that area; I'd
> certainly expect the OOM killer to take the overall memory footprint
> of mount namespaces into account when selecting which processes to
> kill. It's also possible my experiment was flawed and not
> representative of a real-life scenario, as I clearly have interacted
> with misbehaving containers before, which got killed when they wrote
> too much to tmpfs. But then again, my experiment also didn't take
> memory cgroups into account.

So mount namespaces are orthogonal to that and they would be the wrong
layer to handle this.

Note that an unprivileged user (regular or via containers) on the system
can just exhaust all memory in various ways. Ultimately the container or
user would likely be taken down by in-kernel OOM or systemd-oomd or
similar tools under memory pressure.

Of course, all that means is that untrusted workloads need to have
cgroup memory limits. That also limits tmpfs instances and prevents
unprivileged user from using all memory.

If you don't set a memory limit then yes, the container might be able to
exhaust all memory but that's a bug in the container runtime. Also, at
some point the OOM killer or related userspace tools will select the
container init process for termination at which point all the namespaces
and mounts go away. That's probably what you experience as misbehaving
containers. The real bug there is probably that they're allowed to run
without memory limits in the first place.

Hugh Dickins Aug. 21, 2023, 5:52 p.m. UTC | #8

On Tue, 15 Aug 2023, Christian Brauner wrote:
> On Tue, Aug 15, 2023 at 09:46:22AM +0200, Franklin “Snaipe” Mathieu wrote:
> > 
> > So, it's likely that there's some more work to do in that area; I'd
> > certainly expect the OOM killer to take the overall memory footprint
> > of mount namespaces into account when selecting which processes to
> > kill. It's also possible my experiment was flawed and not
> > representative of a real-life scenario, as I clearly have interacted
> > with misbehaving containers before, which got killed when they wrote
> > too much to tmpfs. But then again, my experiment also didn't take
> > memory cgroups into account.
> 
> So mount namespaces are orthogonal to that and they would be the wrong
> layer to handle this.
> 
> Note that an unprivileged user (regular or via containers) on the system
> can just exhaust all memory in various ways. Ultimately the container or
> user would likely be taken down by in-kernel OOM or systemd-oomd or
> similar tools under memory pressure.
> 
> Of course, all that means is that untrusted workloads need to have
> cgroup memory limits. That also limits tmpfs instances and prevents
> unprivileged user from using all memory.
> 
> If you don't set a memory limit then yes, the container might be able to
> exhaust all memory but that's a bug in the container runtime. Also, at
> some point the OOM killer or related userspace tools will select the
> container init process for termination at which point all the namespaces
> and mounts go away. That's probably what you experience as misbehaving
> containers. The real bug there is probably that they're allowed to run
> without memory limits in the first place.

Thanks, this was a good reminder that I very much needed to look back at
the memory cgroup limiting of xattrs on tmpfs - I'd had the patch in the
original series, then was alarmed to find shmem_alloc_inode() using
GFP_KERNEL, so there seemed no point in accounting the xattrs if the
inodes were not being accounted: so dropped it temporarily.  I had
forgotten that SLAB_ACCOUNT on the kmem_cache ensures that accounting.

"tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs" just sent to fix it:
https://lore.kernel.org/linux-fsdevel/f6953e5a-4183-8314-38f2-40be60998615@google.com/

Thanks,
Hugh

Christian Brauner Aug. 22, 2023, 9:01 a.m. UTC | #9

> "tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs" just sent to fix it:
> https://lore.kernel.org/linux-fsdevel/f6953e5a-4183-8314-38f2-40be60998615@google.com/

Thank you. I snatched it now.

Oleksandr Tymoshenko Aug. 22, 2023, 5:50 p.m. UTC | #10

Thanks for working on this.

On Mon, Aug 21, 2023 at 10:52 AM Hugh Dickins <hughd@google.com> wrote:

> On Tue, 15 Aug 2023, Christian Brauner wrote:
> > On Tue, Aug 15, 2023 at 09:46:22AM +0200, Franklin “Snaipe” Mathieu
> wrote:
> > >
> > > So, it's likely that there's some more work to do in that area; I'd
> > > certainly expect the OOM killer to take the overall memory footprint
> > > of mount namespaces into account when selecting which processes to
> > > kill. It's also possible my experiment was flawed and not
> > > representative of a real-life scenario, as I clearly have interacted
> > > with misbehaving containers before, which got killed when they wrote
> > > too much to tmpfs. But then again, my experiment also didn't take
> > > memory cgroups into account.
> >
> > So mount namespaces are orthogonal to that and they would be the wrong
> > layer to handle this.
> >
> > Note that an unprivileged user (regular or via containers) on the system
> > can just exhaust all memory in various ways. Ultimately the container or
> > user would likely be taken down by in-kernel OOM or systemd-oomd or
> > similar tools under memory pressure.
> >
> > Of course, all that means is that untrusted workloads need to have
> > cgroup memory limits. That also limits tmpfs instances and prevents
> > unprivileged user from using all memory.
> >
> > If you don't set a memory limit then yes, the container might be able to
> > exhaust all memory but that's a bug in the container runtime. Also, at
> > some point the OOM killer or related userspace tools will select the
> > container init process for termination at which point all the namespaces
> > and mounts go away. That's probably what you experience as misbehaving
> > containers. The real bug there is probably that they're allowed to run
> > without memory limits in the first place.
>
> Thanks, this was a good reminder that I very much needed to look back at
> the memory cgroup limiting of xattrs on tmpfs - I'd had the patch in the
> original series, then was alarmed to find shmem_alloc_inode() using
> GFP_KERNEL, so there seemed no point in accounting the xattrs if the
> inodes were not being accounted: so dropped it temporarily.  I had
> forgotten that SLAB_ACCOUNT on the kmem_cache ensures that accounting.
>
> "tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs" just sent to fix it:
>
> https://lore.kernel.org/linux-fsdevel/f6953e5a-4183-8314-38f2-40be60998615@google.com/
>
> Thanks,
> Hugh

shmem: add support for user extended attributes

Commit Message

Comments

Patch