[RFC,3/3] ima: make the integrity inode cache per namespace

Message ID	20211127164549.2571457-4-James.Bottomley@HansenPartnership.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-integrity-owner@kernel.org> From: James Bottomley <James.Bottomley@HansenPartnership.com> To: linux-integrity@vger.kernel.org Cc: containers@lists.linux.dev, Mimi Zohar <zohar@linux.ibm.com>, Dmitry Kasatkin <dmitry.kasatkin@gmail.com>, Stefan Berger <stefanb@linux.ibm.com>, "Eric W . Biederman" <ebiederm@xmission.com>, krzysztof.struczynski@huawei.com, Roberto Sassu <roberto.sassu@huawei.com>, "Serge E . Hallyn" <serge@hallyn.com>, Michael Peters <mpeters@redhat.com>, Luke Hinds <lhinds@redhat.com>, Lily Sturmann <lsturman@redhat.com>, Patrick Uiterwijk <puiterwi@redhat.com>, Christian Brauner <christian@brauner.io> Subject: [RFC 3/3] ima: make the integrity inode cache per namespace Date: Sat, 27 Nov 2021 16:45:49 +0000 Message-Id: <20211127164549.2571457-4-James.Bottomley@HansenPartnership.com> In-Reply-To: <20211127164549.2571457-1-James.Bottomley@HansenPartnership.com> References: <20211127164549.2571457-1-James.Bottomley@HansenPartnership.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Namespace IMA \| expand [RFC,0/3] Namespace IMA [RFC,1/3] userns: add uuid field [RFC,2/3] ima: Namespace IMA [RFC,3/3] ima: make the integrity inode cache per namespace

James Bottomley Nov. 27, 2021, 4:45 p.m. UTC

Currently we get one entry in the IMA log per unique file event.  So,
if you have a measurement policy and it measures a particular binary
it will not get measured again if it is subsequently executed. For
Namespaced IMA, the correct behaviour seems to be to log once per
inode per namespace (so every unique execution in a namespace gets a
separate log entry).  Since logging once per inode per namespace is
different from current behaviour, it is only activated if the
namespace appears in the log template (so there's no behaviour change
for any of the original templates).

Expand the iint cache to have a list of namespaces, per iint entry,
the inode has been seen in by moving the action flags and the
measured_pcrs into a per-namespace structure.  The lifetime of these
additional list entries is tied to the lifetime of the iint entry and
the namespace, so if either is deleted, the new entry is.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 include/linux/ima.h                       |  13 ++-
 include/linux/user_namespace.h            |   5 +
 kernel/user.c                             |   1 +
 kernel/user_namespace.c                   |   6 ++
 security/integrity/iint.c                 |   4 +-
 security/integrity/ima/Makefile           |   2 +-
 security/integrity/ima/ima.h              |  21 +++-
 security/integrity/ima/ima_api.c          |   7 +-
 security/integrity/ima/ima_main.c         |  21 ++--
 security/integrity/ima/ima_ns.c           | 115 ++++++++++++++++++++++
 security/integrity/ima/ima_policy.c       |   2 +-
 security/integrity/ima/ima_template_lib.c |   2 +
 security/integrity/integrity.h            |  11 ++-
 13 files changed, 192 insertions(+), 18 deletions(-)
 create mode 100644 security/integrity/ima/ima_ns.c

Serge Hallyn Nov. 29, 2021, 4:58 a.m. UTC | #1

On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
> Currently we get one entry in the IMA log per unique file event.  So,
> if you have a measurement policy and it measures a particular binary
> it will not get measured again if it is subsequently executed. For
> Namespaced IMA, the correct behaviour seems to be to log once per
> inode per namespace (so every unique execution in a namespace gets a
> separate log entry).  Since logging once per inode per namespace is

I suspect I'll need to do a more in depth reading of the existing
code, but I'll ask the lazy question anyway (since you say "the
correct behavior seems to be") - is it actually important that
files which were appraised under a parent namespace's policy already
should be logged again?  Since you used the word "log" I'm assuming
this isn't building a running hash like a tpm pcr where skipping one
would invalidate rmeote attestation?

> different from current behaviour, it is only activated if the
> namespace appears in the log template (so there's no behaviour change
> for any of the original templates).
> 
> Expand the iint cache to have a list of namespaces, per iint entry,
> the inode has been seen in by moving the action flags and the
> measured_pcrs into a per-namespace structure.  The lifetime of these
> additional list entries is tied to the lifetime of the iint entry and
> the namespace, so if either is deleted, the new entry is.
> 
> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
> ---
>  include/linux/ima.h                       |  13 ++-
>  include/linux/user_namespace.h            |   5 +
>  kernel/user.c                             |   1 +
>  kernel/user_namespace.c                   |   6 ++
>  security/integrity/iint.c                 |   4 +-
>  security/integrity/ima/Makefile           |   2 +-
>  security/integrity/ima/ima.h              |  21 +++-
>  security/integrity/ima/ima_api.c          |   7 +-
>  security/integrity/ima/ima_main.c         |  21 ++--
>  security/integrity/ima/ima_ns.c           | 115 ++++++++++++++++++++++
>  security/integrity/ima/ima_policy.c       |   2 +-
>  security/integrity/ima/ima_template_lib.c |   2 +
>  security/integrity/integrity.h            |  11 ++-
>  13 files changed, 192 insertions(+), 18 deletions(-)
>  create mode 100644 security/integrity/ima/ima_ns.c
> 
> diff --git a/include/linux/ima.h b/include/linux/ima.h
> index 09b14b73889e..dbbc0257d065 100644
> --- a/include/linux/ima.h
> +++ b/include/linux/ima.h
> @@ -40,7 +40,8 @@ extern int ima_measure_critical_data(const char *event_label,
>  				     const char *event_name,
>  				     const void *buf, size_t buf_len,
>  				     bool hash, u8 *digest, size_t digest_len);
> -
> +extern void ima_init_user_ns(struct user_namespace *ns);
> +extern void ima_free_user_ns(struct user_namespace *ns);
>  #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
>  extern void ima_appraise_parse_cmdline(void);
>  #else
> @@ -154,6 +155,16 @@ static inline int ima_measure_critical_data(const char *event_label,
>  	return -ENOENT;
>  }
>  
> +static inline void ima_init_user_ns(struct user_namespace *ns)
> +{
> +	return;
> +}
> +
> +static inline void ima_free_user_ns(struct user_namespace *ns)
> +{
> +	return;
> +}
> +
>  #endif /* CONFIG_IMA */
>  
>  #ifndef CONFIG_IMA_KEXEC
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index d155788abdc1..52968764b195 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -8,6 +8,7 @@
>  #include <linux/sched.h>
>  #include <linux/workqueue.h>
>  #include <linux/rwsem.h>
> +#include <linux/rwlock.h>
>  #include <linux/sysctl.h>
>  #include <linux/err.h>
>  #include <linux/uuid.h>
> @@ -101,6 +102,10 @@ struct user_namespace {
>  	struct ucounts		*ucounts;
>  	long ucount_max[UCOUNT_COUNTS];
>  	uuid_t			uuid;
> +#ifdef CONFIG_IMA
> +	struct list_head	ima_inode_list;
> +	rwlock_t		ima_inode_list_lock;
> +#endif
>  } __randomize_layout;
>  
>  struct ucounts {
> diff --git a/kernel/user.c b/kernel/user.c
> index bf9ae1d0b670..670d3c41c817 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -68,6 +68,7 @@ struct user_namespace init_user_ns = {
>  	.keyring_sem = __RWSEM_INITIALIZER(init_user_ns.keyring_sem),
>  #endif
>  	/* .uuid is initialized in user_namespaces_init() */
> +	/* all IMA fields are initialized by ima_init_user_ns() */
>  };
>  EXPORT_SYMBOL_GPL(init_user_ns);
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 8ce57c16ddd3..8afa5dc0b992 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  
>  #include <linux/export.h>
> +#include <linux/ima.h>
>  #include <linux/nsproxy.h>
>  #include <linux/slab.h>
>  #include <linux/sched/signal.h>
> @@ -144,6 +145,9 @@ int create_user_ns(struct cred *new)
>  	uuid_gen(&ns->uuid);
>  
>  	set_cred_user_ns(new, ns);
> +
> +	ima_init_user_ns(ns);
> +
>  	return 0;
>  fail_keyring:
>  #ifdef CONFIG_PERSISTENT_KEYRINGS
> @@ -200,6 +204,7 @@ static void free_user_ns(struct work_struct *work)
>  		}
>  		retire_userns_sysctls(ns);
>  		key_free_user_ns(ns);
> +		ima_free_user_ns(ns);
>  		ns_free_inum(&ns->ns);
>  		kmem_cache_free(user_ns_cachep, ns);
>  		dec_user_namespaces(ucounts);
> @@ -1389,6 +1394,7 @@ static __init int user_namespaces_init(void)
>  {
>  	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  	uuid_gen(&init_user_ns.uuid);
> +	ima_init_user_ns(&init_user_ns);
>  	return 0;
>  }
>  subsys_initcall(user_namespaces_init);
> diff --git a/security/integrity/iint.c b/security/integrity/iint.c
> index 8638976f7990..f714532feb7d 100644
> --- a/security/integrity/iint.c
> +++ b/security/integrity/iint.c
> @@ -71,6 +71,7 @@ struct integrity_iint_cache *integrity_iint_find(struct inode *inode)
>  static void iint_free(struct integrity_iint_cache *iint)
>  {
>  	kfree(iint->ima_hash);
> +	ima_ns_iint_list_free(iint);
>  	iint->ima_hash = NULL;
>  	iint->version = 0;
>  	iint->flags = 0UL;
> @@ -81,7 +82,7 @@ static void iint_free(struct integrity_iint_cache *iint)
>  	iint->ima_read_status = INTEGRITY_UNKNOWN;
>  	iint->ima_creds_status = INTEGRITY_UNKNOWN;
>  	iint->evm_status = INTEGRITY_UNKNOWN;
> -	iint->measured_pcrs = 0;
> +	INIT_LIST_HEAD(&iint->ns_list);
>  	kmem_cache_free(iint_cache, iint);
>  }
>  
> @@ -170,6 +171,7 @@ static void init_once(void *foo)
>  	iint->ima_creds_status = INTEGRITY_UNKNOWN;
>  	iint->evm_status = INTEGRITY_UNKNOWN;
>  	mutex_init(&iint->mutex);
> +	INIT_LIST_HEAD(&iint->ns_list);
>  }
>  
>  static int __init integrity_iintcache_init(void)
> diff --git a/security/integrity/ima/Makefile b/security/integrity/ima/Makefile
> index 2499f2485c04..1741aa5d97bc 100644
> --- a/security/integrity/ima/Makefile
> +++ b/security/integrity/ima/Makefile
> @@ -7,7 +7,7 @@
>  obj-$(CONFIG_IMA) += ima.o
>  
>  ima-y := ima_fs.o ima_queue.o ima_init.o ima_main.o ima_crypto.o ima_api.o \
> -	 ima_policy.o ima_template.o ima_template_lib.o
> +	 ima_policy.o ima_template.o ima_template_lib.o ima_ns.o
>  ima-$(CONFIG_IMA_APPRAISE) += ima_appraise.o
>  ima-$(CONFIG_IMA_APPRAISE_MODSIG) += ima_modsig.o
>  ima-$(CONFIG_HAVE_IMA_KEXEC) += ima_kexec.o
> diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
> index be965a8715e4..047256be7195 100644
> --- a/security/integrity/ima/ima.h
> +++ b/security/integrity/ima/ima.h
> @@ -119,6 +119,16 @@ struct ima_kexec_hdr {
>  	u64 count;
>  };
>  
> +/* IMA cache of per-user-namespace flags */
> +struct ima_ns_cache {
> +	struct user_namespace *ns;
> +	struct integrity_iint_cache *iint;
> +	struct list_head ns_list;
> +	struct list_head iint_list;
> +	unsigned long measured_pcrs;
> +	unsigned long flags;
> +};
> +
>  extern const int read_idmap[];
>  
>  #ifdef CONFIG_HAVE_IMA_KEXEC
> @@ -263,7 +273,8 @@ int ima_must_measure(struct inode *inode, int mask, enum ima_hooks func);
>  int ima_collect_measurement(struct integrity_iint_cache *iint,
>  			    struct file *file, void *buf, loff_t size,
>  			    enum hash_algo algo, struct modsig *modsig);
> -void ima_store_measurement(struct integrity_iint_cache *iint, struct file *file,
> +void ima_store_measurement(struct integrity_iint_cache *iint,
> +			   struct ima_ns_cache *nsc, struct file *file,
>  			   const unsigned char *filename,
>  			   struct evm_ima_xattr_data *xattr_value,
>  			   int xattr_len, const struct modsig *modsig, int pcr,
> @@ -301,6 +312,14 @@ void *ima_policy_next(struct seq_file *m, void *v, loff_t *pos);
>  void ima_policy_stop(struct seq_file *m, void *v);
>  int ima_policy_show(struct seq_file *m, void *v);
>  
> +/* IMA Namespace related functions */
> +extern bool ima_ns_in_template;
> +struct ima_ns_cache *ima_ns_cache_get(struct integrity_iint_cache *iint,
> +				      struct user_namespace *ns);
> +void ima_ns_cache_clear(struct integrity_iint_cache *iint);
> +void ima_init_ns(void);
> +
> +
>  /* Appraise integrity measurements */
>  #define IMA_APPRAISE_ENFORCE	0x01
>  #define IMA_APPRAISE_FIX	0x02
> diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
> index a64fb0130b01..d613ea1ee378 100644
> --- a/security/integrity/ima/ima_api.c
> +++ b/security/integrity/ima/ima_api.c
> @@ -298,6 +298,7 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
>   * Must be called with iint->mutex held.
>   */
>  void ima_store_measurement(struct integrity_iint_cache *iint,
> +			   struct ima_ns_cache *nsc,
>  			   struct file *file, const unsigned char *filename,
>  			   struct evm_ima_xattr_data *xattr_value,
>  			   int xattr_len, const struct modsig *modsig, int pcr,
> @@ -322,7 +323,7 @@ void ima_store_measurement(struct integrity_iint_cache *iint,
>  	 * appraisal, but a file measurement from earlier might already exist in
>  	 * the measurement list.
>  	 */
> -	if (iint->measured_pcrs & (0x1 << pcr) && !modsig)
> +	if (nsc->measured_pcrs & (0x1 << pcr) && !modsig)
>  		return;
>  
>  	result = ima_alloc_init_template(&event_data, &entry, template_desc);
> @@ -334,8 +335,8 @@ void ima_store_measurement(struct integrity_iint_cache *iint,
>  
>  	result = ima_store_template(entry, violation, inode, filename, pcr);
>  	if ((!result || result == -EEXIST) && !(file->f_flags & O_DIRECT)) {
> -		iint->flags |= IMA_MEASURED;
> -		iint->measured_pcrs |= (0x1 << pcr);
> +		nsc->flags |= IMA_MEASURED;
> +		nsc->measured_pcrs |= (0x1 << pcr);
>  	}
>  	if (result < 0)
>  		ima_free_template_entry(entry);
> diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
> index 465865412100..049710203fac 100644
> --- a/security/integrity/ima/ima_main.c
> +++ b/security/integrity/ima/ima_main.c
> @@ -168,8 +168,8 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
>  		if (!IS_I_VERSION(inode) ||
>  		    !inode_eq_iversion(inode, iint->version) ||
>  		    (iint->flags & IMA_NEW_FILE)) {
> -			iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
> -			iint->measured_pcrs = 0;
> +			iint->flags &= ~IMA_NEW_FILE;
> +			ima_ns_cache_clear(iint);
>  			if (update)
>  				ima_update_xattr(iint, file);
>  		}
> @@ -204,6 +204,7 @@ static int process_measurement(struct file *file, const struct cred *cred,
>  {
>  	struct inode *inode = file_inode(file);
>  	struct integrity_iint_cache *iint = NULL;
> +	struct ima_ns_cache *nsc = NULL;
>  	struct ima_template_desc *template_desc = NULL;
>  	char *pathbuf = NULL;
>  	char filename[NAME_MAX];
> @@ -274,20 +275,20 @@ static int process_measurement(struct file *file, const struct cred *cred,
>  	    ((inode->i_sb->s_iflags & SB_I_IMA_UNVERIFIABLE_SIGNATURE) &&
>  	     !(inode->i_sb->s_iflags & SB_I_UNTRUSTED_MOUNTER) &&
>  	     !(action & IMA_FAIL_UNVERIFIABLE_SIGS))) {
> -		iint->flags &= ~IMA_DONE_MASK;
> -		iint->measured_pcrs = 0;
> +		ima_ns_cache_clear(iint);
>  	}
> +	nsc = ima_ns_cache_get(iint, current_user_ns());
>  
>  	/* Determine if already appraised/measured based on bitmask
>  	 * (IMA_MEASURE, IMA_MEASURED, IMA_XXXX_APPRAISE, IMA_XXXX_APPRAISED,
>  	 *  IMA_AUDIT, IMA_AUDITED)
>  	 */
> -	iint->flags |= action;
> +	nsc->flags |= action;
>  	action &= IMA_DO_MASK;
> -	action &= ~((iint->flags & (IMA_DONE_MASK ^ IMA_MEASURED)) >> 1);
> +	action &= ~((nsc->flags & (IMA_DONE_MASK ^ IMA_MEASURED)) >> 1);
>  
>  	/* If target pcr is already measured, unset IMA_MEASURE action */
> -	if ((action & IMA_MEASURE) && (iint->measured_pcrs & (0x1 << pcr)))
> +	if ((action & IMA_MEASURE) && (nsc->measured_pcrs & (0x1 << pcr)))
>  		action ^= IMA_MEASURE;
>  
>  	/* HASH sets the digital signature and update flags, nothing else */
> @@ -297,7 +298,7 @@ static int process_measurement(struct file *file, const struct cred *cred,
>  		if ((xattr_value && xattr_len > 2) &&
>  		    (xattr_value->type == EVM_IMA_XATTR_DIGSIG))
>  			set_bit(IMA_DIGSIG, &iint->atomic_flags);
> -		iint->flags |= IMA_HASHED;
> +		nsc->flags |= IMA_HASHED;
>  		action ^= IMA_HASH;
>  		set_bit(IMA_UPDATE_XATTR, &iint->atomic_flags);
>  	}
> @@ -342,7 +343,7 @@ static int process_measurement(struct file *file, const struct cred *cred,
>  		pathname = ima_d_path(&file->f_path, &pathbuf, filename);
>  
>  	if (action & IMA_MEASURE)
> -		ima_store_measurement(iint, file, pathname,
> +		ima_store_measurement(iint, nsc, file, pathname,
>  				      xattr_value, xattr_len, modsig, pcr,
>  				      template_desc);
>  	if (rc == 0 && (action & IMA_APPRAISE_SUBMASK)) {
> @@ -1047,6 +1048,8 @@ static int __init init_ima(void)
>  	if (error)
>  		return error;
>  
> +	ima_init_ns();
> +
>  	error = register_blocking_lsm_notifier(&ima_lsm_policy_notifier);
>  	if (error)
>  		pr_warn("Couldn't register LSM notifier, error %d\n", error);
> diff --git a/security/integrity/ima/ima_ns.c b/security/integrity/ima/ima_ns.c
> new file mode 100644
> index 000000000000..7134ba7f3544
> --- /dev/null
> +++ b/security/integrity/ima/ima_ns.c
> @@ -0,0 +1,115 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * IMA Namespace Support routines
> + */
> +
> +#include <linux/ima.h>
> +#include <linux/user_namespace.h>
> +
> +#include "ima.h"
> +
> +static struct kmem_cache *ns_cache __read_mostly;
> +bool ima_ns_in_template __read_mostly;
> +
> +void __init ima_init_ns(void)
> +{
> +	ns_cache = KMEM_CACHE(ima_ns_cache, SLAB_PANIC);
> +}
> +
> +void ima_init_user_ns(struct user_namespace *ns)
> +{
> +	INIT_LIST_HEAD(&ns->ima_inode_list);
> +}
> +
> +void ima_free_user_ns(struct user_namespace *ns)
> +{
> +	struct ima_ns_cache *entry;
> +
> +	/* no refs to ns left, so no need to lock */
> +	while (!list_empty(&ns->ima_inode_list)) {
> +		entry = list_entry(ns->ima_inode_list.next, struct ima_ns_cache,
> +				   ns_list);
> +
> +		/* iint cache entry is still active to lock to delete */
> +		write_lock(&entry->iint->ns_list_lock);
> +		list_del(&entry->iint_list);
> +		write_unlock(&entry->iint->ns_list_lock);
> +
> +		list_del(&entry->ns_list);
> +		kmem_cache_free(ns_cache, entry);
> +	}
> +}
> +
> +struct ima_ns_cache *ima_ns_cache_get(struct integrity_iint_cache *iint,
> +				      struct user_namespace *ns)
> +{
> +	struct ima_ns_cache *entry = NULL;
> +
> +	if (!ima_ns_in_template)
> +		/*
> +		 * if we're not logging the namespace, don't separate the
> +		 * iint cache per namespace.  This preserves original
> +		 * behaviour for the non-ns case.
> +		 */
> +		ns = &init_user_ns;
> +
> +	read_lock(&iint->ns_list_lock);
> +	list_for_each_entry(entry, &iint->ns_list, ns_list)
> +		if (entry->ns == ns)
> +			break;
> +	read_unlock(&iint->ns_list_lock);
> +
> +	if (entry && entry->ns == ns)
> +		return entry;
> +
> +	entry = kmem_cache_zalloc(ns_cache, GFP_NOFS);
> +	if (!entry)
> +		return NULL;
> +
> +	/* no refs taken: entry is freed on either ns delete or iint delete */
> +	entry->ns = ns;
> +	entry->iint = iint;
> +
> +	write_lock(&iint->ns_list_lock);
> +	list_add_tail(&entry->iint_list, &iint->ns_list);
> +	write_unlock(&iint->ns_list_lock);
> +
> +	write_lock(&ns->ima_inode_list_lock);
> +	list_add_tail(&entry->ns_list, &ns->ima_inode_list);
> +	write_unlock(&ns->ima_inode_list_lock);
> +
> +	return entry;
> +}
> +
> +/* clear the flags and measured PCR for every entry in the iint  */
> +void ima_ns_cache_clear(struct integrity_iint_cache *iint)
> +{
> +	struct ima_ns_cache *entry;
> +
> +	read_lock(&iint->ns_list_lock);
> +	list_for_each_entry(entry, &iint->ns_list, ns_list) {
> +		entry->flags = 0;
> +		entry->measured_pcrs = 0;
> +	}
> +	read_unlock(&iint->ns_list_lock);
> +}
> +
> +void ima_ns_iint_list_free(struct integrity_iint_cache *iint)
> +{
> +	struct ima_ns_cache *entry;
> +
> +	/* iint locking unnecessary; no-one should have acces to the list */
> +	while (!list_empty(&iint->ns_list)) {
> +		entry = list_entry(iint->ns_list.next, struct ima_ns_cache,
> +				   iint_list);
> +
> +		/* namespace is still active to lock to delete */
> +		write_lock(&entry->ns->ima_inode_list_lock);
> +		list_del(&entry->ns_list);
> +		write_unlock(&entry->ns->ima_inode_list_lock);
> +
> +		list_del(&entry->iint_list);
> +		kmem_cache_free(ns_cache, entry);
> +	}
> +}
> diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
> index 320ca80aacab..9434a1064da6 100644
> --- a/security/integrity/ima/ima_policy.c
> +++ b/security/integrity/ima/ima_policy.c
> @@ -50,7 +50,7 @@
>  #define DONT_HASH	0x0200
>  
>  #define INVALID_PCR(a) (((a) < 0) || \
> -	(a) >= (sizeof_field(struct integrity_iint_cache, measured_pcrs) * 8))
> +	(a) >= (sizeof_field(struct ima_ns_cache, measured_pcrs) * 8))
>  
>  int ima_policy_flag;
>  static int temp_ima_appraise;
> diff --git a/security/integrity/ima/ima_template_lib.c b/security/integrity/ima/ima_template_lib.c
> index ebd54c1b5206..d1a381add127 100644
> --- a/security/integrity/ima/ima_template_lib.c
> +++ b/security/integrity/ima/ima_template_lib.c
> @@ -703,6 +703,8 @@ int ima_eventinodexattrvalues_init(struct ima_event_data *event_data,
>  int ima_ns_init(struct ima_event_data *event_data,
>  		struct ima_field_data *field_data)
>  {
> +	if (unlikely(!ima_ns_in_template))
> +		ima_ns_in_template = true;
>  	return ima_write_template_field_data(&current_user_ns()->uuid,
>  					     UUID_SIZE, DATA_FMT_UUID,
>  					     field_data);
> diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
> index 547425c20e11..8755635da3ff 100644
> --- a/security/integrity/integrity.h
> +++ b/security/integrity/integrity.h
> @@ -129,7 +129,6 @@ struct integrity_iint_cache {
>  	struct inode *inode;	/* back pointer to inode in question */
>  	u64 version;		/* track inode changes */
>  	unsigned long flags;
> -	unsigned long measured_pcrs;
>  	unsigned long atomic_flags;
>  	enum integrity_status ima_file_status:4;
>  	enum integrity_status ima_mmap_status:4;
> @@ -138,6 +137,8 @@ struct integrity_iint_cache {
>  	enum integrity_status ima_creds_status:4;
>  	enum integrity_status evm_status:4;
>  	struct ima_digest_data *ima_hash;
> +	struct list_head ns_list; /* list of namespaces inode seen in */
> +	rwlock_t ns_list_lock;
>  };
>  
>  /* rbtree tree calls to lookup, insert, delete
> @@ -225,6 +226,14 @@ static inline void ima_load_x509(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_IMA
> +void ima_ns_iint_list_free(struct integrity_iint_cache *iint);
> +#else
> +void ima_ns_iint_list_free(struct integrity_iint_cache *iint)
> +{
> +}
> +#endif
> +
>  #ifdef CONFIG_EVM_LOAD_X509
>  void __init evm_load_x509(void);
>  #else
> -- 
> 2.33.0

James Bottomley Nov. 29, 2021, 12:50 p.m. UTC | #2

On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
> > Currently we get one entry in the IMA log per unique file
> > event.  So, if you have a measurement policy and it measures a
> > particular binary it will not get measured again if it is
> > subsequently executed. For Namespaced IMA, the correct behaviour
> > seems to be to log once per inode per namespace (so every unique
> > execution in a namespace gets a separate log entry).  Since logging
> > once per inode per namespace is
> 
> I suspect I'll need to do a more in depth reading of the existing
> code, but I'll ask the lazy question anyway (since you say "the
> correct behavior seems to be") - is it actually important that
> files which were appraised under a parent namespace's policy already
> should be logged again?

I think so.  For a couple of reasons, assuming the namespace eventually
gets its own log entries, which the next incremental patch proposed to
do by virtualizing the securityfs entries.  If you don't do this:

   1. The namespace log will be incomplete in a random way depending on
      what execution preceded it.  This violates the principle of least
      surprise, because you at least expect the IMA log to be consistent
      per set of executions.
   2. A malicious namespace owner can use the missing information in their
      log to infer the execution of others, which is an information leak
      IMA tries to prevent by keeping the log readable only by root.

>   Since you used the word "log" I'm assuming this isn't building a
> running hash like a tpm pcr where skipping one would invalidate
> rmeote attestation?

Their is only one IMA log, so each entry must be extended through the
PCR to keep the quote correct.  That means each per namespace entry
does extend the PCR.  However, since the entire entry is logged and the
namespace uuid is part of the entry, the entry hash is different for
each, so the IMA rule about duplicate log entries doesn't apply, if
that was the worry?

James

Stefan Berger Nov. 29, 2021, 1:53 p.m. UTC | #3

On 11/29/21 07:50, James Bottomley wrote:
> On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
>> On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
>>> Currently we get one entry in the IMA log per unique file
>>> event.  So, if you have a measurement policy and it measures a
>>> particular binary it will not get measured again if it is
>>> subsequently executed. For Namespaced IMA, the correct behaviour
>>> seems to be to log once per inode per namespace (so every unique
>>> execution in a namespace gets a separate log entry).  Since logging
>>> once per inode per namespace is
>> I suspect I'll need to do a more in depth reading of the existing
>> code, but I'll ask the lazy question anyway (since you say "the
>> correct behavior seems to be") - is it actually important that
>> files which were appraised under a parent namespace's policy already
>> should be logged again?
> I think so.  For a couple of reasons, assuming the namespace eventually
> gets its own log entries, which the next incremental patch proposed to
> do by virtualizing the securityfs entries.  If you don't do this:

To avoid duplicate efforts, an implementation of a virtualized 
securityfs is in this series here:

https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3

It starts with 'securityfs: Prefix global variables with secruityfs_'

    Stefan

James Bottomley Nov. 29, 2021, 2:10 p.m. UTC | #4

On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> On 11/29/21 07:50, James Bottomley wrote:
> > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
> > > > Currently we get one entry in the IMA log per unique file
> > > > event.  So, if you have a measurement policy and it measures a
> > > > particular binary it will not get measured again if it is
> > > > subsequently executed. For Namespaced IMA, the correct
> > > > behaviour
> > > > seems to be to log once per inode per namespace (so every
> > > > unique
> > > > execution in a namespace gets a separate log entry).  Since
> > > > logging
> > > > once per inode per namespace is
> > > I suspect I'll need to do a more in depth reading of the existing
> > > code, but I'll ask the lazy question anyway (since you say "the
> > > correct behavior seems to be") - is it actually important that
> > > files which were appraised under a parent namespace's policy
> > > already
> > > should be logged again?
> > I think so.  For a couple of reasons, assuming the namespace
> > eventually
> > gets its own log entries, which the next incremental patch proposed
> > to
> > do by virtualizing the securityfs entries.  If you don't do this:
> 
> To avoid duplicate efforts, an implementation of a virtualized 
> securityfs is in this series here:
> 
> https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> 
> It starts with 'securityfs: Prefix global variables with secruityfs_'

That's quite a big patch series.  I already actually implemented this
as part of the RFC for getting the per namespace measurement log.  The
attached is basically what I did.

Most of the time we don't require namespacing the actual virtualfs
file, because it's world readable.  IMA has a special requirement in
this regard because the IMA files should be readable (and writeable
when we get around to policy updates) by the admin of the namespace but
their protection is 0640 or 0440.  I thought the simplest solution
would be to make an additional flag that coped with the permissions and
a per-inode flag way of making the file as "accessible by userns
admin".  Doing something simple like this gives a much smaller
diffstat:

 fs/stat.c                |    8 ++++++++
 include/linux/fs.h       |    1 +
 include/linux/security.h |    2 +-
 kernel/capability.c      |    6 ++++--
 security/inode.c         |    6 ++++--
 5 files changed, 18 insertions(+), 5 deletions(-)

The full diff is below so you can see how I did it.  There's a sort of
serendipity bit where it turns out the next i_flag entry is at bit 17
meaning it can be overloaded on the mode which is a u16, but other than
that I think it's pretty clean.

James

---

diff --git a/fs/stat.c b/fs/stat.c
index 28d2020ba1f4..2025dfea6720 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -49,6 +49,14 @@ void generic_fillattr(struct user_namespace *mnt_userns, struct inode *inode,
 	stat->nlink = inode->i_nlink;
 	stat->uid = i_uid_into_mnt(mnt_userns, inode);
 	stat->gid = i_gid_into_mnt(mnt_userns, inode);
+	if (inode->i_flags & S_NSUSER) {
+		struct user_namespace *ns = current_user_ns();
+
+		if (uid_eq(stat->uid, GLOBAL_ROOT_UID))
+			stat->uid = ns->owner;
+		if (gid_eq(stat->gid, GLOBAL_ROOT_GID))
+			stat->gid = ns->group;
+	}
 	stat->rdev = inode->i_rdev;
 	stat->size = i_size_read(inode);
 	stat->atime = inode->i_atime;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1cb616fc1105..8bebac8463ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2249,6 +2249,7 @@ struct super_operations {
 #define S_ENCRYPTED	(1 << 14) /* Encrypted file (using fs/crypto/) */
 #define S_CASEFOLD	(1 << 15) /* Casefolded file */
 #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
+#define S_NSUSER	(1 << 17) /* uid/gid changes with user_ns */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
diff --git a/include/linux/security.h b/include/linux/security.h
index bbf44a466832..1cfe8832f5e0 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1919,7 +1919,7 @@ static inline void security_audit_rule_free(void *lsmrule)
 
 #ifdef CONFIG_SECURITYFS
 
-extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
+extern struct dentry *securityfs_create_file(const char *name, unsigned int mode,
 					     struct dentry *parent, void *data,
 					     const struct file_operations *fops);
 extern struct dentry *securityfs_create_dir(const char *name, struct dentry *parent);
diff --git a/kernel/capability.c b/kernel/capability.c
index 46a361dde042..54907ffa4947 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -506,8 +506,10 @@ bool capable_wrt_inode_uidgid(struct user_namespace *mnt_userns,
 {
 	struct user_namespace *ns = current_user_ns();
 
-	return ns_capable(ns, cap) &&
-	       privileged_wrt_inode_uidgid(ns, mnt_userns, inode);
+	if (ns_capable(ns, cap) &&
+	    privileged_wrt_inode_uidgid(ns, mnt_userns, inode))
+		return true;
+	return (inode->i_flags & S_NSUSER) && ns_capable(ns, cap);
 }
 EXPORT_SYMBOL(capable_wrt_inode_uidgid);
 
diff --git a/security/inode.c b/security/inode.c
index 6c326939750d..917514ddfce4 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -104,7 +104,7 @@ static struct file_system_type fs_type = {
  * If securityfs is not enabled in the kernel, the value %-ENODEV is
  * returned.
  */
-static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
+static struct dentry *securityfs_create_dentry(const char *name, unsigned int mode,
 					struct dentry *parent, void *data,
 					const struct file_operations *fops,
 					const struct inode_operations *iops)
@@ -112,6 +112,7 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 	struct dentry *dentry;
 	struct inode *dir, *inode;
 	int error;
+	unsigned int flags = mode & 0xffff0000;
 
 	if (!(mode & S_IFMT))
 		mode = (mode & S_IALLUGO) | S_IFREG;
@@ -147,6 +148,7 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 	inode->i_mode = mode;
 	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
 	inode->i_private = data;
+	inode->i_flags = flags;
 	if (S_ISDIR(mode)) {
 		inode->i_op = &simple_dir_inode_operations;
 		inode->i_fop = &simple_dir_operations;
@@ -197,7 +199,7 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
  * If securityfs is not enabled in the kernel, the value %-ENODEV is
  * returned.
  */
-struct dentry *securityfs_create_file(const char *name, umode_t mode,
+struct dentry *securityfs_create_file(const char *name, unsigned int mode,
 				      struct dentry *parent, void *data,
 				      const struct file_operations *fops)
 {

Christian Brauner Nov. 29, 2021, 2:22 p.m. UTC | #5

On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > On 11/29/21 07:50, James Bottomley wrote:
> > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
> > > > > Currently we get one entry in the IMA log per unique file
> > > > > event.  So, if you have a measurement policy and it measures a
> > > > > particular binary it will not get measured again if it is
> > > > > subsequently executed. For Namespaced IMA, the correct
> > > > > behaviour
> > > > > seems to be to log once per inode per namespace (so every
> > > > > unique
> > > > > execution in a namespace gets a separate log entry).  Since
> > > > > logging
> > > > > once per inode per namespace is
> > > > I suspect I'll need to do a more in depth reading of the existing
> > > > code, but I'll ask the lazy question anyway (since you say "the
> > > > correct behavior seems to be") - is it actually important that
> > > > files which were appraised under a parent namespace's policy
> > > > already
> > > > should be logged again?
> > > I think so.  For a couple of reasons, assuming the namespace
> > > eventually
> > > gets its own log entries, which the next incremental patch proposed
> > > to
> > > do by virtualizing the securityfs entries.  If you don't do this:
> > 
> > To avoid duplicate efforts, an implementation of a virtualized 
> > securityfs is in this series here:
> > 
> > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > 
> > It starts with 'securityfs: Prefix global variables with secruityfs_'
> 
> That's quite a big patch series.  I already actually implemented this
> as part of the RFC for getting the per namespace measurement log.  The
> attached is basically what I did.
> 
> Most of the time we don't require namespacing the actual virtualfs
> file, because it's world readable.  IMA has a special requirement in
> this regard because the IMA files should be readable (and writeable
> when we get around to policy updates) by the admin of the namespace but
> their protection is 0640 or 0440.  I thought the simplest solution
> would be to make an additional flag that coped with the permissions and
> a per-inode flag way of making the file as "accessible by userns
> admin".  Doing something simple like this gives a much smaller
> diffstat:

That's a NAK from me. Stefan's series might be bigger but it does things
correctly. I appreciate the keep it simple attitude but no. I won't
speciale-case securityfs or similar stuff in core vfs helpers.

Christian

Stefan Berger Nov. 29, 2021, 2:30 p.m. UTC | #6

On 11/29/21 09:10, James Bottomley wrote:
> On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
>> On 11/29/21 07:50, James Bottomley wrote:
>>> On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
>>>> On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
>>>>> Currently we get one entry in the IMA log per unique file
>>>>> event.  So, if you have a measurement policy and it measures a
>>>>> particular binary it will not get measured again if it is
>>>>> subsequently executed. For Namespaced IMA, the correct
>>>>> behaviour
>>>>> seems to be to log once per inode per namespace (so every
>>>>> unique
>>>>> execution in a namespace gets a separate log entry).  Since
>>>>> logging
>>>>> once per inode per namespace is
>>>> I suspect I'll need to do a more in depth reading of the existing
>>>> code, but I'll ask the lazy question anyway (since you say "the
>>>> correct behavior seems to be") - is it actually important that
>>>> files which were appraised under a parent namespace's policy
>>>> already
>>>> should be logged again?
>>> I think so.  For a couple of reasons, assuming the namespace
>>> eventually
>>> gets its own log entries, which the next incremental patch proposed
>>> to
>>> do by virtualizing the securityfs entries.  If you don't do this:
>> To avoid duplicate efforts, an implementation of a virtualized
>> securityfs is in this series here:
>>
>> https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
>>
>> It starts with 'securityfs: Prefix global variables with secruityfs_'
> That's quite a big patch series.  I already actually implemented this
> as part of the RFC for getting the per namespace measurement log.  The
> attached is basically what I did.

I know it's big. I tried to avoid having to bind-mount the system-wide 
single securityfs into the container and inherit all the other security 
subsystems' files and directories (evm, TPM, safesetid, apparmor, tomoyo 
[ https://elixir.bootlin.com/linux/latest/C/ident/securityfs_create_dir 
]) and instead have a  'view' that is a bit more restricted to those 
subsystems that are namespaced. The securityfs_ns I created can be 
mounted into each user namespace individually and only shows what you're 
supposed to see without other filesystem tricks to hide files or so. It 
should be future-extensible for other subsystem to register themselves 
there if they have something to show to the user.

    Stefan

James Bottomley Nov. 29, 2021, 2:46 p.m. UTC | #7

On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > On 11/29/21 07:50, James Bottomley wrote:
> > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
> > > > > wrote:
> > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > event.  So, if you have a measurement policy and it
> > > > > > measures a particular binary it will not get measured again
> > > > > > if it is subsequently executed. For Namespaced IMA, the
> > > > > > correct behaviour seems to be to log once per inode per
> > > > > > namespace (so every unique execution in a namespace gets a
> > > > > > separate log entry).  Since logging once per inode per
> > > > > > namespace is
> > > > > I suspect I'll need to do a more in depth reading of the
> > > > > existing code, but I'll ask the lazy question anyway (since
> > > > > you say "the correct behavior seems to be") - is it actually
> > > > > important that files which were appraised under a parent
> > > > > namespace's policy already should be logged again?
> > > > I think so.  For a couple of reasons, assuming the namespace
> > > > eventually gets its own log entries, which the next incremental
> > > > patch proposed to do by virtualizing the securityfs
> > > > entries.  If you don't do this:
> > > 
> > > To avoid duplicate efforts, an implementation of a virtualized 
> > > securityfs is in this series here:
> > > 
> > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > 
> > > It starts with 'securityfs: Prefix global variables with
> > > secruityfs_'
> > 
> > That's quite a big patch series.  I already actually implemented
> > this as part of the RFC for getting the per namespace measurement
> > log.  The attached is basically what I did.
> > 
> > Most of the time we don't require namespacing the actual virtualfs
> > file, because it's world readable.  IMA has a special requirement
> > in this regard because the IMA files should be readable (and
> > writeable when we get around to policy updates) by the admin of the
> > namespace but their protection is 0640 or 0440.  I thought the
> > simplest solution would be to make an additional flag that coped
> > with the permissions and a per-inode flag way of making the file as
> > "accessible by userns admin".  Doing something simple like this
> > gives a much smaller diffstat:
> 
> That's a NAK from me. Stefan's series might be bigger but it does
> things correctly. I appreciate the keep it simple attitude but no. I
> won't speciale-case securityfs or similar stuff in core vfs helpers.

Well, there's a reason it's an unpublished patch.  However, the more
important point is that namespacing IMA requires discussion of certain
points that we never seem to drive to a conclusion.  Using the akpm
method, I propose simple patches that drive the discussion.  I think
the points are:

   1. Should IMA be its own namespace or tied to the user namespace?  The
      previous patches all took the separate Namespace approach, but I
      think that should be reconsidered now keyrings are in the user
      namespace.
   2. How do we get a unique id for the IMA namespace to use in the log?
   3. how should we virtualize securityfs for IMA given the need of the
      namespace admin to read and write the IMA files?

And, of course, the fun ones we're coming to.

   1. Given that the current keyring namespacing doesn't give access to
      the system keyrings, how do we get per-namespace access for
      .ima/_ima system keyrings given that the namespace admin is going to
      want to set their own key for appraisal?
   2. What mechanism should we use for .ima/_ima key setting?  The current
      mechanism is must be signed by a key in the system keyrings sounds
      appropriate, but is problematic given most system owners don't
      actually have any private keys for keys in the system keyrings. 
      Hopefully the MoK keyring patches will help us have an easier
      approach to this.

James

James Bottomley Nov. 29, 2021, 3:08 p.m. UTC | #8

On Mon, 2021-11-29 at 09:30 -0500, Stefan Berger wrote:
> On 11/29/21 09:10, James Bottomley wrote:
> > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > On 11/29/21 07:50, James Bottomley wrote:
> > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
> > > > > wrote:
> > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > event.  So, if you have a measurement policy and it
> > > > > > measures a particular binary it will not get measured again
> > > > > > if it is subsequently executed. For Namespaced IMA, the
> > > > > > correct behaviour seems to be to log once per inode per
> > > > > > namespace (so every unique execution in a namespace gets a
> > > > > > separate log entry).  Since logging once per inode per
> > > > > > namespace is
> > > > > I suspect I'll need to do a more in depth reading of the
> > > > > existing code, but I'll ask the lazy question anyway (since
> > > > > you say "the correct behavior seems to be") - is it actually
> > > > > important that files which were appraised under a parent
> > > > > namespace's policy already should be logged again?
> > > > I think so.  For a couple of reasons, assuming the namespace
> > > > eventually gets its own log entries, which the next incremental
> > > > patch proposed to do by virtualizing the securityfs
> > > > entries.  If you don't do this:
> > > To avoid duplicate efforts, an implementation of a virtualized
> > > securityfs is in this series here:
> > > 
> > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > 
> > > It starts with 'securityfs: Prefix global variables with
> > > secruityfs_' 

> > That's quite a big patch series.  I already actually implemented
> > this as part of the RFC for getting the per namespace measurement
> > log.  The attached is basically what I did.
> 
> I know it's big. I tried to avoid having to bind-mount the system-
> wide single securityfs into the container and inherit all the other
> security subsystems' files and directories (evm, TPM, safesetid,
> apparmor, tomoyo  [ 
> https://elixir.bootlin.com/linux/latest/C/ident/securityfs_create_dir
>  
> ]) and instead have a  'view' that is a bit more restricted to those 
> subsystems that are namespaced. The securityfs_ns I created can be 
> mounted into each user namespace individually and only shows what
> you're supposed to see without other filesystem tricks to hide files
> or so. It should be future-extensible for other subsystem to register
> themselves there if they have something to show to the user.

Using F_USER_NS for this is certainly what it was designed for.  I
don't think size is a problem as long as it's right sized to perform
the required function.  I usually find it easier to oversimplify and
work up, but that's certainly not the only approach.

James

Stefan Berger Nov. 29, 2021, 3:27 p.m. UTC | #9

On 11/29/21 09:46, James Bottomley wrote:
> On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
>> On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
>>> On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
>>>> On 11/29/21 07:50, James Bottomley wrote:
>>>>> On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
>>>>>> On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
>>>>>> wrote:
>>>>>>> Currently we get one entry in the IMA log per unique file
>>>>>>> event.  So, if you have a measurement policy and it
>>>>>>> measures a particular binary it will not get measured again
>>>>>>> if it is subsequently executed. For Namespaced IMA, the
>>>>>>> correct behaviour seems to be to log once per inode per
>>>>>>> namespace (so every unique execution in a namespace gets a
>>>>>>> separate log entry).  Since logging once per inode per
>>>>>>> namespace is
>>>>>> I suspect I'll need to do a more in depth reading of the
>>>>>> existing code, but I'll ask the lazy question anyway (since
>>>>>> you say "the correct behavior seems to be") - is it actually
>>>>>> important that files which were appraised under a parent
>>>>>> namespace's policy already should be logged again?
>>>>> I think so.  For a couple of reasons, assuming the namespace
>>>>> eventually gets its own log entries, which the next incremental
>>>>> patch proposed to do by virtualizing the securityfs
>>>>> entries.  If you don't do this:
>>>> To avoid duplicate efforts, an implementation of a virtualized
>>>> securityfs is in this series here:
>>>>
>>>> https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
>>>>
>>>> It starts with 'securityfs: Prefix global variables with
>>>> secruityfs_'
>>> That's quite a big patch series.  I already actually implemented
>>> this as part of the RFC for getting the per namespace measurement
>>> log.  The attached is basically what I did.
>>>
>>> Most of the time we don't require namespacing the actual virtualfs
>>> file, because it's world readable.  IMA has a special requirement
>>> in this regard because the IMA files should be readable (and
>>> writeable when we get around to policy updates) by the admin of the
>>> namespace but their protection is 0640 or 0440.  I thought the
>>> simplest solution would be to make an additional flag that coped
>>> with the permissions and a per-inode flag way of making the file as
>>> "accessible by userns admin".  Doing something simple like this
>>> gives a much smaller diffstat:
>> That's a NAK from me. Stefan's series might be bigger but it does
>> things correctly. I appreciate the keep it simple attitude but no. I
>> won't speciale-case securityfs or similar stuff in core vfs helpers.
> Well, there's a reason it's an unpublished patch.  However, the more
> important point is that namespacing IMA requires discussion of certain
> points that we never seem to drive to a conclusion.  Using the akpm
> method, I propose simple patches that drive the discussion.  I think
> the points are:
[...]
>
>
> And, of course, the fun ones we're coming to.
>
>     1. Given that the current keyring namespacing doesn't give access to
>        the system keyrings, how do we get per-namespace access for
>        .ima/_ima system keyrings given that the namespace admin is going to
>        want to set their own key for appraisal?
>     2. What mechanism should we use for .ima/_ima key setting?  The current
>        mechanism is must be signed by a key in the system keyrings sounds
>        appropriate, but is problematic given most system owners don't
>        actually have any private keys for keys in the system keyrings.
>        Hopefully the MoK keyring patches will help us have an easier
>        approach to this.

The approach we took in the previous implementation was to support BYOK 
(bring your own key) for every container. The (trusted) container 
runtime has to do what dracut would typically do, namely create the 
keyrings and load the keys onto it.

The container runtime would

1a) expect keys to be found inside a container's filesystem at the usual 
location (/etc/keys/ima) but then also allow for a CA key that is used 
to verify the signature of those keys; that CA key is typically baked 
into the Linux kernel when .ima is to be used, but for containers and 
BYOK it's an additional file in the container's filesystem

1b) passing in keys via command line should be possible as well but 
that's an implementation detail

2) container runtime sets up either a restricted keyring [ 
https://elixir.bootlin.com/linux/latest/source/Documentation/crypto/asymmetric-keys.rst#L359 
] if that CA key is found in the filesystem or a 'normal' keyring. The 
container runtime then loads the keys onto that keyring; call that 
keyring '.ima' or '_ima' for as long as the kernel knows what keyring to 
search for. We created that keyring under a session keyring. With the 
user namespace isolation and keyrings support in the user namespace the 
isolation of the IMA related keyrings between different user namespaces 
should be possible.


The same would be done for the IMA policy where the container runtime 
also needs to do some work that usually dracut would do:

- expect the IMA policy for the container at the usual location 
(/etc/ima/ima-policy) and load it into the container's 'securityfs' 
policy file

    Stefan

Serge Hallyn Nov. 29, 2021, 3:35 p.m. UTC | #10

On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > > On 11/29/21 07:50, James Bottomley wrote:
> > > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
> > > > > > wrote:
> > > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > > event.  So, if you have a measurement policy and it
> > > > > > > measures a particular binary it will not get measured again
> > > > > > > if it is subsequently executed. For Namespaced IMA, the
> > > > > > > correct behaviour seems to be to log once per inode per
> > > > > > > namespace (so every unique execution in a namespace gets a
> > > > > > > separate log entry).  Since logging once per inode per
> > > > > > > namespace is
> > > > > > I suspect I'll need to do a more in depth reading of the
> > > > > > existing code, but I'll ask the lazy question anyway (since
> > > > > > you say "the correct behavior seems to be") - is it actually
> > > > > > important that files which were appraised under a parent
> > > > > > namespace's policy already should be logged again?
> > > > > I think so.  For a couple of reasons, assuming the namespace
> > > > > eventually gets its own log entries, which the next incremental
> > > > > patch proposed to do by virtualizing the securityfs
> > > > > entries.  If you don't do this:
> > > > 
> > > > To avoid duplicate efforts, an implementation of a virtualized 
> > > > securityfs is in this series here:
> > > > 
> > > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > > 
> > > > It starts with 'securityfs: Prefix global variables with
> > > > secruityfs_'
> > > 
> > > That's quite a big patch series.  I already actually implemented
> > > this as part of the RFC for getting the per namespace measurement
> > > log.  The attached is basically what I did.
> > > 
> > > Most of the time we don't require namespacing the actual virtualfs
> > > file, because it's world readable.  IMA has a special requirement
> > > in this regard because the IMA files should be readable (and
> > > writeable when we get around to policy updates) by the admin of the
> > > namespace but their protection is 0640 or 0440.  I thought the
> > > simplest solution would be to make an additional flag that coped
> > > with the permissions and a per-inode flag way of making the file as
> > > "accessible by userns admin".  Doing something simple like this
> > > gives a much smaller diffstat:
> > 
> > That's a NAK from me. Stefan's series might be bigger but it does
> > things correctly. I appreciate the keep it simple attitude but no. I
> > won't speciale-case securityfs or similar stuff in core vfs helpers.
> 
> Well, there's a reason it's an unpublished patch.  However, the more
> important point is that namespacing IMA requires discussion of certain
> points that we never seem to drive to a conclusion.  Using the akpm
> method, I propose simple patches that drive the discussion.  I think
> the points are:
> 
>    1. Should IMA be its own namespace or tied to the user namespace?  The
>       previous patches all took the separate Namespace approach, but I
>       think that should be reconsidered now keyrings are in the user
>       namespace.

Well that purely depends on the needed scope.

The audit container identifier is a neat thing.  But it absolutely must
be settable, so seems to conflict with your needs.

Your patch puts an identifier on the user_namespace.  I'm not quite sure,
does that satisfy Stefan's needs?  A new ima ns if and only if there is a
new user ns?

I think you two need to get together and discuss the requirements, and come
back with a brief but very precise document explaining what you need.

Are you both looking at the same use case?  Who is consuming the audit
log, and to what end?  Container administrators?  Any time they log in?
How do they assure themselves that the securityfs file they're reading
hasn't been overmounted?

I need to find a document to read about IMA's usage of PCRs.  For
namespacing, are you expecting each container to be hooked up to a
swtmp instance so they have their own PCR they can use?

>    2. How do we get a unique id for the IMA namespace to use in the log?
>    3. how should we virtualize securityfs for IMA given the need of the
>       namespace admin to read and write the IMA files?
> 
> And, of course, the fun ones we're coming to.
> 
>    1. Given that the current keyring namespacing doesn't give access to
>       the system keyrings, how do we get per-namespace access for
>       .ima/_ima system keyrings given that the namespace admin is going to
>       want to set their own key for appraisal?
>    2. What mechanism should we use for .ima/_ima key setting?  The current
>       mechanism is must be signed by a key in the system keyrings sounds
>       appropriate, but is problematic given most system owners don't
>       actually have any private keys for keys in the system keyrings. 
>       Hopefully the MoK keyring patches will help us have an easier
>       approach to this.
> 
> James
>

Stefan Berger Nov. 29, 2021, 4:07 p.m. UTC | #11

On 11/29/21 10:35, Serge E. Hallyn wrote:
> On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
>> On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
>>> On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
>>>> On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
>>>>> On 11/29/21 07:50, James Bottomley wrote:
>>>>>> On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
>>>>>>> On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
>>>>>>> wrote:
>>>>>>>> Currently we get one entry in the IMA log per unique file
>>>>>>>> event.  So, if you have a measurement policy and it
>>>>>>>> measures a particular binary it will not get measured again
>>>>>>>> if it is subsequently executed. For Namespaced IMA, the
>>>>>>>> correct behaviour seems to be to log once per inode per
>>>>>>>> namespace (so every unique execution in a namespace gets a
>>>>>>>> separate log entry).  Since logging once per inode per
>>>>>>>> namespace is
>>>>>>> I suspect I'll need to do a more in depth reading of the
>>>>>>> existing code, but I'll ask the lazy question anyway (since
>>>>>>> you say "the correct behavior seems to be") - is it actually
>>>>>>> important that files which were appraised under a parent
>>>>>>> namespace's policy already should be logged again?
>>>>>> I think so.  For a couple of reasons, assuming the namespace
>>>>>> eventually gets its own log entries, which the next incremental
>>>>>> patch proposed to do by virtualizing the securityfs
>>>>>> entries.  If you don't do this:
>>>>> To avoid duplicate efforts, an implementation of a virtualized
>>>>> securityfs is in this series here:
>>>>>
>>>>> https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
>>>>>
>>>>> It starts with 'securityfs: Prefix global variables with
>>>>> secruityfs_'
>>>> That's quite a big patch series.  I already actually implemented
>>>> this as part of the RFC for getting the per namespace measurement
>>>> log.  The attached is basically what I did.
>>>>
>>>> Most of the time we don't require namespacing the actual virtualfs
>>>> file, because it's world readable.  IMA has a special requirement
>>>> in this regard because the IMA files should be readable (and
>>>> writeable when we get around to policy updates) by the admin of the
>>>> namespace but their protection is 0640 or 0440.  I thought the
>>>> simplest solution would be to make an additional flag that coped
>>>> with the permissions and a per-inode flag way of making the file as
>>>> "accessible by userns admin".  Doing something simple like this
>>>> gives a much smaller diffstat:
>>> That's a NAK from me. Stefan's series might be bigger but it does
>>> things correctly. I appreciate the keep it simple attitude but no. I
>>> won't speciale-case securityfs or similar stuff in core vfs helpers.
>> Well, there's a reason it's an unpublished patch.  However, the more
>> important point is that namespacing IMA requires discussion of certain
>> points that we never seem to drive to a conclusion.  Using the akpm
>> method, I propose simple patches that drive the discussion.  I think
>> the points are:
>>
>>     1. Should IMA be its own namespace or tied to the user namespace?  The
>>        previous patches all took the separate Namespace approach, but I
>>        think that should be reconsidered now keyrings are in the user
>>        namespace.
> Well that purely depends on the needed scope.
>
> The audit container identifier is a neat thing.  But it absolutely must
> be settable, so seems to conflict with your needs.
>
> Your patch puts an identifier on the user_namespace.  I'm not quite sure,
> does that satisfy Stefan's needs?  A new ima ns if and only if there is a
> new user ns?
>
> I think you two need to get together and discuss the requirements, and come
> back with a brief but very precise document explaining what you need.

What would those want who look at audit messages? [Idk] Would they want 
a constant identifier for IMA audit messages in the audit log across all 
restarts of a container? Presumably that would make quick queries across 
restarts much easier. Or could they live with an audit message emitted 
from the container runtime indicating that this time the (IMA) audit 
messages from this container will have this UUID here?

I guess both would 'work.'

>
> Are you both looking at the same use case?  Who is consuming the audit
> log, and to what end?  Container administrators?  Any time they log in?
> How do they assure themselves that the securityfs file they're reading
> hasn't been overmounted?

The question is also should there only be one identifier or can there be 
two different one (one from audit patch series and uuid of user namespace).


>
> I need to find a document to read about IMA's usage of PCRs.  For
> namespacing, are you expecting each container to be hooked up to a
> swtmp instance so they have their own PCR they can use?

It's complicated and there's a bit more to this... I would try to 
architect it in a way that the IMA system policy can cover what's going 
on inside IMA namespaces, i.e., audit and measure and appraise file 
accesses occurring in those namespace. We call it hierarchical 
processing ( 
https://github.com/stefanberger/linux-ima-namespaces/commit/e88dc84ec97753fd65d302ee1bf03951001ab48f 
) where file access are evaluated against the current namespace's policy 
and then also evaluated against those of parent namespaces back to the 
init_ima_ns. The goal is to avoid evasion of measurements etc. by the 
user just by spawning new IMA namespaces. I think logging into the IMA 
system log will not scale well if there are hundreds of containers on 
the system using IMA and logging into the system log and hammering the 
TPM. So, the answer then is write your policy in such a way that it 
doesn't cover the IMA/user namespaces (containers) and have each 
container have its own IMA policy and IMA log and and an optional vTPM. 
So my answer would be 'optional swtpm.'

    Stefan

Christian Brauner Nov. 29, 2021, 4:16 p.m. UTC | #12

On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
> On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > > > On 11/29/21 07:50, James Bottomley wrote:
> > > > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
> > > > > > > wrote:
> > > > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > > > event.  So, if you have a measurement policy and it
> > > > > > > > measures a particular binary it will not get measured again
> > > > > > > > if it is subsequently executed. For Namespaced IMA, the
> > > > > > > > correct behaviour seems to be to log once per inode per
> > > > > > > > namespace (so every unique execution in a namespace gets a
> > > > > > > > separate log entry).  Since logging once per inode per
> > > > > > > > namespace is
> > > > > > > I suspect I'll need to do a more in depth reading of the
> > > > > > > existing code, but I'll ask the lazy question anyway (since
> > > > > > > you say "the correct behavior seems to be") - is it actually
> > > > > > > important that files which were appraised under a parent
> > > > > > > namespace's policy already should be logged again?
> > > > > > I think so.  For a couple of reasons, assuming the namespace
> > > > > > eventually gets its own log entries, which the next incremental
> > > > > > patch proposed to do by virtualizing the securityfs
> > > > > > entries.  If you don't do this:
> > > > > 
> > > > > To avoid duplicate efforts, an implementation of a virtualized 
> > > > > securityfs is in this series here:
> > > > > 
> > > > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > > > 
> > > > > It starts with 'securityfs: Prefix global variables with
> > > > > secruityfs_'
> > > > 
> > > > That's quite a big patch series.  I already actually implemented
> > > > this as part of the RFC for getting the per namespace measurement
> > > > log.  The attached is basically what I did.
> > > > 
> > > > Most of the time we don't require namespacing the actual virtualfs
> > > > file, because it's world readable.  IMA has a special requirement
> > > > in this regard because the IMA files should be readable (and
> > > > writeable when we get around to policy updates) by the admin of the
> > > > namespace but their protection is 0640 or 0440.  I thought the
> > > > simplest solution would be to make an additional flag that coped
> > > > with the permissions and a per-inode flag way of making the file as
> > > > "accessible by userns admin".  Doing something simple like this
> > > > gives a much smaller diffstat:
> > > 
> > > That's a NAK from me. Stefan's series might be bigger but it does
> > > things correctly. I appreciate the keep it simple attitude but no. I
> > > won't speciale-case securityfs or similar stuff in core vfs helpers.
> > 
> > Well, there's a reason it's an unpublished patch.  However, the more
> > important point is that namespacing IMA requires discussion of certain
> > points that we never seem to drive to a conclusion.  Using the akpm
> > method, I propose simple patches that drive the discussion.  I think
> > the points are:
> > 
> >    1. Should IMA be its own namespace or tied to the user namespace?  The
> >       previous patches all took the separate Namespace approach, but I
> >       think that should be reconsidered now keyrings are in the user
> >       namespace.
> 
> Well that purely depends on the needed scope.
> 
> The audit container identifier is a neat thing.  But it absolutely must
> be settable, so seems to conflict with your needs.
> 
> Your patch puts an identifier on the user_namespace.  I'm not quite sure,
> does that satisfy Stefan's needs?  A new ima ns if and only if there is a
> new user ns?

I kept thinking about this question while I was out running and while I
admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
to me that this is the right approach after all. Making it part of
userns at least in this form isn't clean.

I think attaching a uuid to a userns alone for the sake of IMA is wrong.
Additionally, I think a uuid only for the userns is too limited. This is
similar to the problem of the audit container id. If we have some sort
of uuid for ima it will automatically evolve into something like a
container id (I'm not even arguing that this is necessarily wrong.).
We also have the issue that we then have the container audit id thing -
if this ever lands and the ima userns uuid. All that makes it quite
messy.

I think CLONE_NEWIMA is ultimately nicer and allows the implementation
to be decoupled from the userns and self-contained as possible. This
also means that ima ns works for privileged containers which sure is a
valid use-case.
It will also make securityfs namespacing easier as it can be a keyed
super where the key is the ima ns (similar to how we deal with e.g.
mqueue).

> 
> I think you two need to get together and discuss the requirements, and come
> back with a brief but very precise document explaining what you need.

Good idea.

Christian Brauner Nov. 29, 2021, 4:20 p.m. UTC | #13

On Mon, Nov 29, 2021 at 09:30:00AM -0500, Stefan Berger wrote:
> 
> On 11/29/21 09:10, James Bottomley wrote:
> > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > On 11/29/21 07:50, James Bottomley wrote:
> > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley wrote:
> > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > event.  So, if you have a measurement policy and it measures a
> > > > > > particular binary it will not get measured again if it is
> > > > > > subsequently executed. For Namespaced IMA, the correct
> > > > > > behaviour
> > > > > > seems to be to log once per inode per namespace (so every
> > > > > > unique
> > > > > > execution in a namespace gets a separate log entry).  Since
> > > > > > logging
> > > > > > once per inode per namespace is
> > > > > I suspect I'll need to do a more in depth reading of the existing
> > > > > code, but I'll ask the lazy question anyway (since you say "the
> > > > > correct behavior seems to be") - is it actually important that
> > > > > files which were appraised under a parent namespace's policy
> > > > > already
> > > > > should be logged again?
> > > > I think so.  For a couple of reasons, assuming the namespace
> > > > eventually
> > > > gets its own log entries, which the next incremental patch proposed
> > > > to
> > > > do by virtualizing the securityfs entries.  If you don't do this:
> > > To avoid duplicate efforts, an implementation of a virtualized
> > > securityfs is in this series here:
> > > 
> > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > 
> > > It starts with 'securityfs: Prefix global variables with secruityfs_'
> > That's quite a big patch series.  I already actually implemented this
> > as part of the RFC for getting the per namespace measurement log.  The
> > attached is basically what I did.
> 
> I know it's big. I tried to avoid having to bind-mount the system-wide
> single securityfs into the container and inherit all the other security
> subsystems' files and directories (evm, TPM, safesetid, apparmor, tomoyo [
> https://elixir.bootlin.com/linux/latest/C/ident/securityfs_create_dir ]) and
> instead have a  'view' that is a bit more restricted to those subsystems
> that are namespaced. The securityfs_ns I created can be mounted into each
> user namespace individually and only shows what you're supposed to see
> without other filesystem tricks to hide files or so. It should be
> future-extensible for other subsystem to register themselves there if they
> have something to show to the user.

From a quick glance, your implementation is doing the right thing.
You're creating a keyed super via get_tree_keyed().

Christian Brauner Nov. 29, 2021, 4:23 p.m. UTC | #14

On Mon, Nov 29, 2021 at 05:16:56PM +0100, Christian Brauner wrote:
> On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
> > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > > > > On 11/29/21 07:50, James Bottomley wrote:
> > > > > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
> > > > > > > > wrote:
> > > > > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > > > > event.  So, if you have a measurement policy and it
> > > > > > > > > measures a particular binary it will not get measured again
> > > > > > > > > if it is subsequently executed. For Namespaced IMA, the
> > > > > > > > > correct behaviour seems to be to log once per inode per
> > > > > > > > > namespace (so every unique execution in a namespace gets a
> > > > > > > > > separate log entry).  Since logging once per inode per
> > > > > > > > > namespace is
> > > > > > > > I suspect I'll need to do a more in depth reading of the
> > > > > > > > existing code, but I'll ask the lazy question anyway (since
> > > > > > > > you say "the correct behavior seems to be") - is it actually
> > > > > > > > important that files which were appraised under a parent
> > > > > > > > namespace's policy already should be logged again?
> > > > > > > I think so.  For a couple of reasons, assuming the namespace
> > > > > > > eventually gets its own log entries, which the next incremental
> > > > > > > patch proposed to do by virtualizing the securityfs
> > > > > > > entries.  If you don't do this:
> > > > > > 
> > > > > > To avoid duplicate efforts, an implementation of a virtualized 
> > > > > > securityfs is in this series here:
> > > > > > 
> > > > > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > > > > 
> > > > > > It starts with 'securityfs: Prefix global variables with
> > > > > > secruityfs_'
> > > > > 
> > > > > That's quite a big patch series.  I already actually implemented
> > > > > this as part of the RFC for getting the per namespace measurement
> > > > > log.  The attached is basically what I did.
> > > > > 
> > > > > Most of the time we don't require namespacing the actual virtualfs
> > > > > file, because it's world readable.  IMA has a special requirement
> > > > > in this regard because the IMA files should be readable (and
> > > > > writeable when we get around to policy updates) by the admin of the
> > > > > namespace but their protection is 0640 or 0440.  I thought the
> > > > > simplest solution would be to make an additional flag that coped
> > > > > with the permissions and a per-inode flag way of making the file as
> > > > > "accessible by userns admin".  Doing something simple like this
> > > > > gives a much smaller diffstat:
> > > > 
> > > > That's a NAK from me. Stefan's series might be bigger but it does
> > > > things correctly. I appreciate the keep it simple attitude but no. I
> > > > won't speciale-case securityfs or similar stuff in core vfs helpers.
> > > 
> > > Well, there's a reason it's an unpublished patch.  However, the more
> > > important point is that namespacing IMA requires discussion of certain
> > > points that we never seem to drive to a conclusion.  Using the akpm
> > > method, I propose simple patches that drive the discussion.  I think
> > > the points are:
> > > 
> > >    1. Should IMA be its own namespace or tied to the user namespace?  The
> > >       previous patches all took the separate Namespace approach, but I
> > >       think that should be reconsidered now keyrings are in the user
> > >       namespace.
> > 
> > Well that purely depends on the needed scope.
> > 
> > The audit container identifier is a neat thing.  But it absolutely must
> > be settable, so seems to conflict with your needs.
> > 
> > Your patch puts an identifier on the user_namespace.  I'm not quite sure,
> > does that satisfy Stefan's needs?  A new ima ns if and only if there is a
> > new user ns?
> 
> I kept thinking about this question while I was out running and while I
> admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
> to me that this is the right approach after all. Making it part of
> userns at least in this form isn't clean.
> 
> I think attaching a uuid to a userns alone for the sake of IMA is wrong.
> Additionally, I think a uuid only for the userns is too limited. This is
> similar to the problem of the audit container id. If we have some sort
> of uuid for ima it will automatically evolve into something like a
> container id (I'm not even arguing that this is necessarily wrong.).
> We also have the issue that we then have the container audit id thing -
> if this ever lands and the ima userns uuid. All that makes it quite
> messy.
> 
> I think CLONE_NEWIMA is ultimately nicer and allows the implementation
> to be decoupled from the userns and self-contained as possible. This
> also means that ima ns works for privileged containers which sure is a
> valid use-case.
> It will also make securityfs namespacing easier as it can be a keyed
> super where the key is the ima ns (similar to how we deal with e.g.
> mqueue).

s/ima ns/userns/g

which is what Stefan already did in the link he shared.

James Bottomley Nov. 29, 2021, 4:23 p.m. UTC | #15

On Mon, 2021-11-29 at 10:27 -0500, Stefan Berger wrote:
> On 11/29/21 09:46, James Bottomley wrote:
[...]
> > And, of course, the fun ones we're coming to.
> > 
> >     1. Given that the current keyring namespacing doesn't give
> > access to the system keyrings, how do we get per-namespace access
> > for .ima/_ima system keyrings given that the namespace admin is
> > going to want to set their own key for appraisal?
> >     2. What mechanism should we use for .ima/_ima key setting?  The
> > current mechanism is must be signed by a key in the system keyrings
> > sounds appropriate, but is problematic given most system owners
> > don't actually have any private keys for keys in the system
> > keyrings. Hopefully the MoK keyring patches will help us have an
> > easier approach to this.
> 
> The approach we took in the previous implementation was to support
> BYOK (bring your own key) for every container. The (trusted)
> container runtime has to do what dracut would typically do, namely
> create the keyrings and load the keys onto it.

Right, that's why I think the staged approach is the right way to do
this.  Let's first of all try to get an IMA namespace that can be used
to measure things and then worry about trying to add appraisal to the
namespace later, since it's a separable use case.  I think they how of
appraisal is one of the more complex issues, but it shouldn't hold up
actually starting to get an IMA namespace upstream and encouraging use
cases.

> The container runtime would
> 
> 1a) expect keys to be found inside a container's filesystem at the
> usual location (/etc/keys/ima) but then also allow for a CA key that
> is used to verify the signature of those keys; that CA key is
> typically baked  into the Linux kernel when .ima is to be used, but
> for containers and BYOK it's an additional file in the container's
> filesystem
> 
> 1b) passing in keys via command line should be possible as well but 
> that's an implementation detail

I do think the namespace should use a mechanism that's as close as
possible to the one on the physical system, so it shouldn't depend on
the orchestration system doing anything (there might not even be one).

The current way you install a new ima key is by adding it as root to
_ima or by adding one signed by a key on the system keyring to .ima; I
think this mechanism should be usable inside the namespace as well ...
that's not to say there can't be other mechanisms, but mirroring the
existing mechanism should be part of this.

> 2) container runtime sets up either a restricted keyring [ 
> https://elixir.bootlin.com/linux/latest/source/Documentation/crypto/asymmetric-keys.rst#L359 
> ] if that CA key is found in the filesystem or a 'normal' keyring.
> The container runtime then loads the keys onto that keyring; call
> that keyring '.ima' or '_ima' for as long as the kernel knows what
> keyring to search for.

and this keyring should be owned by the admin of the IMA namespace.

>  We created that keyring under a session keyring. With
> the user namespace isolation and keyrings support in the user
> namespace the  isolation of the IMA related keyrings between
> different user namespaces should be possible.

Yes, I think so.  Really system keyrings are keyrings root can't create
but which are owned by root, so it can see them.  This mechanism should
carry over to the admin of the IMA namespace, so I see no reason why we
can't set up mirrors of .ima/_ima in the keyring (user) namespace.

Part of the reason why I think using the user namespace for the IMA
namespace is because this will have to be done on first create of a new
IMA namespace, so it's much easier if the keyring and IMA namespaces
are the same namespace.

> The same would be done for the IMA policy where the container
> runtime also needs to do some work that usually dracut would do:
> 
> - expect the IMA policy for the container at the usual location 
> (/etc/ima/ima-policy) and load it into the container's 'securityfs' 
> policy file

Right, since non system containers don't boot, we discount the kernel
command line option.  Once we have the securityfs namespaced for the
IMA files, doing the policy write should be trivial.  The next question
would be how to process it so that we don't cause a security breach
(like turn of measurement on a filesystem where the original policy
required it).

James

James Bottomley Nov. 29, 2021, 4:44 p.m. UTC | #16

On Mon, 2021-11-29 at 09:35 -0600, Serge E. Hallyn wrote:
> On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
[...]
> > Well, there's a reason it's an unpublished patch.  However, the
> > more important point is that namespacing IMA requires discussion of
> > certain points that we never seem to drive to a conclusion.  Using
> > the akpm method, I propose simple patches that drive the
> > discussion.  I think the points are:
> > 
> >    1. Should IMA be its own namespace or tied to the user
> > namespace?  The previous patches all took the separate Namespace
> > approach, but I think that should be reconsidered now keyrings are
> > in the user namespace.
> 
> Well that purely depends on the needed scope.
> 
> The audit container identifier is a neat thing.  But it absolutely
> must be settable, so seems to conflict with your needs.

I think not allowing duplicate entries for the lifetime of the log is
required, which causes a problem since namespaces can die before this
lifetime ends.  I think there is a nice security benefit in making it
not user settable, but I don't think that's necessarily a requirement.

> Your patch puts an identifier on the user_namespace.  I'm not quite
> sure, does that satisfy Stefan's needs?  A new ima ns if and only if
> there is a new user ns?

Part of the problem is that IMA needs an admin user ... to be able to
read the log and set the policy and, when we get to appraisal, set and
read the keyrings.  IMA NS iff user ns satisfies this, but the
minimalist in me then asks why not make them the same thing?

> I think you two need to get together and discuss the requirements,
> and come back with a brief but very precise document explaining what
> you need.

> Are you both looking at the same use case?  Who is consuming the
> audit log, and to what end?  Container administrators?  Any time they
> log in? How do they assure themselves that the securityfs file
> they're reading hasn't been overmounted?

There are several different use cases.  I'm asking how I would use the
IMA namespace for the unprivileged containers I usually set up by
hand.  Stefan is looking at how docker/kubernetes would do it.  There's
also the Huawei use case which is a sort of attestation for function as
a service and there's the Red Hat use case for OpenShift.

However, the common denominator in all of these is they require a way
to uniquely distinguish the containers, which is why the patch series I
sent as an RFC starts that way.  If we can start with the common
elements, we can build towards something that satisfies all the use
cases ... and allow consensus to emerge as further patches are
discussed.

Part of my problem is I don't really know what I need, I just want IMA
namespaces to work easily for the unprivileged use case and I'll figure
it out as I play with it ... but to do that I need something to start
playing with.

> I need to find a document to read about IMA's usage of PCRs.  For
> namespacing, are you expecting each container to be hooked up to a
> swtmp instance so they have their own PCR they can use?

That's one of the most complicated things of all: trying to attach a
vTPM to store the local extensions and quote the IMA log in the
namespace.  The Huawei patch doesn't have it, because they don't really
need it (they just want global attestation to work), the IBM patches
do.  I think there are many ways of attesting to the subset of the log
in the namespace, so I don't think a vTPM is required.  I do, however,
think it should be supported for the use cases that need it.

James

Stefan Berger Nov. 29, 2021, 5:04 p.m. UTC | #17

On 11/29/21 11:16, Christian Brauner wrote:
> On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
>> On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
>>> On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
>>>> On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
>>>>
> I kept thinking about this question while I was out running and while I
> admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
> to me that this is the right approach after all. Making it part of
> userns at least in this form isn't clean.
>
> I think attaching a uuid to a userns alone for the sake of IMA is wrong.
> Additionally, I think a uuid only for the userns is too limited. This is
> similar to the problem of the audit container id. If we have some sort
> of uuid for ima it will automatically evolve into something like a
> container id (I'm not even arguing that this is necessarily wrong.).
> We also have the issue that we then have the container audit id thing -
> if this ever lands and the ima userns uuid. All that makes it quite
> messy.
>
> I think CLONE_NEWIMA is ultimately nicer and allows the implementation
> to be decoupled from the userns and self-contained as possible. This
> also means that ima ns works for privileged containers which sure is a
> valid use-case.

The thing is that piggy backing on the user namespace at least allows us 
to 'always see' where IMA's keyring is (across setns()). If we were 
using an independent IMA namespace how would we guarantee that the user 
sees the keyring for IMA appraisal? We would at least have to take a 
reference (as in get_user_ns()) to the user namespace when the IMA 
namespace is created so that it at least the association of IMA 
namespace to user namespace remains a constant and the keyring IMA is 
using (and are held by the user namespace) is also constant across 
setns()'s. Otherwise it is possible that the user could do setns() to a 
different user namespace and be confused about the keys IMA is using. So 
at least by piggy backing we have them together. The aspect here may be 
'usability'.

I am somewhat sold on the USER namespace piggy backing thing... as 
suggested by James.

> It will also make securityfs namespacing easier as it can be a keyed
> super where the key is the ima ns (similar to how we deal with e.g.
> mqueue).

Yes, mqueue is where I got the (API usage) idea from how to switch out 
the reference to the user namespace needed for the 'keyed' approach.

I will massage my 20 patch-series a bit more and then post it (for 
reference....). It does have 'somewhat dramatic' things in there that 
stem from virtualizing securityfs for example and IMA being a dependent 
of the user namespace and taking one more reference to the user 
namespace (it is a dependent of) and then the user namespace not being 
able to easily delete:

It's this here to help with an early tear-down to decrease the reference 
counter.

- 
https://github.com/stefanberger/linux-ima-namespaces/commit/1a5d7f2598764ca6f1a8c5a391672543fef83f2c

- 
https://github.com/stefanberger/linux-ima-namespaces/commit/d246f501f977e64333ecbd8bb79994e23b552b9b

- 
https://github.com/stefanberger/linux-ima-namespaces/commit/3b82058936862d7623b3a06bc1749d5efc018ab1#diff-99458ca9139231ac3811dbb0c0fced442c46c7cfdb94e86e4553fc0329d3a79bR647-R651

The teardown variable makes this a controlled process but ... is it 
acceptable?

    Stefan

James Bottomley Nov. 29, 2021, 5:29 p.m. UTC | #18

On Mon, 2021-11-29 at 12:04 -0500, Stefan Berger wrote:
> On 11/29/21 11:16, Christian Brauner wrote:
[...]
> > I kept thinking about this question while I was out running and
> > while I admittedly have reacted poorly to CLONE_NEWIMA patches
> > before it feels to me that this is the right approach after all.
> > Making it part of userns at least in this form isn't clean.
> > 
> > I think attaching a uuid to a userns alone for the sake of IMA is
> > wrong. Additionally, I think a uuid only for the userns is too
> > limited. This is similar to the problem of the audit container id.
> > If we have some sort of uuid for ima it will automatically evolve
> > into something like a container id (I'm not even arguing that this
> > is necessarily wrong.). We also have the issue that we then have
> > the container audit id thing - if this ever lands and the ima
> > userns uuid. All that makes it quite messy.
> > 
> > I think CLONE_NEWIMA is ultimately nicer and allows the
> > implementation to be decoupled from the userns and self-contained
> > as possible.  This also means that ima ns works for privileged
> > containers which sure is a valid use-case.
> 
> The thing is that piggy backing on the user namespace at least allows
> us to 'always see' where IMA's keyring is (across setns()). If we
> were  using an independent IMA namespace how would we guarantee that
> the user sees the keyring for IMA appraisal? We would at least have
> to take a  reference (as in get_user_ns()) to the user namespace when
> the IMA namespace is created so that it at least the association of
> IMA namespace to user namespace remains a constant and the keyring
> IMA is using (and are held by the user namespace) is also constant
> across setns()'s. Otherwise it is possible that the user could do
> setns() to a different user namespace and be confused about the keys
> IMA is using. So at least by piggy backing we have them together. The
> aspect here may be 'usability'.
> 
> I am somewhat sold on the USER namespace piggy backing thing... as 
> suggested by James.

And to be clear, I don't think a separate IMA namespace is necessarily
wrong a priori.  However, all the original patch sets had the separate
namespace approach and I worry that the keyring namespace in user
namespace has forced our hand somewhat, so I wanted at least to post a
semi-usable patch set showing what would happen if the IMA namespace
were the user namespace for illustration so it would serve as a basis
for discussion.

James

Serge Hallyn Nov. 30, 2021, 4:42 a.m. UTC | #19

On Mon, Nov 29, 2021 at 11:07:18AM -0500, Stefan Berger wrote:
> 
> On 11/29/21 10:35, Serge E. Hallyn wrote:
> > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > > On Mon, 2021-11-29 at 08:53 -0500, Stefan Berger wrote:
> > > > > > On 11/29/21 07:50, James Bottomley wrote:
> > > > > > > On Sun, 2021-11-28 at 22:58 -0600, Serge E. Hallyn wrote:
> > > > > > > > On Sat, Nov 27, 2021 at 04:45:49PM +0000, James Bottomley
> > > > > > > > wrote:
> > > > > > > > > Currently we get one entry in the IMA log per unique file
> > > > > > > > > event.  So, if you have a measurement policy and it
> > > > > > > > > measures a particular binary it will not get measured again
> > > > > > > > > if it is subsequently executed. For Namespaced IMA, the
> > > > > > > > > correct behaviour seems to be to log once per inode per
> > > > > > > > > namespace (so every unique execution in a namespace gets a
> > > > > > > > > separate log entry).  Since logging once per inode per
> > > > > > > > > namespace is
> > > > > > > > I suspect I'll need to do a more in depth reading of the
> > > > > > > > existing code, but I'll ask the lazy question anyway (since
> > > > > > > > you say "the correct behavior seems to be") - is it actually
> > > > > > > > important that files which were appraised under a parent
> > > > > > > > namespace's policy already should be logged again?
> > > > > > > I think so.  For a couple of reasons, assuming the namespace
> > > > > > > eventually gets its own log entries, which the next incremental
> > > > > > > patch proposed to do by virtualizing the securityfs
> > > > > > > entries.  If you don't do this:
> > > > > > To avoid duplicate efforts, an implementation of a virtualized
> > > > > > securityfs is in this series here:
> > > > > > 
> > > > > > https://github.com/stefanberger/linux-ima-namespaces/commits/v5.15%2Bimans.20211119.v3
> > > > > > 
> > > > > > It starts with 'securityfs: Prefix global variables with
> > > > > > secruityfs_'
> > > > > That's quite a big patch series.  I already actually implemented
> > > > > this as part of the RFC for getting the per namespace measurement
> > > > > log.  The attached is basically what I did.
> > > > > 
> > > > > Most of the time we don't require namespacing the actual virtualfs
> > > > > file, because it's world readable.  IMA has a special requirement
> > > > > in this regard because the IMA files should be readable (and
> > > > > writeable when we get around to policy updates) by the admin of the
> > > > > namespace but their protection is 0640 or 0440.  I thought the
> > > > > simplest solution would be to make an additional flag that coped
> > > > > with the permissions and a per-inode flag way of making the file as
> > > > > "accessible by userns admin".  Doing something simple like this
> > > > > gives a much smaller diffstat:
> > > > That's a NAK from me. Stefan's series might be bigger but it does
> > > > things correctly. I appreciate the keep it simple attitude but no. I
> > > > won't speciale-case securityfs or similar stuff in core vfs helpers.
> > > Well, there's a reason it's an unpublished patch.  However, the more
> > > important point is that namespacing IMA requires discussion of certain
> > > points that we never seem to drive to a conclusion.  Using the akpm
> > > method, I propose simple patches that drive the discussion.  I think
> > > the points are:
> > > 
> > >     1. Should IMA be its own namespace or tied to the user namespace?  The
> > >        previous patches all took the separate Namespace approach, but I
> > >        think that should be reconsidered now keyrings are in the user
> > >        namespace.
> > Well that purely depends on the needed scope.
> > 
> > The audit container identifier is a neat thing.  But it absolutely must
> > be settable, so seems to conflict with your needs.
> > 
> > Your patch puts an identifier on the user_namespace.  I'm not quite sure,
> > does that satisfy Stefan's needs?  A new ima ns if and only if there is a
> > new user ns?
> > 
> > I think you two need to get together and discuss the requirements, and come
> > back with a brief but very precise document explaining what you need.
> 
> What would those want who look at audit messages? [Idk] Would they want a

Oh, I think you may have misunderstood me.  I meant you and James.  I
didn't mean that you shoule have a common implementation with the
container-audit patchset.

> constant identifier for IMA audit messages in the audit log across all
> restarts of a container? Presumably that would make quick queries across
> restarts much easier. Or could they live with an audit message emitted from
> the container runtime indicating that this time the (IMA) audit messages
> from this container will have this UUID here?
> 
> I guess both would 'work.'
> 
> > 
> > Are you both looking at the same use case?  Who is consuming the audit
> > log, and to what end?  Container administrators?  Any time they log in?
> > How do they assure themselves that the securityfs file they're reading
> > hasn't been overmounted?
> 
> The question is also should there only be one identifier or can there be two
> different one (one from audit patch series and uuid of user namespace).
> 
> 
> > 
> > I need to find a document to read about IMA's usage of PCRs.  For
> > namespacing, are you expecting each container to be hooked up to a
> > swtmp instance so they have their own PCR they can use?
> 
> It's complicated and there's a bit more to this... I would try to architect
> it in a way that the IMA system policy can cover what's going on inside IMA
> namespaces, i.e., audit and measure and appraise file accesses occurring in
> those namespace. We call it hierarchical processing ( https://github.com/stefanberger/linux-ima-namespaces/commit/e88dc84ec97753fd65d302ee1bf03951001ab48f
> ) where file access are evaluated against the current namespace's policy and
> then also evaluated against those of parent namespaces back to the
> init_ima_ns. The goal is to avoid evasion of measurements etc. by the user
> just by spawning new IMA namespaces. I think logging into the IMA system log
> will not scale well if there are hundreds of containers on the system using
> IMA and logging into the system log and hammering the TPM. So, the answer
> then is write your policy in such a way that it doesn't cover the IMA/user
> namespaces (containers) and have each container have its own IMA policy and
> IMA log and and an optional vTPM. So my answer would be 'optional swtpm.'
> 
>    Stefan
>

Serge Hallyn Nov. 30, 2021, 4:59 a.m. UTC | #20

On Mon, Nov 29, 2021 at 11:44:35AM -0500, James Bottomley wrote:
> On Mon, 2021-11-29 at 09:35 -0600, Serge E. Hallyn wrote:
> > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> [...]
> > > Well, there's a reason it's an unpublished patch.  However, the
> > > more important point is that namespacing IMA requires discussion of
> > > certain points that we never seem to drive to a conclusion.  Using
> > > the akpm method, I propose simple patches that drive the
> > > discussion.  I think the points are:
> > > 
> > >    1. Should IMA be its own namespace or tied to the user
> > > namespace?  The previous patches all took the separate Namespace
> > > approach, but I think that should be reconsidered now keyrings are
> > > in the user namespace.
> > 
> > Well that purely depends on the needed scope.
> > 
> > The audit container identifier is a neat thing.  But it absolutely
> > must be settable, so seems to conflict with your needs.
> 
> I think not allowing duplicate entries for the lifetime of the log is
> required, which causes a problem since namespaces can die before this
> lifetime ends.  I think there is a nice security benefit in making it
> not user settable, but I don't think that's necessarily a requirement.
> 
> > Your patch puts an identifier on the user_namespace.  I'm not quite
> > sure, does that satisfy Stefan's needs?  A new ima ns if and only if
> > there is a new user ns?
> 
> Part of the problem is that IMA needs an admin user ... to be able to
> read the log and set the policy and, when we get to appraisal, set and
> read the keyrings.  IMA NS iff user ns satisfies this, but the
> minimalist in me then asks why not make them the same thing?
> 
> > I think you two need to get together and discuss the requirements,
> > and come back with a brief but very precise document explaining what
> > you need.
> 
> > Are you both looking at the same use case?  Who is consuming the
> > audit log, and to what end?  Container administrators?  Any time they
> > log in? How do they assure themselves that the securityfs file
> > they're reading hasn't been overmounted?
> 
> There are several different use cases.  I'm asking how I would use the
> IMA namespace for the unprivileged containers I usually set up by
> hand.  Stefan is looking at how docker/kubernetes would do it.  There's
> also the Huawei use case which is a sort of attestation for function as
> a service and there's the Red Hat use case for OpenShift.
> 
> However, the common denominator in all of these is they require a way
> to uniquely distinguish the containers, which is why the patch series I
> sent as an RFC starts that way.  If we can start with the common
> elements, we can build towards something that satisfies all the use
> cases ... and allow consensus to emerge as further patches are
> discussed.

The reason I asked this question in response to this patch is because
here I'm not picking at the userns->uuid, but rather it's the new linked
lists for the inode that feel wrong.  So if you can get what you need
some other way - maybe just "we opened all these files and got no
integrity failure messages", or a hash table keyed on (userns *, inode *)
instead of the linked lists to look up whether an inode has been measured,
or some userspace daemon to resubmit already logged approvals, which I
gather won't work for unpriv containers - that would be nice.

> Part of my problem is I don't really know what I need, I just want IMA
> namespaces to work easily for the unprivileged use case and I'll figure
> it out as I play with it ... but to do that I need something to start
> playing with.

But for that kind of research you use an out of tree patchset, not
speculative infrastructure in the kernel.  If that's what this
patchset is, then I'll (feel a little silly and) look over it with a
different set of eyes :)

> > I need to find a document to read about IMA's usage of PCRs.  For
> > namespacing, are you expecting each container to be hooked up to a
> > swtmp instance so they have their own PCR they can use?
> 
> That's one of the most complicated things of all: trying to attach a
> vTPM to store the local extensions and quote the IMA log in the
> namespace.  The Huawei patch doesn't have it, because they don't really
> need it (they just want global attestation to work), the IBM patches
> do.  I think there are many ways of attesting to the subset of the log
> in the namespace, so I don't think a vTPM is required.  I do, however,
> think it should be supported for the use cases that need it.
> 
> James
>

Serge Hallyn Nov. 30, 2021, 5:03 a.m. UTC | #21

On Mon, Nov 29, 2021 at 12:04:29PM -0500, Stefan Berger wrote:
> 
> On 11/29/21 11:16, Christian Brauner wrote:
> > On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
> > > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > > > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > > 
> > I kept thinking about this question while I was out running and while I
> > admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
> > to me that this is the right approach after all. Making it part of
> > userns at least in this form isn't clean.
> > 
> > I think attaching a uuid to a userns alone for the sake of IMA is wrong.
> > Additionally, I think a uuid only for the userns is too limited. This is
> > similar to the problem of the audit container id. If we have some sort
> > of uuid for ima it will automatically evolve into something like a
> > container id (I'm not even arguing that this is necessarily wrong.).
> > We also have the issue that we then have the container audit id thing -
> > if this ever lands and the ima userns uuid. All that makes it quite
> > messy.
> > 
> > I think CLONE_NEWIMA is ultimately nicer and allows the implementation
> > to be decoupled from the userns and self-contained as possible. This
> > also means that ima ns works for privileged containers which sure is a
> > valid use-case.
> 
> The thing is that piggy backing on the user namespace at least allows us to
> 'always see' where IMA's keyring is (across setns()). If we were using an
> independent IMA namespace how would we guarantee that the user sees the
> keyring for IMA appraisal? We would at least have to take a reference (as in
> get_user_ns()) to the user namespace when the IMA namespace is created so
> that it at least the association of IMA namespace to user namespace remains

Maybe we pull they keyring info into a new struct which is referred
to and pinned by both user_ns and ima_ns?  (But I actually am ignorant
of how ima is using the keyrings, so again I need to go do some reading.)

More moving parts isn't my first choice.  But if you need namespaced IMA
for containers that aren't doing CLONE_NEWUSER, then a separate ima_ns is
your only option.  Is that a requirement for you?

Stefan Berger Nov. 30, 2021, 11:55 a.m. UTC | #22

On 11/30/21 00:03, Serge E. Hallyn wrote:
> On Mon, Nov 29, 2021 at 12:04:29PM -0500, Stefan Berger wrote:
>> On 11/29/21 11:16, Christian Brauner wrote:
>>> On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
>>>> On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
>>>>> On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
>>>>>> On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
>>>>>>
>>> I kept thinking about this question while I was out running and while I
>>> admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
>>> to me that this is the right approach after all. Making it part of
>>> userns at least in this form isn't clean.
>>>
>>> I think attaching a uuid to a userns alone for the sake of IMA is wrong.
>>> Additionally, I think a uuid only for the userns is too limited. This is
>>> similar to the problem of the audit container id. If we have some sort
>>> of uuid for ima it will automatically evolve into something like a
>>> container id (I'm not even arguing that this is necessarily wrong.).
>>> We also have the issue that we then have the container audit id thing -
>>> if this ever lands and the ima userns uuid. All that makes it quite
>>> messy.
>>>
>>> I think CLONE_NEWIMA is ultimately nicer and allows the implementation
>>> to be decoupled from the userns and self-contained as possible. This
>>> also means that ima ns works for privileged containers which sure is a
>>> valid use-case.
>> The thing is that piggy backing on the user namespace at least allows us to
>> 'always see' where IMA's keyring is (across setns()). If we were using an
>> independent IMA namespace how would we guarantee that the user sees the
>> keyring for IMA appraisal? We would at least have to take a reference (as in
>> get_user_ns()) to the user namespace when the IMA namespace is created so
>> that it at least the association of IMA namespace to user namespace remains
> Maybe we pull they keyring info into a new struct which is referred
> to and pinned by both user_ns and ima_ns?  (But I actually am ignorant
> of how ima is using the keyrings, so again I need to go do some reading.)
>
> More moving parts isn't my first choice.  But if you need namespaced IMA
> for containers that aren't doing CLONE_NEWUSER, then a separate ima_ns is
> your only option.  Is that a requirement for you?

I cannot think of a scenario where a user wouldn't want/refuse to open a 
user namespace to get an IMA namespace...

James Bottomley Nov. 30, 2021, 1 p.m. UTC | #23

On Mon, 2021-11-29 at 22:59 -0600, Serge E. Hallyn wrote:
> On Mon, Nov 29, 2021 at 11:44:35AM -0500, James Bottomley wrote:
> > On Mon, 2021-11-29 at 09:35 -0600, Serge E. Hallyn wrote:
> > > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > [...]
> > > > Well, there's a reason it's an unpublished patch.  However, the
> > > > more important point is that namespacing IMA requires
> > > > discussion of certain points that we never seem to drive to a
> > > > conclusion.  Using the akpm method, I propose simple patches
> > > > that drive the discussion.  I think the points are:
> > > > 
> > > >    1. Should IMA be its own namespace or tied to the user
> > > > namespace?  The previous patches all took the separate
> > > > Namespace approach, but I think that should be reconsidered now
> > > > keyrings are in the user namespace.
> > > 
> > > Well that purely depends on the needed scope.
> > > 
> > > The audit container identifier is a neat thing.  But it
> > > absolutely must be settable, so seems to conflict with your
> > > needs.
> > 
> > I think not allowing duplicate entries for the lifetime of the log
> > is required, which causes a problem since namespaces can die before
> > this lifetime ends.  I think there is a nice security benefit in
> > making it not user settable, but I don't think that's necessarily a
> > requirement.
> > 
> > > Your patch puts an identifier on the user_namespace.  I'm not
> > > quite sure, does that satisfy Stefan's needs?  A new ima ns if
> > > and only if there is a new user ns?
> > 
> > Part of the problem is that IMA needs an admin user ... to be able
> > to read the log and set the policy and, when we get to appraisal,
> > set and read the keyrings.  IMA NS iff user ns satisfies this, but
> > the minimalist in me then asks why not make them the same thing?
> > 
> > > I think you two need to get together and discuss the
> > > requirements, and come back with a brief but very precise
> > > document explaining what you need. Are you both looking at the
> > > same use case?  Who is consuming the audit log, and to what
> > > end?  Container administrators?  Any time they log in? How do
> > > they assure themselves that the securityfs file they're reading
> > > hasn't been overmounted?
> > 
> > There are several different use cases.  I'm asking how I would use
> > the IMA namespace for the unprivileged containers I usually set up
> > by hand.  Stefan is looking at how docker/kubernetes would do
> > it.  There's also the Huawei use case which is a sort of
> > attestation for function as a service and there's the Red Hat use
> > case for OpenShift.
> > 
> > However, the common denominator in all of these is they require a
> > way to uniquely distinguish the containers, which is why the patch
> > series I sent as an RFC starts that way.  If we can start with the
> > common elements, we can build towards something that satisfies all
> > the use cases ... and allow consensus to emerge as further patches
> > are discussed.
> 
> The reason I asked this question in response to this patch is because
> here I'm not picking at the userns->uuid, but rather it's the new
> linked lists for the inode that feel wrong.

Well that one's fully separable.  The first two patches could be
complete for this round.  However, if you believe there has to be one
entry per inode per namespace then the iint cache needs to accommodate
that.  The measurement is a function of the inode alone: if the inode
doesn't change that value is the same regardless of namespace.  it's
only whether it's been logged that's really per namespace, hence the
namespace list hanging off the iint entry ... if there were a huge
number of namespaces, perhaps it should be a btree instead of a list,
but it needs to be some type of two level thing, I think?

The design of patch 3 is mostly to get people to think about what
should be in the per namespace log.

>   So if you can get what you need some other way - maybe just "we
> opened all these files and got no integrity failure messages", or a
> hash table keyed on (userns *, inode *) instead of the linked lists
> to look up whether an inode has been measured, or some userspace
> daemon to resubmit already logged approvals, which I gather won't
> work for unpriv containers - that would be nice.

I could do a separate (userns *, inode *) btree for the measured part,
but we'd still have to have the per inode store of the measurement
because that's namespace blind.  Given this, it seems inefficient not
to make use of the tie between the two.

> > Part of my problem is I don't really know what I need, I just want
> > IMA namespaces to work easily for the unprivileged use case and
> > I'll figure it out as I play with it ... but to do that I need
> > something to start playing with.
> 
> But for that kind of research you use an out of tree patchset, not
> speculative infrastructure in the kernel.  If that's what this
> patchset is, then I'll (feel a little silly and) look over it with a
> different set of eyes :)

Well, no, you're looking with the right set of eyes.  The design of
this patch set is to begin incrementally with the pieces everyone can
agree on, so start as small as it can be: the namespace and a label as
a trial balloon.  If everyone had agreed it looked OK, there would be
no reason not to put it upstream and start on the next step.

James

Christian Brauner Nov. 30, 2021, 1:33 p.m. UTC | #24

On Tue, Nov 30, 2021 at 06:55:51AM -0500, Stefan Berger wrote:
> 
> On 11/30/21 00:03, Serge E. Hallyn wrote:
> > On Mon, Nov 29, 2021 at 12:04:29PM -0500, Stefan Berger wrote:
> > > On 11/29/21 11:16, Christian Brauner wrote:
> > > > On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
> > > > > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > > > > > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > > > > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > > > > 
> > > > I kept thinking about this question while I was out running and while I
> > > > admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
> > > > to me that this is the right approach after all. Making it part of
> > > > userns at least in this form isn't clean.
> > > > 
> > > > I think attaching a uuid to a userns alone for the sake of IMA is wrong.
> > > > Additionally, I think a uuid only for the userns is too limited. This is
> > > > similar to the problem of the audit container id. If we have some sort
> > > > of uuid for ima it will automatically evolve into something like a
> > > > container id (I'm not even arguing that this is necessarily wrong.).
> > > > We also have the issue that we then have the container audit id thing -
> > > > if this ever lands and the ima userns uuid. All that makes it quite
> > > > messy.
> > > > 
> > > > I think CLONE_NEWIMA is ultimately nicer and allows the implementation
> > > > to be decoupled from the userns and self-contained as possible. This
> > > > also means that ima ns works for privileged containers which sure is a
> > > > valid use-case.
> > > The thing is that piggy backing on the user namespace at least allows us to
> > > 'always see' where IMA's keyring is (across setns()). If we were using an
> > > independent IMA namespace how would we guarantee that the user sees the
> > > keyring for IMA appraisal? We would at least have to take a reference (as in
> > > get_user_ns()) to the user namespace when the IMA namespace is created so
> > > that it at least the association of IMA namespace to user namespace remains
> > Maybe we pull they keyring info into a new struct which is referred
> > to and pinned by both user_ns and ima_ns?  (But I actually am ignorant
> > of how ima is using the keyrings, so again I need to go do some reading.)
> > 
> > More moving parts isn't my first choice.  But if you need namespaced IMA
> > for containers that aren't doing CLONE_NEWUSER, then a separate ima_ns is
> > your only option.  Is that a requirement for you?
> 
> I cannot think of a scenario where a user wouldn't want/refuse to open a
> user namespace to get an IMA namespace...

The point is that unprivileged containers are still rare. And for quitea
few workloads they won't go away. Especially for a lot of Kubernetes
workloads - even if it finally has userns support - won't be able to
migrate to unprivileged. And I would think that they would want ima
support as well. But utlimately that's something the integrity folks
have to decide. If you decide that not having this feature for
non-CLONE_NEWUSER workloads then so be it.

Christian Brauner Nov. 30, 2021, 1:38 p.m. UTC | #25

On Mon, Nov 29, 2021 at 12:04:29PM -0500, Stefan Berger wrote:
> 
> On 11/29/21 11:16, Christian Brauner wrote:
> > On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
> > > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > > > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > > 
> > I kept thinking about this question while I was out running and while I
> > admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
> > to me that this is the right approach after all. Making it part of
> > userns at least in this form isn't clean.
> > 
> > I think attaching a uuid to a userns alone for the sake of IMA is wrong.
> > Additionally, I think a uuid only for the userns is too limited. This is
> > similar to the problem of the audit container id. If we have some sort
> > of uuid for ima it will automatically evolve into something like a
> > container id (I'm not even arguing that this is necessarily wrong.).
> > We also have the issue that we then have the container audit id thing -
> > if this ever lands and the ima userns uuid. All that makes it quite
> > messy.
> > 
> > I think CLONE_NEWIMA is ultimately nicer and allows the implementation
> > to be decoupled from the userns and self-contained as possible. This
> > also means that ima ns works for privileged containers which sure is a
> > valid use-case.
> 
> The thing is that piggy backing on the user namespace at least allows us to
> 'always see' where IMA's keyring is (across setns()). If we were using an
> independent IMA namespace how would we guarantee that the user sees the
> keyring for IMA appraisal? We would at least have to take a reference (as in
> get_user_ns()) to the user namespace when the IMA namespace is created so
> that it at least the association of IMA namespace to user namespace remains
> a constant and the keyring IMA is using (and are held by the user namespace)

I don't think that this needs to be any different from other namespaces
that have an owning userns.

> is also constant across setns()'s. Otherwise it is possible that the user
> could do setns() to a different user namespace and be confused about the
> keys IMA is using. So at least by piggy backing we have them together. The
> aspect here may be 'usability'.

I mean, we do already have a dependence between pid namespaces and user
namespace etc., i.e. before you can join a pidns as an unpriv user you
need to join the userns. I think we can easily introduce a dependency
there. (Note also that a while back I extended setns() to take a pidfd
as an argument and you can now specify setns(pidfd, CLONE_NEWUSER |
CLONE_NEWPID | CLONE_NEWIMA).)

It's even worse in a sense since we can joing CLONE_NEWUSER in different
order depending whether we're privileging or devprivileging ourselves
(see the messy logic in nsenter and new*idmap for example). So there's
precedence for requiring dependencies between namespaces during setns().

> 
> I am somewhat sold on the USER namespace piggy backing thing... as suggested
> by James.
> 
> 
> > It will also make securityfs namespacing easier as it can be a keyed
> > super where the key is the ima ns (similar to how we deal with e.g.
> > mqueue).
> 
> Yes, mqueue is where I got the (API usage) idea from how to switch out the
> reference to the user namespace needed for the 'keyed' approach.
> 
> I will massage my 20 patch-series a bit more and then post it (for
> reference....). It does have 'somewhat dramatic' things in there that stem
> from virtualizing securityfs for example and IMA being a dependent of the
> user namespace and taking one more reference to the user namespace (it is a
> dependent of) and then the user namespace not being able to easily delete:
> 
> It's this here to help with an early tear-down to decrease the reference
> counter.
> 
> - https://github.com/stefanberger/linux-ima-namespaces/commit/1a5d7f2598764ca6f1a8c5a391672543fef83f2c
> 
> - https://github.com/stefanberger/linux-ima-namespaces/commit/d246f501f977e64333ecbd8bb79994e23b552b9b
> 
> - https://github.com/stefanberger/linux-ima-namespaces/commit/3b82058936862d7623b3a06bc1749d5efc018ab1#diff-99458ca9139231ac3811dbb0c0fced442c46c7cfdb94e86e4553fc0329d3a79bR647-R651
> 
> The teardown variable makes this a controlled process but ... is it
> acceptable?

I'll try to take a look later this week if that's ok.

Christian Brauner Nov. 30, 2021, 1:44 p.m. UTC | #26

On Mon, Nov 29, 2021 at 11:03:17PM -0600, Serge Hallyn wrote:
> On Mon, Nov 29, 2021 at 12:04:29PM -0500, Stefan Berger wrote:
> > 
> > On 11/29/21 11:16, Christian Brauner wrote:
> > > On Mon, Nov 29, 2021 at 09:35:39AM -0600, Serge Hallyn wrote:
> > > > On Mon, Nov 29, 2021 at 09:46:55AM -0500, James Bottomley wrote:
> > > > > On Mon, 2021-11-29 at 15:22 +0100, Christian Brauner wrote:
> > > > > > On Mon, Nov 29, 2021 at 09:10:29AM -0500, James Bottomley wrote:
> > > > > > 
> > > I kept thinking about this question while I was out running and while I
> > > admittedly have reacted poorly to CLONE_NEWIMA patches before it feels
> > > to me that this is the right approach after all. Making it part of
> > > userns at least in this form isn't clean.
> > > 
> > > I think attaching a uuid to a userns alone for the sake of IMA is wrong.
> > > Additionally, I think a uuid only for the userns is too limited. This is
> > > similar to the problem of the audit container id. If we have some sort
> > > of uuid for ima it will automatically evolve into something like a
> > > container id (I'm not even arguing that this is necessarily wrong.).
> > > We also have the issue that we then have the container audit id thing -
> > > if this ever lands and the ima userns uuid. All that makes it quite
> > > messy.
> > > 
> > > I think CLONE_NEWIMA is ultimately nicer and allows the implementation
> > > to be decoupled from the userns and self-contained as possible. This
> > > also means that ima ns works for privileged containers which sure is a
> > > valid use-case.
> > 
> > The thing is that piggy backing on the user namespace at least allows us to
> > 'always see' where IMA's keyring is (across setns()). If we were using an
> > independent IMA namespace how would we guarantee that the user sees the
> > keyring for IMA appraisal? We would at least have to take a reference (as in
> > get_user_ns()) to the user namespace when the IMA namespace is created so
> > that it at least the association of IMA namespace to user namespace remains
> 
> Maybe we pull they keyring info into a new struct which is referred
> to and pinned by both user_ns and ima_ns?  (But I actually am ignorant
> of how ima is using the keyrings, so again I need to go do some reading.)

That's one way of doing it and we'd be able to shrink struct
user_namespace because of it if we have to go down that road anyway.

> 
> More moving parts isn't my first choice.  But if you need namespaced IMA
> for containers that aren't doing CLONE_NEWUSER, then a separate ima_ns is
> your only option.  Is that a requirement for you?

[RFC,3/3] ima: make the integrity inode cache per namespace

Commit Message

Comments

Patch