Message ID | 20250308180602.129663-1-parav@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | RDMA/uverbs: Fix CAP_NET_RAW check for flow create in user namespace | expand |
On Sat, Mar 08, 2025 at 08:06:02PM +0200, Parav Pandit wrote: > A process running in a non-init user namespace possesses the > CAP_NET_RAW capability. However, the patch cited in the fixes > tag checks the capability in the default init user namespace. > Because of this, when the process was started by Podman in a > non-default user namespace, the flow creation failed. > > Fix this issue by checking the CAP_NET_RAW networking capability > in the owner user namespace that created the network namespace. Hi, you say > Fix this issue by checking the CAP_NET_RAW networking capability > in the owner user namespace that created the network namespace. But in fact you are checking the CAP_NET_RAW against the user's network namespace. That is usually not the same thing, although it is possible that in this case it is. What is cmd.flow_id? Is that guaranteed to represent something in the current process' network namespace? Or is it possible that a user without privilege in his user namespace could unshare userns+netns but then cause this fn to be called against a flow in the original network namespace? > > This change is similar to the following cited patches. > > commit 5e1fccc0bfac ("net: Allow userns root control of the core of the network stack.") > commit 52e804c6dfaa ("net: Allow userns root to control ipv4") > commit 59cd7377660a ("net: openvswitch: allow conntrack in non-initial user namespace") > commit 0a3deb11858a ("fs: Allow listmount() in foreign mount namespace") > commit dd7cb142f467 ("fs: relax permissions for listmount()") > > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > Signed-off-by: Parav Pandit <parav@nvidia.com> > > --- > I would like to have feedback from the LSM experts to make sure this > fix is correct. Given the widespread usage of the capable() call, > it makes me wonder if the patch right. > > Secondly, I wasn't able to determine which primary namespace (such as > mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > (not directly related to this patch, but as concept) > --- > drivers/infiniband/core/uverbs_cmd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > index 5ad14c39d48c..8d6615f390f5 100644 > --- a/drivers/infiniband/core/uverbs_cmd.c > +++ b/drivers/infiniband/core/uverbs_cmd.c > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct uverbs_attr_bundle *attrs) > if (cmd.comp_mask) > return -EINVAL; > > - if (!capable(CAP_NET_RAW)) > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > return -EPERM; > > if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED) > -- > 2.26.2 >
Hi, > From: Serge E. Hallyn <serge@hallyn.com> > Sent: Monday, March 10, 2025 7:01 PM > > On Sat, Mar 08, 2025 at 08:06:02PM +0200, Parav Pandit wrote: > > A process running in a non-init user namespace possesses the > > CAP_NET_RAW capability. However, the patch cited in the fixes tag > > checks the capability in the default init user namespace. > > Because of this, when the process was started by Podman in a > > non-default user namespace, the flow creation failed. > > > > Fix this issue by checking the CAP_NET_RAW networking capability in > > the owner user namespace that created the network namespace. > > Hi, > > you say > > > Fix this issue by checking the CAP_NET_RAW networking capability > in the > owner user namespace that created the network namespace. > > But in fact you are checking the CAP_NET_RAW against the user's network > namespace. I didn't understand your comment. The fix takes the current process's network namespace by referring to current->nsproxy->net_ns. Each net ns has its owning user namespace who has created it. So the patch is checking caps in the such user namespace. Can you please elaborate? > That is usually not the same thing, although it is possible that in > this case it is. > > What is cmd.flow_id? Is that guaranteed to represent something in the > current process' network namespace? The flow_id is namespaced in the rdma device. A network namespace can have multiple devices. So it cannot be unique because an rdma device in different namespace may have same flow_id. > Or is it possible that a user without > privilege in his user namespace could unshare userns+netns but then cause > this fn to be called against a flow in the original network namespace? > It can do unshare user_ns+netns, but I fail to see it can refer to the original (previous net_ns) in this call, because this call is always refers to current->nsproxy->net_us, which means it operates on the latest net ns (after unshare). > > > > This change is similar to the following cited patches. > > > > commit 5e1fccc0bfac ("net: Allow userns root control of the core of > > the network stack.") commit 52e804c6dfaa ("net: Allow userns root to > > control ipv4") commit 59cd7377660a ("net: openvswitch: allow conntrack > > in non-initial user namespace") commit 0a3deb11858a ("fs: Allow > > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: > > relax permissions for listmount()") > > > > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > > Signed-off-by: Parav Pandit <parav@nvidia.com> > > > > --- > > I would like to have feedback from the LSM experts to make sure this > > fix is correct. Given the widespread usage of the capable() call, it > > makes me wonder if the patch right. > > > > Secondly, I wasn't able to determine which primary namespace (such as > > mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > > (not directly related to this patch, but as concept) > > --- > > drivers/infiniband/core/uverbs_cmd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/infiniband/core/uverbs_cmd.c > > b/drivers/infiniband/core/uverbs_cmd.c > > index 5ad14c39d48c..8d6615f390f5 100644 > > --- a/drivers/infiniband/core/uverbs_cmd.c > > +++ b/drivers/infiniband/core/uverbs_cmd.c > > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct > uverbs_attr_bundle *attrs) > > if (cmd.comp_mask) > > return -EINVAL; > > > > - if (!capable(CAP_NET_RAW)) > > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > > return -EPERM; > > > > if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED) > > -- > > 2.26.2 > >
Parav Pandit <parav@nvidia.com> writes: > A process running in a non-init user namespace possesses the > CAP_NET_RAW capability. However, the patch cited in the fixes > tag checks the capability in the default init user namespace. > Because of this, when the process was started by Podman in a > non-default user namespace, the flow creation failed. This change isn't a bug fix. This change is a relaxation of permissions and it would be very good if this change description described why it is in fact safe. Many parts of the kernel are not safe for arbitrary users to use. In those cases an ordinary capable like you found is used. > Fix this issue by checking the CAP_NET_RAW networking capability > in the owner user namespace that created the network namespace. > > This change is similar to the following cited patches. > > commit 5e1fccc0bfac ("net: Allow userns root control of the core of the network stack.") > commit 52e804c6dfaa ("net: Allow userns root to control ipv4") > commit 59cd7377660a ("net: openvswitch: allow conntrack in non-initial user namespace") > commit 0a3deb11858a ("fs: Allow listmount() in foreign mount namespace") > commit dd7cb142f467 ("fs: relax permissions for listmount()") It is different in that hardware is involved. There is a fair amount of kernel bypass allowed by design in infiniband so this may indeed be safe to allow any user on the system to do. Still for someone who isn't intimate with infiniband this isn't clear. > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > Signed-off-by: Parav Pandit <parav@nvidia.com> > > --- > I would like to have feedback from the LSM experts to make sure this > fix is correct. Given the widespread usage of the capable() call, > it makes me wonder if the patch right. > > Secondly, I wasn't able to determine which primary namespace (such as > mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > (not directly related to this patch, but as concept) I took a quick look and it appears that no one figures any of the CAP_IPC_LOCK capability checks are safe for anyone except the global root user. Allowing an arbitrary user to lock all of memory seems to defeat all of the safeguards that are in place to limiting memory locking. It looks like RLIMIT_MEMLOCK has been updated to be per user namespace (with hierachical limits), so I expect the most reasonable thing to do is to simply ensure the process that creates the user namespace has a large enough RLIMIT_MEMLOCK when the user namespace is created. > --- > drivers/infiniband/core/uverbs_cmd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c > index 5ad14c39d48c..8d6615f390f5 100644 > --- a/drivers/infiniband/core/uverbs_cmd.c > +++ b/drivers/infiniband/core/uverbs_cmd.c > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct uverbs_attr_bundle *attrs) > if (cmd.comp_mask) > return -EINVAL; > > - if (!capable(CAP_NET_RAW)) > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > return -EPERM; Looking at the code in drivers/infiniband/core/uverbs_cmd.c I don't think original capable call is actually correct. The problem is that infiniband runs commands through a file descriptor. Which means that anyone who can open the file descriptor and then obtain a program that will work like a suid cat can bypass the current permission check. Before we relax any checks that test needs to be: file_ns_capable(file, &init_user_ns, CAP_NET_RAW); Similarly the network namespace you are talking about in those infiniband commands really needs to be derived from the file descriptor instead of current. Those kinds of bugs seem very easy to find in the infiniband code so I have a hunch that the infiniband code needs some tender loving care before it is safe for unprivileged users to be able to do anything with it. In particular there was a whole lot of bug fixes and other work done to the mount namespace and in the networking stack before allowing unprivileged users to use it. In the ip part of the networking stack CAP_NET_RAW allows all kinds of things but when it is limited to only a single networking stack (one the user had to create) it becomes safe. I don't remember enough about infiniband to safe if those parts guarded with CAP_NET_RAW are safe in that way. Eric > > if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED)
> From: Eric W. Biederman <ebiederm@xmission.com> > Sent: Monday, March 10, 2025 9:59 PM > > Parav Pandit <parav@nvidia.com> writes: > > > A process running in a non-init user namespace possesses the > > CAP_NET_RAW capability. However, the patch cited in the fixes tag > > checks the capability in the default init user namespace. > > Because of this, when the process was started by Podman in a > > non-default user namespace, the flow creation failed. > > This change isn't a bug fix. This change is a relaxation of permissions and it > would be very good if this change description described why it is in fact safe. As you explained below, it is not safe enough. :) I will improve the change description to reflect as I follow your good suggestions below. > > Many parts of the kernel are not safe for arbitrary users > to use. In those cases an ordinary capable like you found > is used. > Understood now. > > Fix this issue by checking the CAP_NET_RAW networking capability in > > the owner user namespace that created the network namespace. > > > > This change is similar to the following cited patches. > > > > commit 5e1fccc0bfac ("net: Allow userns root control of the core of > > the network stack.") commit 52e804c6dfaa ("net: Allow userns root to > > control ipv4") commit 59cd7377660a ("net: openvswitch: allow conntrack > > in non-initial user namespace") commit 0a3deb11858a ("fs: Allow > > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: > > relax permissions for listmount()") > > It is different in that hardware is involved. There is a fair amount of kernel > bypass allowed by design in infiniband so this may indeed be safe to allow > any user on the system to do. Still for someone who isn't intimate with > infiniband this isn't clear. > > > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > > Signed-off-by: Parav Pandit <parav@nvidia.com> > > > > --- > > I would like to have feedback from the LSM experts to make sure this > > fix is correct. Given the widespread usage of the capable() call, it > > makes me wonder if the patch right. > > > > Secondly, I wasn't able to determine which primary namespace (such as > > mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > > (not directly related to this patch, but as concept) > > I took a quick look and it appears that no one figures any of the > CAP_IPC_LOCK capability checks are safe for anyone except the global root > user. > > Allowing an arbitrary user to lock all of memory seems to defeat all of the > safeguards that are in place to limiting memory locking. > > It looks like RLIMIT_MEMLOCK has been updated to be per user namespace > (with hierachical limits), so I expect the most reasonable thing to do is to > simply ensure the process that creates the user namespace has a large > enough RLIMIT_MEMLOCK when the user namespace is created. Ok, but if infiniband code does capable(), it is going to check the limit outside of the user namespace, and the call will still fails. Isn't it? May be the users in non init user ns must run their infiniband application without pinning the memory. Aka ODP in infiniband world. > > > --- > > drivers/infiniband/core/uverbs_cmd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/infiniband/core/uverbs_cmd.c > > b/drivers/infiniband/core/uverbs_cmd.c > > index 5ad14c39d48c..8d6615f390f5 100644 > > --- a/drivers/infiniband/core/uverbs_cmd.c > > +++ b/drivers/infiniband/core/uverbs_cmd.c > > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct > uverbs_attr_bundle *attrs) > > if (cmd.comp_mask) > > return -EINVAL; > > > > - if (!capable(CAP_NET_RAW)) > > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > > return -EPERM; > > Looking at the code in drivers/infiniband/core/uverbs_cmd.c > I don't think original capable call is actually correct. > > The problem is that infiniband runs commands through a file descriptor. > Which means that anyone who can open the file descriptor and then obtain a > program that will work like a suid cat can bypass the current permission > check. > > Before we relax any checks that test needs to be: > file_ns_capable(file, &init_user_ns, CAP_NET_RAW); > > Similarly the network namespace you are talking about in those infiniband > commands really needs to be derived from the file descriptor instead of > current. > This now start making sense to me. When the file descriptor is open, I need to record the net ns and use it for rest of the life cycle of the process (even if unshare(CLONE_NEWNET) is called) after opening the file. Something like how sk_alloc() does sock_net_set(sk, net); Do I understand you correctly? > Those kinds of bugs seem very easy to find in the infiniband code so I have a > hunch that the infiniband code needs some tender loving care before it is safe > for unprivileged users to be able to do anything with it. > Well, started to improve now... > In particular there was a whole lot of bug fixes and other work done to the > mount namespace and in the networking stack before allowing unprivileged > users to use it. > > In the ip part of the networking stack CAP_NET_RAW allows all kinds of things > but when it is limited to only a single networking stack (one the user had to > create) it becomes safe. I don't remember enough about infiniband to safe if > those parts guarded with CAP_NET_RAW are safe in that way. > Its restrictive use here as well in infiniband too for CAP_NET_RAW. > Eric > > > > > > if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED)
Parav Pandit <parav@nvidia.com> writes: >> From: Eric W. Biederman <ebiederm@xmission.com> >> Sent: Monday, March 10, 2025 9:59 PM >> >> Parav Pandit <parav@nvidia.com> writes: >> >> > A process running in a non-init user namespace possesses the >> > CAP_NET_RAW capability. However, the patch cited in the fixes tag >> > checks the capability in the default init user namespace. >> > Because of this, when the process was started by Podman in a >> > non-default user namespace, the flow creation failed. >> >> This change isn't a bug fix. This change is a relaxation of permissions and it >> would be very good if this change description described why it is in fact safe. > As you explained below, it is not safe enough. :) > I will improve the change description to reflect as I follow your good suggestions below. > >> >> Many parts of the kernel are not safe for arbitrary users >> to use. In those cases an ordinary capable like you found >> is used. >> > Understood now. > >> > Fix this issue by checking the CAP_NET_RAW networking capability in >> > the owner user namespace that created the network namespace. >> > >> > This change is similar to the following cited patches. >> > >> > commit 5e1fccc0bfac ("net: Allow userns root control of the core of >> > the network stack.") commit 52e804c6dfaa ("net: Allow userns root to >> > control ipv4") commit 59cd7377660a ("net: openvswitch: allow conntrack >> > in non-initial user namespace") commit 0a3deb11858a ("fs: Allow >> > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: >> > relax permissions for listmount()") >> >> It is different in that hardware is involved. There is a fair amount of kernel >> bypass allowed by design in infiniband so this may indeed be safe to allow >> any user on the system to do. Still for someone who isn't intimate with >> infiniband this isn't clear. >> >> > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") >> > Signed-off-by: Parav Pandit <parav@nvidia.com> >> > >> > --- >> > I would like to have feedback from the LSM experts to make sure this >> > fix is correct. Given the widespread usage of the capable() call, it >> > makes me wonder if the patch right. >> > >> > Secondly, I wasn't able to determine which primary namespace (such as >> > mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. >> > (not directly related to this patch, but as concept) >> >> I took a quick look and it appears that no one figures any of the >> CAP_IPC_LOCK capability checks are safe for anyone except the global root >> user. >> >> Allowing an arbitrary user to lock all of memory seems to defeat all of the >> safeguards that are in place to limiting memory locking. >> >> It looks like RLIMIT_MEMLOCK has been updated to be per user namespace >> (with hierachical limits), so I expect the most reasonable thing to do is to >> simply ensure the process that creates the user namespace has a large >> enough RLIMIT_MEMLOCK when the user namespace is created. > Ok, but if infiniband code does capable(), it is going to check the limit outside of the user namespace, and the call will still fails. > Isn't it? It depends on how the check is implemented. My point is that RLIMIT_MEMLOCK has all of the knobs you might need to do something. I don't know if the checks you are concerned about allow using RLIMIT_MEMLOCK. Given that some of them require having root in the initial user namespace they might make a lot of assumptions. But rlimits are related to but separate from capabilities. > May be the users in non init user ns must run their infiniband application without pinning the memory. > Aka ODP in infiniband world. That sounds right. I don't remember enough about infiniband to say for certain. Basically anything that uses ns_capable should be treated as something any user can do, and so you need to watch out for hostile users. >> > --- >> > drivers/infiniband/core/uverbs_cmd.c | 2 +- >> > 1 file changed, 1 insertion(+), 1 deletion(-) >> > >> > diff --git a/drivers/infiniband/core/uverbs_cmd.c >> > b/drivers/infiniband/core/uverbs_cmd.c >> > index 5ad14c39d48c..8d6615f390f5 100644 >> > --- a/drivers/infiniband/core/uverbs_cmd.c >> > +++ b/drivers/infiniband/core/uverbs_cmd.c >> > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct >> uverbs_attr_bundle *attrs) >> > if (cmd.comp_mask) >> > return -EINVAL; >> > >> > - if (!capable(CAP_NET_RAW)) >> > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) >> > return -EPERM; >> >> Looking at the code in drivers/infiniband/core/uverbs_cmd.c >> I don't think original capable call is actually correct. >> >> The problem is that infiniband runs commands through a file descriptor. >> Which means that anyone who can open the file descriptor and then obtain a >> program that will work like a suid cat can bypass the current permission >> check. >> >> Before we relax any checks that test needs to be: >> file_ns_capable(file, &init_user_ns, CAP_NET_RAW); >> > >> Similarly the network namespace you are talking about in those infiniband >> commands really needs to be derived from the file descriptor instead of >> current. >> > This now start making sense to me. > When the file descriptor is open, I need to record the net ns and use it for rest of the life cycle of the process (even if unshare(CLONE_NEWNET) is called) after opening the file. For the rest of the life cycle of the file descriptor. Don't forget that file descriptors can be passed between processes. > Something like how sk_alloc() does sock_net_set(sk, net); > > Do I understand you correctly? Yes. But first. The permission checks need to be fixed to use the cred cached on the file descriptor. So that the permission checks are not against the current process, but are against the process that opened the file descriptor. Otherwise a non-privileged process can open the file descriptor and trick another process with more permissions to write the values it wants to have written to the file descriptor. Usually that is accomplished by tricking some privileged application to write to stderr (that is passed from the attacker). Most of the time you can fix things like that using file_ns_capable. Other times you encounter a userspace program that breaks and something else needs to happen. >> Those kinds of bugs seem very easy to find in the infiniband code so I have a >> hunch that the infiniband code needs some tender loving care before it is safe >> for unprivileged users to be able to do anything with it. >> > Well, started to improve now... No fault if you haven't lots of code that only root could use no one takes seriously with respect to security issues, so it tends to be buggy. I have cleaned up a lot of code over the years before I have relaxed the permissions. Eric
On Mon, Mar 10, 2025 at 02:47:53PM +0000, Parav Pandit wrote: > Hi, > > > From: Serge E. Hallyn <serge@hallyn.com> > > Sent: Monday, March 10, 2025 7:01 PM > > > > On Sat, Mar 08, 2025 at 08:06:02PM +0200, Parav Pandit wrote: > > > A process running in a non-init user namespace possesses the > > > CAP_NET_RAW capability. However, the patch cited in the fixes tag > > > checks the capability in the default init user namespace. > > > Because of this, when the process was started by Podman in a > > > non-default user namespace, the flow creation failed. > > > > > > Fix this issue by checking the CAP_NET_RAW networking capability in > > > the owner user namespace that created the network namespace. > > > > Hi, > > > > you say > > > > > Fix this issue by checking the CAP_NET_RAW networking capability > in the > > owner user namespace that created the network namespace. > > > > But in fact you are checking the CAP_NET_RAW against the user's network > > namespace. > I didn't understand your comment. > The fix takes the current process's network namespace by referring to current->nsproxy->net_ns. > Each net ns has its owning user namespace who has created it. > So the patch is checking caps in the such user namespace. > > Can you please elaborate? It looks like it got straightened out later with Eric's reply. Please let me know if that's not the case. -serge
> From: Eric W. Biederman <ebiederm@xmission.com> > Sent: Monday, March 10, 2025 11:44 PM > > Parav Pandit <parav@nvidia.com> writes: > > >> From: Eric W. Biederman <ebiederm@xmission.com> > >> Sent: Monday, March 10, 2025 9:59 PM > >> > >> Parav Pandit <parav@nvidia.com> writes: > >> > >> > A process running in a non-init user namespace possesses the > >> > CAP_NET_RAW capability. However, the patch cited in the fixes tag > >> > checks the capability in the default init user namespace. > >> > Because of this, when the process was started by Podman in a > >> > non-default user namespace, the flow creation failed. > >> > >> This change isn't a bug fix. This change is a relaxation of > >> permissions and it would be very good if this change description described > why it is in fact safe. > > As you explained below, it is not safe enough. :) I will improve the > > change description to reflect as I follow your good suggestions below. > > > >> > >> Many parts of the kernel are not safe for arbitrary users > >> to use. In those cases an ordinary capable like you found > >> is used. > >> > > Understood now. > > > >> > Fix this issue by checking the CAP_NET_RAW networking capability in > >> > the owner user namespace that created the network namespace. > >> > > >> > This change is similar to the following cited patches. > >> > > >> > commit 5e1fccc0bfac ("net: Allow userns root control of the core of > >> > the network stack.") commit 52e804c6dfaa ("net: Allow userns root > >> > to control ipv4") commit 59cd7377660a ("net: openvswitch: allow > >> > conntrack in non-initial user namespace") commit 0a3deb11858a ("fs: > >> > Allow > >> > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: > >> > relax permissions for listmount()") > >> > >> It is different in that hardware is involved. There is a fair amount > >> of kernel bypass allowed by design in infiniband so this may indeed > >> be safe to allow any user on the system to do. Still for someone who > >> isn't intimate with infiniband this isn't clear. > >> > >> > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > >> > Signed-off-by: Parav Pandit <parav@nvidia.com> > >> > > >> > --- > >> > I would like to have feedback from the LSM experts to make sure > >> > this fix is correct. Given the widespread usage of the capable() > >> > call, it makes me wonder if the patch right. > >> > > >> > Secondly, I wasn't able to determine which primary namespace (such > >> > as mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > >> > (not directly related to this patch, but as concept) > >> > >> I took a quick look and it appears that no one figures any of the > >> CAP_IPC_LOCK capability checks are safe for anyone except the global > >> root user. > >> > >> Allowing an arbitrary user to lock all of memory seems to defeat all > >> of the safeguards that are in place to limiting memory locking. > >> > >> It looks like RLIMIT_MEMLOCK has been updated to be per user > >> namespace (with hierachical limits), so I expect the most reasonable > >> thing to do is to simply ensure the process that creates the user > >> namespace has a large enough RLIMIT_MEMLOCK when the user > namespace is created. > > Ok, but if infiniband code does capable(), it is going to check the limit > outside of the user namespace, and the call will still fails. > > Isn't it? > > It depends on how the check is implemented. My point is that > RLIMIT_MEMLOCK has all of the knobs you might need to do something. > > I don't know if the checks you are concerned about allow using > RLIMIT_MEMLOCK. Given that some of them require having root in the initial > user namespace they might make a lot of assumptions. > > But rlimits are related to but separate from capabilities. > Ok. I will take this task separately. > > May be the users in non init user ns must run their infiniband application > without pinning the memory. > > Aka ODP in infiniband world. > > That sounds right. I don't remember enough about infiniband to say for > certain. > > Basically anything that uses ns_capable should be treated as something any > user can do, and so you need to watch out for hostile users. > Got it. > >> > --- > >> > drivers/infiniband/core/uverbs_cmd.c | 2 +- > >> > 1 file changed, 1 insertion(+), 1 deletion(-) > >> > > >> > diff --git a/drivers/infiniband/core/uverbs_cmd.c > >> > b/drivers/infiniband/core/uverbs_cmd.c > >> > index 5ad14c39d48c..8d6615f390f5 100644 > >> > --- a/drivers/infiniband/core/uverbs_cmd.c > >> > +++ b/drivers/infiniband/core/uverbs_cmd.c > >> > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct > >> uverbs_attr_bundle *attrs) > >> > if (cmd.comp_mask) > >> > return -EINVAL; > >> > > >> > - if (!capable(CAP_NET_RAW)) > >> > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > >> > return -EPERM; > >> > >> Looking at the code in drivers/infiniband/core/uverbs_cmd.c > >> I don't think original capable call is actually correct. > >> > >> The problem is that infiniband runs commands through a file descriptor. > >> Which means that anyone who can open the file descriptor and then > >> obtain a program that will work like a suid cat can bypass the > >> current permission check. > >> > >> Before we relax any checks that test needs to be: > >> file_ns_capable(file, &init_user_ns, CAP_NET_RAW); > >> > > > >> Similarly the network namespace you are talking about in those > >> infiniband commands really needs to be derived from the file > >> descriptor instead of current. > >> > > This now start making sense to me. > > When the file descriptor is open, I need to record the net ns and use it for > rest of the life cycle of the process (even if unshare(CLONE_NEWNET) is > called) after opening the file. > > For the rest of the life cycle of the file descriptor. Don't forget that file > descriptors can be passed between processes. > Right. We have seen increasing use of this in rdma applications in recent years. > > Something like how sk_alloc() does sock_net_set(sk, net); > > > > Do I understand you correctly? > > Yes. > > But first. The permission checks need to be fixed to use the cred cached on > the file descriptor. So that the permission checks are not against the current > process, but are against the process that opened the file descriptor. > I am fixing this in the v1. Was trying to find the subsequent patch after the fix on what things to take care of. > Otherwise a non-privileged process can open the file descriptor and trick > another process with more permissions to write the values it wants to have > written to the file descriptor. Usually that is accomplished by tricking some > privileged application to write to stderr (that is passed from the attacker). > > Most of the time you can fix things like that using file_ns_capable. > Other times you encounter a userspace program that breaks and something > else needs to happen. > > >> Those kinds of bugs seem very easy to find in the infiniband code so > >> I have a hunch that the infiniband code needs some tender loving care > >> before it is safe for unprivileged users to be able to do anything with it. > >> > > Well, started to improve now... > > No fault if you haven't lots of code that only root could use no one takes > seriously with respect to security issues, so it tends to be buggy. I have > cleaned up a lot of code over the years before I have relaxed the permissions. > Great feedback, thank you Eric for the direction. I will roll v1 for the fix you suggested and utilize that same infrastructure to honor the non-default user ns. > Eric
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 5ad14c39d48c..8d6615f390f5 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct uverbs_attr_bundle *attrs) if (cmd.comp_mask) return -EINVAL; - if (!capable(CAP_NET_RAW)) + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) return -EPERM; if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED)
A process running in a non-init user namespace possesses the CAP_NET_RAW capability. However, the patch cited in the fixes tag checks the capability in the default init user namespace. Because of this, when the process was started by Podman in a non-default user namespace, the flow creation failed. Fix this issue by checking the CAP_NET_RAW networking capability in the owner user namespace that created the network namespace. This change is similar to the following cited patches. commit 5e1fccc0bfac ("net: Allow userns root control of the core of the network stack.") commit 52e804c6dfaa ("net: Allow userns root to control ipv4") commit 59cd7377660a ("net: openvswitch: allow conntrack in non-initial user namespace") commit 0a3deb11858a ("fs: Allow listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: relax permissions for listmount()") Fixes: c938a616aadb ("IB/core: Add raw packet QP type") Signed-off-by: Parav Pandit <parav@nvidia.com> --- I would like to have feedback from the LSM experts to make sure this fix is correct. Given the widespread usage of the capable() call, it makes me wonder if the patch right. Secondly, I wasn't able to determine which primary namespace (such as mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. (not directly related to this patch, but as concept) --- drivers/infiniband/core/uverbs_cmd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)