diff mbox

[CFT,00/10] Making new mounts of proc and sysfs as safe as bind mounts (take 2)

Message ID 87lhg8pwvz.fsf@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show

Commit Message

Eric W. Biederman May 28, 2015, 3:03 p.m. UTC
Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Quoting Andy Lutomirski (luto@amacapital.net):
>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>> > I had hoped to get some Tested-By's on that patch series.
>> 
>> Sorry, I've been totally swamped.
>> 
>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>> containers, but I'll see if I can test it for real this weekend.
>
> Testing this with unprivileged containers, I get
>
> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> - error mounting sysfs on
> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0

Grr..  I was afraid this would break something. :(

Looking at my system I see that sysfs is currently mounted
"nosuid,nodev,noexec"

Looking at the lxc-start code I don't see it as including any of those
mount options.  In practice for sysfs I think those options are
meaningless (as there should be no devices and nothing executable in
sysfs) but I can understand the past concerns with chmod on virtual
filesystems that would incline people to use them, so I think the
failure is reporting a legitimate security issue in the lxc userspace
code where the the unprivileged code is currently attempting to give
greater access to sysfs than was given by the original mount of sysfs.

As nosuid,nodev,noexec should not impair the operation of sysfs
operation it looks like you can always specify those options and just
make this concern go away.

Something like the untested patch below I expect.


Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andy Lutomirski May 28, 2015, 5:33 p.m. UTC | #1
On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
>> Quoting Andy Lutomirski (luto@amacapital.net):
>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>> > I had hoped to get some Tested-By's on that patch series.
>>>
>>> Sorry, I've been totally swamped.
>>>
>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>> containers, but I'll see if I can test it for real this weekend.
>>
>> Testing this with unprivileged containers, I get
>>
>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>> - error mounting sysfs on
>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>
> Grr..  I was afraid this would break something. :(
>
> Looking at my system I see that sysfs is currently mounted
> "nosuid,nodev,noexec"
>
> Looking at the lxc-start code I don't see it as including any of those
> mount options.  In practice for sysfs I think those options are
> meaningless (as there should be no devices and nothing executable in
> sysfs) but I can understand the past concerns with chmod on virtual
> filesystems that would incline people to use them, so I think the
> failure is reporting a legitimate security issue in the lxc userspace
> code where the the unprivileged code is currently attempting to give
> greater access to sysfs than was given by the original mount of sysfs.
>
> As nosuid,nodev,noexec should not impair the operation of sysfs
> operation it looks like you can always specify those options and just
> make this concern go away.

Linus is pretty strict about not breaking the ABI, and this definitely
counts as breaking the ABI.  There's an exception for security issues,
but is there really a security issue here?  That is, do we lose
anything important if we just drop the offending part of the patch
set?  As you've said, there shouldn't be sensitive device nodes,
executables, or setuid files in proc or sysfs in the first place.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kenton Varda May 28, 2015, 6:20 p.m. UTC | #2
On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>>
>>> Quoting Andy Lutomirski (luto@amacapital.net):
>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>> > I had hoped to get some Tested-By's on that patch series.
>>>>
>>>> Sorry, I've been totally swamped.
>>>>
>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>>> containers, but I'll see if I can test it for real this weekend.
>>>
>>> Testing this with unprivileged containers, I get
>>>
>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>>> - error mounting sysfs on
>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>>
>> Grr..  I was afraid this would break something. :(
>>
>> Looking at my system I see that sysfs is currently mounted
>> "nosuid,nodev,noexec"
>>
>> Looking at the lxc-start code I don't see it as including any of those
>> mount options.  In practice for sysfs I think those options are
>> meaningless (as there should be no devices and nothing executable in
>> sysfs) but I can understand the past concerns with chmod on virtual
>> filesystems that would incline people to use them, so I think the
>> failure is reporting a legitimate security issue in the lxc userspace
>> code where the the unprivileged code is currently attempting to give
>> greater access to sysfs than was given by the original mount of sysfs.
>>
>> As nosuid,nodev,noexec should not impair the operation of sysfs
>> operation it looks like you can always specify those options and just
>> make this concern go away.
>
> Linus is pretty strict about not breaking the ABI, and this definitely
> counts as breaking the ABI.  There's an exception for security issues,
> but is there really a security issue here?  That is, do we lose
> anything important if we just drop the offending part of the patch
> set?  As you've said, there shouldn't be sensitive device nodes,
> executables, or setuid files in proc or sysfs in the first place.

Speaking as a user of the mount() interfaces, I really think it would
be less confusing overall if mount() simply ignored the requested
flags when the caller doesn't have a choice. That is, in cases where
mount() currently fails with EPERM when not given, say, MS_NOSUID, it
should instead just pretend the caller actually set MS_NOSUID and go
ahead with a nosuid mount. Or put another way, the absence of
MS_NOSUID should not be interpreted as "remove the nosuid bit" but
rather "don't set the nosuid bit if not required".

Consider:

- This approach will actually cause lxc to have the correct behavior,
without any changes to lxc. I suspect that this generalizes: In the
vast majority of cases, when users have failed to set MS_NOSUID, it's
not because they are explicitly requesting that the flag be turned
off, but rather that they didn't know it mattered.

- If a user actually *does* expect not passing MS_NOSUID to remove the
nosuid bit, and they find instead that the nosuid bit is silently
kept, I don't think they'll be confused: it's pretty obvious in
context that this must be for security reasons.

- On the other hand, the current behavior *is* very confusing: mount()
returns EPERM because of rules the caller probably doesn't know
anything about. I've spent a fair amount of time frustrated by this
sort of thing.

-Kenton
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 28, 2015, 7:14 p.m. UTC | #3
Kenton Varda <kenton@sandstorm.io> writes:

> On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>>>
>>>> Quoting Andy Lutomirski (luto@amacapital.net):
>>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>>>>> <ebiederm@xmission.com> wrote:
>>>>> > I had hoped to get some Tested-By's on that patch series.
>>>>>
>>>>> Sorry, I've been totally swamped.
>>>>>
>>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
>>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
>>>>> containers, but I'll see if I can test it for real this weekend.
>>>>
>>>> Testing this with unprivileged containers, I get
>>>>
>>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>>>> - error mounting sysfs on
>>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>>>
>>> Grr..  I was afraid this would break something. :(
>>>
>>> Looking at my system I see that sysfs is currently mounted
>>> "nosuid,nodev,noexec"
>>>
>>> Looking at the lxc-start code I don't see it as including any of those
>>> mount options.  In practice for sysfs I think those options are
>>> meaningless (as there should be no devices and nothing executable in
>>> sysfs) but I can understand the past concerns with chmod on virtual
>>> filesystems that would incline people to use them, so I think the
>>> failure is reporting a legitimate security issue in the lxc userspace
>>> code where the the unprivileged code is currently attempting to give
>>> greater access to sysfs than was given by the original mount of sysfs.
>>>
>>> As nosuid,nodev,noexec should not impair the operation of sysfs
>>> operation it looks like you can always specify those options and just
>>> make this concern go away.
>>
>> Linus is pretty strict about not breaking the ABI, and this definitely
>> counts as breaking the ABI.  There's an exception for security issues,
>> but is there really a security issue here?  That is, do we lose
>> anything important if we just drop the offending part of the patch
>> set?  As you've said, there shouldn't be sensitive device nodes,
>> executables, or setuid files in proc or sysfs in the first place.

We do need to enforce retaining the existing mount flags one way or
another.  Where this really matters is with MS_RDONLY.  We don't want
any old user to be able to mount /proc read-write when root mounted it
read-only.  There is a very real attack vector there.  That attack
almost works in docker container today and is avoided simply because
docker mounts over a few files on proc.

Which leads to the second side of the reason for these changes.   I am
fixing a very small but long standing ABI break.   That is in some small
ways I broke some sandboxes and when I realized they were broken I could
not imagine think how to fix the code until now.

It is the goal that user namespaces don't introduce anything for people
to worry about security wise more than simply the ability to execute
more kernel code.  So at least when the kernel implementation is correct
developers of existing applications simply do not need care.  Sadly we are
not quite there yet.

> Speaking as a user of the mount() interfaces, I really think it would
> be less confusing overall if mount() simply ignored the requested
> flags when the caller doesn't have a choice. That is, in cases where
> mount() currently fails with EPERM when not given, say, MS_NOSUID, it
> should instead just pretend the caller actually set MS_NOSUID and go
> ahead with a nosuid mount. Or put another way, the absence of
> MS_NOSUID should not be interpreted as "remove the nosuid bit" but
> rather "don't set the nosuid bit if not required".

I am conflicted.  Implicits are nice but confusing.  If we can do
something reliable and robust and maintainable here that is truly worth
the cost I am all for it.

If I mount proc read-write I likely want to be able to write to proc
files, and I will be much happier if the mount fails than if a bazillion
syscalls later something else fails when it tries to write to proc.

> Consider:
>
> - This approach will actually cause lxc to have the correct behavior,
> without any changes to lxc. I suspect that this generalizes: In the
> vast majority of cases, when users have failed to set MS_NOSUID, it's
> not because they are explicitly requesting that the flag be turned
> off, but rather that they didn't know it mattered.
>
> - If a user actually *does* expect not passing MS_NOSUID to remove the
> nosuid bit, and they find instead that the nosuid bit is silently
> kept, I don't think they'll be confused: it's pretty obvious in
> context that this must be for security reasons.
>
> - On the other hand, the current behavior *is* very confusing: mount()
> returns EPERM because of rules the caller probably doesn't know
> anything about. I've spent a fair amount of time frustrated by this
> sort of thing.

My sympathies.  This all started with an oh crap we overlooked corner
case X and it actually matters, and the fixes were quite likely a little
bit hasty.  The only case where this really shows up is remount insode
of a user namespace of filesystems that were mounted outside of the user
namespace is where this all actually matters.  And mounting new
instances of proc and sysfs wind up being weird instances of that
nonsense.

But please someone test sandstorm with this patchset and tell me if it
bites you.  The impetus to find a way to avoid breaking slightly buggy
userspace is higher if it is more than unprivileged lxc that is broken.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kenton Varda May 28, 2015, 8:12 p.m. UTC | #4
On Thu, May 28, 2015 at 12:14 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> But please someone test sandstorm with this patchset and tell me if it
> bites you.  The impetus to find a way to avoid breaking slightly buggy
> userspace is higher if it is more than unprivileged lxc that is broken.

One of these days I'm going to learn how to compile and test kernels
again (last time I did it was 1999). Unfortunately I don't think I
have time at the moment, but hopefully Andy can do it.

I note, though, that we only have two mount() calls in the sandstorm
codebase that seem like they could be affected:

run-bundle.c++:1264: KJ_SYSCALL(mount("proc", "proc", "proc",
MS_NOSUID | MS_NODEV | MS_NOEXEC, ""));
minibox.c++:251: KJ_SYSCALL(mount("proc", vpath.cStr(), "proc",
MS_NOSUID | MS_NODEV | MS_NOEXEC, ""),
supervisor.c++:921: KJ_SYSCALL(mount("/proc", "proc", nullptr, MS_BIND
| MS_REC, nullptr));

The first two seem like they should be fine since they set all the
flags (except readonly, which would be inappropriate for proc). I
guess my habit of setting every security flag I see came in handy. The
third case looks like it will be broken, BUT this line is in a
debug-only code path, so I don't care. Also we have the ability to
push any needed update within 24 hours, so we're generally in good
shape.

We never mount sysfs in Sandstorm.

> If I mount proc read-write I likely want to be able to write to proc
> files, and I will be much happier if the mount fails than if a bazillion
> syscalls later something else fails when it tries to write to proc.

I'm not sure that's true. Consider the broader context:
1) Your system's /proc is mounted read-only.
2) Now you're trying to mount a new proc in a new pid namespace, and
you do *not* specify MS_READONLY.

What should we expect here? Let's back off a bit and state user intent:
1) The system administrator has set a system-wide policy that /proc
may only be read, not written.
2) You made a PID namespace and it needed its own proc.

It seems intuitive here that the administrator's policy should apply
in the namespace. Certainly everyone using the system and/or all
software on the system already needs to be aware of this policy, since
it's unusual and will break things. Running software on this system
outside of any container already has the problem that syscalls
randomly break, so why should it be surprising when this happens
inside the container as well? Why do we need to go out of our way to
break at mount() time?

-Kenton
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Weinberger May 28, 2015, 8:47 p.m. UTC | #5
Am 28.05.2015 um 22:12 schrieb Kenton Varda:
> We never mount sysfs in Sandstorm.

sysfs is ABI and applications depend on it.
Even glibc is using sysfs. Currently it has
fallback paths but these may go away...

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn May 28, 2015, 9:04 p.m. UTC | #6
On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> 
> > Quoting Andy Lutomirski (luto@amacapital.net):
> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >> > I had hoped to get some Tested-By's on that patch series.
> >> 
> >> Sorry, I've been totally swamped.
> >> 
> >> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >> containers, but I'll see if I can test it for real this weekend.
> >
> > Testing this with unprivileged containers, I get
> >
> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> > - error mounting sysfs on
> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> 
> Grr..  I was afraid this would break something. :(
> 
> Looking at my system I see that sysfs is currently mounted
> "nosuid,nodev,noexec"
> 
> Looking at the lxc-start code I don't see it as including any of those
> mount options.  In practice for sysfs I think those options are
> meaningless (as there should be no devices and nothing executable in
> sysfs) but I can understand the past concerns with chmod on virtual
> filesystems that would incline people to use them, so I think the
> failure is reporting a legitimate security issue in the lxc userspace
> code where the the unprivileged code is currently attempting to give
> greater access to sysfs than was given by the original mount of sysfs.
> 
> As nosuid,nodev,noexec should not impair the operation of sysfs
> operation it looks like you can always specify those options and just
> make this concern go away.
> 
> Something like the untested patch below I expect.
> 
> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> index 9870455b3cae..d9ccd03afe68 100644
> --- a/src/lxc/conf.c
> +++ b/src/lxc/conf.c
> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },

fwiw - the first one works, the second one does not due to an apparent
inability to statvfs the origin.

> Alternately you can read the flags off of the original mount of proc or sysfs.
> 
> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> index 9870455b3cae..50ea49973e80 100644
> --- a/src/lxc/conf.c
> +++ b/src/lxc/conf.c
> @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d,
>         struct statvfs sb;
>         unsigned long required_flags = 0;
>  
> -       if (!(flags & MS_REMOUNT))
> +       if (!(flags & MS_REMOUNT) &&
> +           (strcmp(s, "proc") != 0) &&
> +           (strcmp(s, "sysfs") != 0))
>                 return flags;
>  
>         if (!s)
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kenton Varda May 28, 2015, 9:07 p.m. UTC | #7
On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard@nod.at> wrote:
> Am 28.05.2015 um 22:12 schrieb Kenton Varda:
>> We never mount sysfs in Sandstorm.
>
> sysfs is ABI and applications depend on it.
> Even glibc is using sysfs. Currently it has
> fallback paths but these may go away...

Off-topic, but Sandstorm isn't intended to provide a full Linux ABI.
It is intended to provide a secure sandbox that can run apps that have
been explicitly ported to Sandstorm. More background if you're interested:

https://github.com/sandstorm-io/sandstorm/wiki/Security-Practices-Overview#server-sandboxing
https://blog.sandstorm.io/news/2014-08-13-sandbox-security.html

-Kenton
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Weinberger May 28, 2015, 9:12 p.m. UTC | #8
Am 28.05.2015 um 23:07 schrieb Kenton Varda:
> On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard@nod.at> wrote:
>> Am 28.05.2015 um 22:12 schrieb Kenton Varda:
>>> We never mount sysfs in Sandstorm.
>>
>> sysfs is ABI and applications depend on it.
>> Even glibc is using sysfs. Currently it has
>> fallback paths but these may go away...
> 
> Off-topic, but Sandstorm isn't intended to provide a full Linux ABI.
> It is intended to provide a secure sandbox that can run apps that have
> been explicitly ported to Sandstorm. More background if you're interested:

Ahh, the application needs to be Sandstorm aware.
I was missing that detail. Thanks for pointing that out!

Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 28, 2015, 9:42 p.m. UTC | #9
"Serge E. Hallyn" <serge@hallyn.com> writes:

> On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote:
>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>> 
>> > Quoting Andy Lutomirski (luto@amacapital.net):
>> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
>> >> <ebiederm@xmission.com> wrote:
>> >> > I had hoped to get some Tested-By's on that patch series.
>> >> 
>> >> Sorry, I've been totally swamped.
>> >> 
>> >> I suspect that Sandstorm is okay, but I haven't had a chance to test
>> >> it for real.  Sandstorm makes only limited use of proc and sysfs in
>> >> containers, but I'll see if I can test it for real this weekend.
>> >
>> > Testing this with unprivileged containers, I get
>> >
>> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
>> > - error mounting sysfs on
>> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
>> 
>> Grr..  I was afraid this would break something. :(
>> 
>> Looking at my system I see that sysfs is currently mounted
>> "nosuid,nodev,noexec"
>> 
>> Looking at the lxc-start code I don't see it as including any of those
>> mount options.  In practice for sysfs I think those options are
>> meaningless (as there should be no devices and nothing executable in
>> sysfs) but I can understand the past concerns with chmod on virtual
>> filesystems that would incline people to use them, so I think the
>> failure is reporting a legitimate security issue in the lxc userspace
>> code where the the unprivileged code is currently attempting to give
>> greater access to sysfs than was given by the original mount of sysfs.
>> 
>> As nosuid,nodev,noexec should not impair the operation of sysfs
>> operation it looks like you can always specify those options and just
>> make this concern go away.
>> 
>> Something like the untested patch below I expect.
>> 
>> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
>> index 9870455b3cae..d9ccd03afe68 100644
>> --- a/src/lxc/conf.c
>> +++ b/src/lxc/conf.c
>> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
>>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
>>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
>>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
>> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
>> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
>>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
>>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
>>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
>
> fwiw - the first one works, the second one does not due to an apparent
> inability to statvfs the origin.

Good to hear.  That confirms there are no other gotchas waiting in the
wings.

Apparently my second suggested patch is buggy due to an invalid source
string.  The source would need to be %r/proc or %r/sysfs to use statvfs
productively.


>> Alternately you can read the flags off of the original mount of proc or sysfs.
>> 
>> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
>> index 9870455b3cae..50ea49973e80 100644
>> --- a/src/lxc/conf.c
>> +++ b/src/lxc/conf.c
>> @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d,
>>         struct statvfs sb;
>>         unsigned long required_flags = 0;
>>  
>> -       if (!(flags & MS_REMOUNT))
>> +       if (!(flags & MS_REMOUNT) &&
>> +           (strcmp(s, "proc") != 0) &&
>> +           (strcmp(s, "sysfs") != 0))
>>                 return flags;
>>  
>>         if (!s)
>> 
>> Eric
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge E. Hallyn May 28, 2015, 9:52 p.m. UTC | #10
On Thu, May 28, 2015 at 04:42:34PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote:
> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >> 
> >> > Quoting Andy Lutomirski (luto@amacapital.net):
> >> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >> >> <ebiederm@xmission.com> wrote:
> >> >> > I had hoped to get some Tested-By's on that patch series.
> >> >> 
> >> >> Sorry, I've been totally swamped.
> >> >> 
> >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >> >> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >> >> containers, but I'll see if I can test it for real this weekend.
> >> >
> >> > Testing this with unprivileged containers, I get
> >> >
> >> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> >> > - error mounting sysfs on
> >> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> >> 
> >> Grr..  I was afraid this would break something. :(
> >> 
> >> Looking at my system I see that sysfs is currently mounted
> >> "nosuid,nodev,noexec"
> >> 
> >> Looking at the lxc-start code I don't see it as including any of those
> >> mount options.  In practice for sysfs I think those options are
> >> meaningless (as there should be no devices and nothing executable in
> >> sysfs) but I can understand the past concerns with chmod on virtual
> >> filesystems that would incline people to use them, so I think the
> >> failure is reporting a legitimate security issue in the lxc userspace
> >> code where the the unprivileged code is currently attempting to give
> >> greater access to sysfs than was given by the original mount of sysfs.
> >> 
> >> As nosuid,nodev,noexec should not impair the operation of sysfs
> >> operation it looks like you can always specify those options and just
> >> make this concern go away.
> >> 
> >> Something like the untested patch below I expect.
> >> 
> >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> >> index 9870455b3cae..d9ccd03afe68 100644
> >> --- a/src/lxc/conf.c
> >> +++ b/src/lxc/conf.c
> >> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
> >>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
> >>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
> >>  		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> >> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
> >> -		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
> >> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> >> +		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
> >>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
> >>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
> >>  		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
> >
> > fwiw - the first one works, the second one does not due to an apparent
> > inability to statvfs the origin.
> 
> Good to hear.  That confirms there are no other gotchas waiting in the
> wings.
> 
> Apparently my second suggested patch is buggy due to an invalid source
> string.  The source would need to be %r/proc or %r/sysfs to use statvfs
> productively.

Right, in these cases they're only passing in "sysfs".  The first way
is more explicit anyway (though may not help some people who have a
"lxc.mount.entry = sysfs sys sysfs ro 0 0" line in their configuration
instead, so maybe we'll have to go with the second after all, d'oh.
I'll have to look into it next week)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Lutomirski May 29, 2015, 12:35 a.m. UTC | #11
[resend due to HTML. Sorry.]


On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> Kenton Varda <kenton@sandstorm.io> writes:
>
> > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >>>
> >>>> Quoting Andy Lutomirski (luto@amacapital.net):
> >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman
> >>>>> <ebiederm@xmission.com> wrote:
> >>>>> > I had hoped to get some Tested-By's on that patch series.
> >>>>>
> >>>>> Sorry, I've been totally swamped.
> >>>>>
> >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test
> >>>>> it for real.  Sandstorm makes only limited use of proc and sysfs in
> >>>>> containers, but I'll see if I can test it for real this weekend.
> >>>>
> >>>> Testing this with unprivileged containers, I get
> >>>>
> >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted
> >>>> - error mounting sysfs on
> >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0
> >>>
> >>> Grr..  I was afraid this would break something. :(
> >>>
> >>> Looking at my system I see that sysfs is currently mounted
> >>> "nosuid,nodev,noexec"
> >>>
> >>> Looking at the lxc-start code I don't see it as including any of those
> >>> mount options.  In practice for sysfs I think those options are
> >>> meaningless (as there should be no devices and nothing executable in
> >>> sysfs) but I can understand the past concerns with chmod on virtual
> >>> filesystems that would incline people to use them, so I think the
> >>> failure is reporting a legitimate security issue in the lxc userspace
> >>> code where the the unprivileged code is currently attempting to give
> >>> greater access to sysfs than was given by the original mount of sysfs.
> >>>
> >>> As nosuid,nodev,noexec should not impair the operation of sysfs
> >>> operation it looks like you can always specify those options and just
> >>> make this concern go away.
> >>
> >> Linus is pretty strict about not breaking the ABI, and this definitely
> >> counts as breaking the ABI.  There's an exception for security issues,
> >> but is there really a security issue here?  That is, do we lose
> >> anything important if we just drop the offending part of the patch
> >> set?  As you've said, there shouldn't be sensitive device nodes,
> >> executables, or setuid files in proc or sysfs in the first place.
>
> We do need to enforce retaining the existing mount flags one way or
> another.  Where this really matters is with MS_RDONLY.  We don't want
> any old user to be able to mount /proc read-write when root mounted it
> read-only.  There is a very real attack vector there.  That attack
> almost works in docker container today and is avoided simply because
> docker mounts over a few files on proc.

You could drop the nosuid, noexec, and nodev changes and keep just the
ro part.  The ro part is probably not an ABI break in the sense of
something that actually breaks real programs.

>
> Which leads to the second side of the reason for these changes.   I am
> fixing a very small but long standing ABI break.   That is in some small
> ways I broke some sandboxes and when I realized they were broken I could
> not imagine think how to fix the code until now.
>
> It is the goal that user namespaces don't introduce anything for people
> to worry about security wise more than simply the ability to execute
> more kernel code.  So at least when the kernel implementation is correct
> developers of existing applications simply do not need care.  Sadly we are
> not quite there yet.
>
> > Speaking as a user of the mount() interfaces, I really think it would
> > be less confusing overall if mount() simply ignored the requested
> > flags when the caller doesn't have a choice. That is, in cases where
> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it
> > should instead just pretend the caller actually set MS_NOSUID and go
> > ahead with a nosuid mount. Or put another way, the absence of
> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but
> > rather "don't set the nosuid bit if not required".
>
> I am conflicted.  Implicits are nice but confusing.  If we can do
> something reliable and robust and maintainable here that is truly worth
> the cost I am all for it.
>
> If I mount proc read-write I likely want to be able to write to proc
> files, and I will be much happier if the mount fails than if a bazillion
> syscalls later something else fails when it tries to write to proc.

I agree.  I don't like the implicit thing.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman May 29, 2015, 4:36 a.m. UTC | #12
Andy Lutomirski <luto@amacapital.net> writes:
> On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> Kenton Varda <kenton@sandstorm.io> writes:
>>
>> We do need to enforce retaining the existing mount flags one way or
>> another.  Where this really matters is with MS_RDONLY.  We don't want
>> any old user to be able to mount /proc read-write when root mounted it
>> read-only.  There is a very real attack vector there.  That attack
>> almost works in docker container today and is avoided simply because
>> docker mounts over a few files on proc.
>
> You could drop the nosuid, noexec, and nodev changes and keep just the
> ro part.  The ro part is probably not an ABI break in the sense of
> something that actually breaks real programs.

As a change simply removing the code from the existing patches that
worries about nosuid, noexec, and the nodev flags is certainly doable.
It is the best proposal I have heard so far.

I remain unconvinced about ignoring those flags:
- There are clearly people who think it matters (or else proc and sysfs
  would not have those flags specified).

- There have been times when it actually has mattered.
  Aka when files like /proc/self/env could be chmodded and used for
  privilege escalation.

- The code in lxc and libvirt-lxc so far has been clearly buggy.
  * lxc only has problems with sysfs (in some configurations).
  * libvirt-lxc only has problems on a bind mount remount of
    proc after remounting proc properly.

So I am leaning towards enforcing all of the mount flags including
nosuid, noexec, and nodev.  Then when the next subtle bug in proc or
sysfs with respect to chmod shows up I will be able to sleep soundly at
night because the mount flags of those filesystems allow a mitigation,
and I did not sabatage the mitigation.

Plus contemplating code that just enforces a couple of mount flags but
not all of the feels wrong.

I don't think it is actually a maintainable position to just enforce a
couple of those flags.  If nothing else I would expect someone to look
at the code and to generate a bug fix to start enforcing the rest of the
flags.  Or perhaps it is in a few years time and something gets
refactored and the enforcing starts happening by virtue of using a new
common function that no-one realizes will be a problem.

Additionally if we don't enforce nosuid, noexec, and nodev people are
going to ask questions, that will be hard to explain.  When what is
truly desirable is to say that sysfs and proc are a little odd but they
don't allow anything that a bind mount won't.

I can be persuaded otherwise but right now I do think the kernel code
needs to enforce nosuid, noexec, and nodev as it is a security issue (if
only a defence in depth one), and a maintenance issue as I do not
believe in the long term it is a maintanable or an explicable position.

>> > Speaking as a user of the mount() interfaces, I really think it would
>> > be less confusing overall if mount() simply ignored the requested
>> > flags when the caller doesn't have a choice. That is, in cases where
>> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it
>> > should instead just pretend the caller actually set MS_NOSUID and go
>> > ahead with a nosuid mount. Or put another way, the absence of
>> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but
>> > rather "don't set the nosuid bit if not required".
>>
>> I am conflicted.  Implicits are nice but confusing.  If we can do
>> something reliable and robust and maintainable here that is truly worth
>> the cost I am all for it.
>>
>> If I mount proc read-write I likely want to be able to write to proc
>> files, and I will be much happier if the mount fails than if a bazillion
>> syscalls later something else fails when it tries to write to proc.
>
> I agree.  I don't like the implicit thing.

My memory returns of our last round of looking at this and for whatever
it's warts the existing mount API for remounting filesystems needs to
have the flags have exactly the same meaning as at mount time.  There
are existing userspace applications that depend on that behavior.

Implicits for only the locked mount flags is a little different but
still ick.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kenton Varda May 29, 2015, 4:54 a.m. UTC | #13
On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Implicits for only the locked mount flags is a little different but
> still ick.

FWIW, I only ever meant to advocate for this for locked flags, i.e.
cases where the only other option is to throw EPERM. Clearly when the
user has permission, the exact requested flags should be applied, or
all kinds of things break.

It seems to me that if we can fix the security issue without breaking
userspace, we should. Sometimes we end up with icky APIs to avoid
breaking userspace. (Though IMO implicitly preserving locked bits is
not icky at all.)

-Kenton
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Lutomirski May 29, 2015, 5:49 p.m. UTC | #14
On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>> On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>> Kenton Varda <kenton@sandstorm.io> writes:
>>>
>>> We do need to enforce retaining the existing mount flags one way or
>>> another.  Where this really matters is with MS_RDONLY.  We don't want
>>> any old user to be able to mount /proc read-write when root mounted it
>>> read-only.  There is a very real attack vector there.  That attack
>>> almost works in docker container today and is avoided simply because
>>> docker mounts over a few files on proc.
>>
>> You could drop the nosuid, noexec, and nodev changes and keep just the
>> ro part.  The ro part is probably not an ABI break in the sense of
>> something that actually breaks real programs.
>
> As a change simply removing the code from the existing patches that
> worries about nosuid, noexec, and the nodev flags is certainly doable.
> It is the best proposal I have heard so far.
>
> I remain unconvinced about ignoring those flags:
> - There are clearly people who think it matters (or else proc and sysfs
>   would not have those flags specified).
>
> - There have been times when it actually has mattered.
>   Aka when files like /proc/self/env could be chmodded and used for
>   privilege escalation.
>
> - The code in lxc and libvirt-lxc so far has been clearly buggy.
>   * lxc only has problems with sysfs (in some configurations).
>   * libvirt-lxc only has problems on a bind mount remount of
>     proc after remounting proc properly.
>
> So I am leaning towards enforcing all of the mount flags including
> nosuid, noexec, and nodev.  Then when the next subtle bug in proc or
> sysfs with respect to chmod shows up I will be able to sleep soundly at
> night because the mount flags of those filesystems allow a mitigation,
> and I did not sabatage the mitigation.

One option would be to break the nosuid, nodev, and noexec parts into
their own patch and then avoid tagging that patch for -stable if at
all possible.  It would be nice to avoid another -stable ABI break if
at all possible.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 3, 2015, 9:13 p.m. UTC | #15
Andy Lutomirski <luto@amacapital.net> writes:

> One option would be to break the nosuid, nodev, and noexec parts into
> their own patch and then avoid tagging that patch for -stable if at
> all possible.  It would be nice to avoid another -stable ABI break if
> at all possible.

So I don't think we actually have anything that could be called an ABI
break in the whole mess, but it is definitely a behavioral change that
is a regression for lxc and libvirt-lxc that prevents them from starting.

nodev does not actually matter because of the implicit silliness that
is being added right now.

We do want those programs fixed and after those programs are fixed we
can safely begin failing mount when those attributes are being cleared
in a fresh mount.

So it looks to me like the best thing to do is to print a warning
whenever lxc or libvirt-lxc gets it wrong, which should ensure the
authors are sufficiently pestered that in a kernel release or 3 we can
begin enforcing those attributes.  Especially as the discussion on the
fix for those applications has already begun.

And if folks would double check the patch I am going to post in a moment
to ensure that lxc and libvirt-lxc continue to start I would appreciate it.

Eric



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman June 4, 2015, 5:19 a.m. UTC | #16
On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
> 
> > One option would be to break the nosuid, nodev, and noexec parts into
> > their own patch and then avoid tagging that patch for -stable if at
> > all possible.  It would be nice to avoid another -stable ABI break if
> > at all possible.
> 
> So I don't think we actually have anything that could be called an ABI
> break in the whole mess, but it is definitely a behavioral change that
> is a regression for lxc and libvirt-lxc that prevents them from starting.
> 
> nodev does not actually matter because of the implicit silliness that
> is being added right now.
> 
> We do want those programs fixed and after those programs are fixed we
> can safely begin failing mount when those attributes are being cleared
> in a fresh mount.
> 
> So it looks to me like the best thing to do is to print a warning
> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
> authors are sufficiently pestered that in a kernel release or 3 we can
> begin enforcing those attributes.  Especially as the discussion on the
> fix for those applications has already begun.

"pestering" never works, look at some of the SCSI drivers for examples
of how a distro will just patch out the "warning this driver is using an
old api and needs to be fixed" messages.

You can't break stuff like this, people will get upset :(

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 4, 2015, 6:27 a.m. UTC | #17
Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@amacapital.net> writes:
>> 
>> > One option would be to break the nosuid, nodev, and noexec parts into
>> > their own patch and then avoid tagging that patch for -stable if at
>> > all possible.  It would be nice to avoid another -stable ABI break if
>> > at all possible.
>> 
>> So I don't think we actually have anything that could be called an ABI
>> break in the whole mess, but it is definitely a behavioral change that
>> is a regression for lxc and libvirt-lxc that prevents them from starting.
>> 
>> nodev does not actually matter because of the implicit silliness that
>> is being added right now.
>> 
>> We do want those programs fixed and after those programs are fixed we
>> can safely begin failing mount when those attributes are being cleared
>> in a fresh mount.
>> 
>> So it looks to me like the best thing to do is to print a warning
>> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
>> authors are sufficiently pestered that in a kernel release or 3 we can
>> begin enforcing those attributes.  Especially as the discussion on the
>> fix for those applications has already begun.
>
> "pestering" never works, look at some of the SCSI drivers for examples
> of how a distro will just patch out the "warning this driver is using an
> old api and needs to be fixed" messages.

> You can't break stuff like this, people will get upset :(

A) To the best of my knowledge there are two programs on the face of the
   planet where this matters. (lxc and libvirt-lxc)

B) The code in those two programs is buggy.  That is the code in those
   two programs does not do what the authors intended.  That is fixing
   those programs is something that should be done regardless of what
   I do in the kernel.  I have already reached out to the developers of
   those programs.  The pestering in the kernel is a form of reminder,
   not the primary source of communication.

C) These bugs really are security holes.  Currently they do not appear
   exploitable (thank goodness) but they are security holes.

   Since they are not currently exploitable it does make sense
   to give people a little time to get their act together.

   The bugs are larger then the case that is being hit here,
   this is just where they are noticed.

D) Letting people know that there is a problem as part of a larger
   effort has actually worked for me.  Distro's have stopped enabling
   the sysctl system call.

E) Given that I have not audited sysfs and proc closely in recent years
   I may actually be wrong.  Those bugs may actually be exploitable.
   All it takes is chmod to be supported on one file that can be made
   executable.  That bug has existed in the past and I don't doubt
   someone will overlook something and we will see the bug again in the
   future.

So it is my best judgment that I disable the code that stops
containers from starting and just making it a warning (for now).
Then in a release or so I start failing these operations instead of
warning.

This is the most fair and reasonable I can see to be.

The only other choice I can see is to say I don't care it is a security
issue I am breaking your sloopy insecure code.

Am I being too nice with these security bugs?

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman June 4, 2015, 7:34 a.m. UTC | #18
ebiederm@xmission.com (Eric W. Biederman) writes:

> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
>
>> On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
>>> Andy Lutomirski <luto@amacapital.net> writes:
>>> 
>>> > One option would be to break the nosuid, nodev, and noexec parts into
>>> > their own patch and then avoid tagging that patch for -stable if at
>>> > all possible.  It would be nice to avoid another -stable ABI break if
>>> > at all possible.
>>> 
>>> So I don't think we actually have anything that could be called an ABI
>>> break in the whole mess, but it is definitely a behavioral change that
>>> is a regression for lxc and libvirt-lxc that prevents them from starting.
>>> 
>>> nodev does not actually matter because of the implicit silliness that
>>> is being added right now.
>>> 
>>> We do want those programs fixed and after those programs are fixed we
>>> can safely begin failing mount when those attributes are being cleared
>>> in a fresh mount.
>>> 
>>> So it looks to me like the best thing to do is to print a warning
>>> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
>>> authors are sufficiently pestered that in a kernel release or 3 we can
>>> begin enforcing those attributes.  Especially as the discussion on the
>>> fix for those applications has already begun.
>>
>> "pestering" never works, look at some of the SCSI drivers for examples
>> of how a distro will just patch out the "warning this driver is using an
>> old api and needs to be fixed" messages.
>
>> You can't break stuff like this, people will get upset :(
>
> A) To the best of my knowledge there are two programs on the face of the
>    planet where this matters. (lxc and libvirt-lxc)
>
> B) The code in those two programs is buggy.  That is the code in those
>    two programs does not do what the authors intended.  That is fixing
>    those programs is something that should be done regardless of what
>    I do in the kernel.  I have already reached out to the developers of
>    those programs.  The pestering in the kernel is a form of reminder,
>    not the primary source of communication.
>
> C) These bugs really are security holes.  Currently they do not appear
>    exploitable (thank goodness) but they are security holes.
>
>    Since they are not currently exploitable it does make sense
>    to give people a little time to get their act together.
>
>    The bugs are larger then the case that is being hit here,
>    this is just where they are noticed.
>
> D) Letting people know that there is a problem as part of a larger
>    effort has actually worked for me.  Distro's have stopped enabling
>    the sysctl system call.
>
> E) Given that I have not audited sysfs and proc closely in recent years
>    I may actually be wrong.  Those bugs may actually be exploitable.
>    All it takes is chmod to be supported on one file that can be made
>    executable.  That bug has existed in the past and I don't doubt
>    someone will overlook something and we will see the bug again in the
>    future.
>
> So it is my best judgment that I disable the code that stops
> containers from starting and just making it a warning (for now).
> Then in a release or so I start failing these operations instead of
> warning.
>
> This is the most fair and reasonable I can see to be.
>
> The only other choice I can see is to say I don't care it is a security
> issue I am breaking your sloopy insecure code.
>
> Am I being too nice with these security bugs?

Thinking about it a little more.  There is a possibility that sometime
in the future that someone will deliberately add a suid executable as a
file in proc or sysfs and have a good reason for doing so.

Some sysadmin or sandbox builder with special requirements may then
disable suid and exec on proc because in their sandbox (not linux in
general) having access to that executable is a bad thing.  At which
we have an exploitable security issue if nosuid and noexec are not
enforced.

Or in other words I am not smarter than the bad guys.  This is a
security issue.  I can not ignore nosuid and noexec indefinitely.
I have to make those cases fail at some point.  At that point
current unfixed versions of lxc and libvirt-lxc will break.

A warning is the nicest I can imagine being.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel P. Berrangé June 16, 2015, 12:23 p.m. UTC | #19
On Thu, Jun 04, 2015 at 01:27:10AM -0500, Eric W. Biederman wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
> 
> > On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote:
> >> Andy Lutomirski <luto@amacapital.net> writes:
> >> 
> >> > One option would be to break the nosuid, nodev, and noexec parts into
> >> > their own patch and then avoid tagging that patch for -stable if at
> >> > all possible.  It would be nice to avoid another -stable ABI break if
> >> > at all possible.
> >> 
> >> So I don't think we actually have anything that could be called an ABI
> >> break in the whole mess, but it is definitely a behavioral change that
> >> is a regression for lxc and libvirt-lxc that prevents them from starting.
> >> 
> >> nodev does not actually matter because of the implicit silliness that
> >> is being added right now.
> >> 
> >> We do want those programs fixed and after those programs are fixed we
> >> can safely begin failing mount when those attributes are being cleared
> >> in a fresh mount.
> >> 
> >> So it looks to me like the best thing to do is to print a warning
> >> whenever lxc or libvirt-lxc gets it wrong, which should ensure the
> >> authors are sufficiently pestered that in a kernel release or 3 we can
> >> begin enforcing those attributes.  Especially as the discussion on the
> >> fix for those applications has already begun.
> >
> > "pestering" never works, look at some of the SCSI drivers for examples
> > of how a distro will just patch out the "warning this driver is using an
> > old api and needs to be fixed" messages.
> 
> > You can't break stuff like this, people will get upset :(
> 
> A) To the best of my knowledge there are two programs on the face of the
>    planet where this matters. (lxc and libvirt-lxc)
> 
> B) The code in those two programs is buggy.  That is the code in those
>    two programs does not do what the authors intended.  That is fixing
>    those programs is something that should be done regardless of what
>    I do in the kernel.  I have already reached out to the developers of
>    those programs.  The pestering in the kernel is a form of reminder,
>    not the primary source of communication.
> 
> C) These bugs really are security holes.  Currently they do not appear
>    exploitable (thank goodness) but they are security holes.
> 
>    Since they are not currently exploitable it does make sense
>    to give people a little time to get their act together.
> 
>    The bugs are larger then the case that is being hit here,
>    this is just where they are noticed.
> 
> D) Letting people know that there is a problem as part of a larger
>    effort has actually worked for me.  Distro's have stopped enabling
>    the sysctl system call.
> 
> E) Given that I have not audited sysfs and proc closely in recent years
>    I may actually be wrong.  Those bugs may actually be exploitable.
>    All it takes is chmod to be supported on one file that can be made
>    executable.  That bug has existed in the past and I don't doubt
>    someone will overlook something and we will see the bug again in the
>    future.
> 
> So it is my best judgment that I disable the code that stops
> containers from starting and just making it a warning (for now).
> Then in a release or so I start failing these operations instead of
> warning.
> 
> This is the most fair and reasonable I can see to be.

While I generally like & support the kernel standard that userspace must
never be broken, as libvirt LXC maintainer I think what Eric proposes is
acceptable from the libvirt POV.

We'll get the fix into libvirt LXC in this month's release and backport
it to our stable branches. So as long as there are a few months/releases
grace period between this being a kernel warning and it turning into a
hard error, libvirt users will have the fix already, or at least have it
easily available to them.

Regards,
Daniel
diff mbox

Patch

diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 9870455b3cae..d9ccd03afe68 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -770,8 +770,8 @@  static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger",                             "%r/proc/sysrq-trigger",        NULL,       MS_BIND,                        NULL },
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL,                                                "%r/proc/sysrq-trigger",        NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },
 		{ LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW,    "proc",                                              "%r/proc",                      "proc",     MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
-		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    0,                              NULL },
-		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_RDONLY,                      NULL },
+		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RW,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
+		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_RO,     "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL },
 		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "sysfs",                                             "%r/sys",                       "sysfs",    MS_NODEV|MS_NOEXEC|MS_NOSUID,   NULL },
 		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  "%r/sys",                                            "%r/sys",                       NULL,       MS_BIND,                        NULL },
 		{ LXC_AUTO_SYS_MASK,  LXC_AUTO_SYS_MIXED,  NULL,                                                "%r/sys",                       NULL,       MS_REMOUNT|MS_BIND|MS_RDONLY,   NULL },

Alternately you can read the flags off of the original mount of proc or sysfs.

diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 9870455b3cae..50ea49973e80 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -712,7 +712,9 @@  static unsigned long add_required_remount_flags(const char *s, const char *d,
        struct statvfs sb;
        unsigned long required_flags = 0;
 
-       if (!(flags & MS_REMOUNT))
+       if (!(flags & MS_REMOUNT) &&
+           (strcmp(s, "proc") != 0) &&
+           (strcmp(s, "sysfs") != 0))
                return flags;
 
        if (!s)