[RFC] Add a "nolinks" mount option.

Message ID 1476455305-35554-1-git-send-email-mnissler@chromium.org (mailing list archive)
State New, archived

Commit Message

Mattias Nissler Oct. 14, 2016, 2:28 p.m. UTC
For mounts that have the new "nolinks" option, don't follow symlinks
and refuse to open files with a hard link count larger than one. The
new option is similar in spirit to the existing "nodev", "noexec", and
"nosuid" options.

Note that symlinks and hard links may still be created on mounts where
the "nolinks" option is present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly. Similarly, hard-linked files can be identified from
userspace using stat() output while the "nolinks" option is set.

Setting the "nolinks" mount option helps prevent privileged writers
from modifying files unintentionally in case there is an unexpected
link along the accessed path. The "nolinks" option is thus useful as a
defensive measure against persistent exploits (i.e. a system getting
re-exploited after a reboot) for systems that employ a read-only or
dm-verity-protected rootfs. These systems prevent non-legit binaries
from running after reboot. However, legit code typically still reads
from and writes to a writable file system previously under full
control of the attacker, who can place symlinks to trick file writes
after reboot to target a file of their choice. "nolinks" fundamentally
prevents this.

Signed-off-by: Mattias Nissler <mnissler@chromium.org>
---
 fs/namei.c              | 9 ++++++++-
 fs/namespace.c          | 8 +++++---
 fs/proc_namespace.c     | 1 +
 include/linux/mount.h   | 3 ++-
 include/uapi/linux/fs.h | 1 +
 5 files changed, 17 insertions(+), 5 deletions(-)

Comments

Al Viro Oct. 14, 2016, 2:55 p.m. UTC | #1
On Fri, Oct 14, 2016 at 04:28:25PM +0200, Mattias Nissler wrote:
> For mounts that have the new "nolinks" option, don't follow symlinks
> and reject to open files with a hard link count larger than one. The
> new option is similar in spirit to the existing "nodev", "noexec", and
> "nosuid" options.
> 
> Note that symlinks and hard links may still be created on mounts where
> the "nolinks" option is present. readlink() remains functional, so
> user space code that is aware of symlinks can still choose to follow
> them explicitly. Similarly, hard-linked files can be identified from
> userspace using stat() output while the "nolinks" option is set.
> 
> Setting the "nolinks" mount option helps prevent privileged writers
> from modifying files unintentionally in case there is an unexpected
> link along the accessed path. The "nolinks" option is thus useful as a
> defensive measure against persistent exploits (i.e. a system getting
> re-exploited after a reboot) for systems that employ a read-only or
> dm-verity-protected rootfs. These systems prevent non-legit binaries
> from running after reboot. However, legit code typically still reads
> from and writes to a writable file system previously under full
> control of the attacker, who can place symlinks to trick file writes
> after reboot to target a file of their choice. "nolinks" fundamentally
> prevents this.

Which parts of the tree would be on that "protected" rootfs and which would
you mount with that option?  Description above is rather vague and I'm
not convinced that it actually buys you anything.  Details, please...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Oct. 14, 2016, 3 p.m. UTC | #2
On Fri, Oct 14, 2016 at 03:55:15PM +0100, Al Viro wrote:
> > Setting the "nolinks" mount option helps prevent privileged writers
> > from modifying files unintentionally in case there is an unexpected
> > link along the accessed path. The "nolinks" option is thus useful as a
> > defensive measure against persistent exploits (i.e. a system getting
> > re-exploited after a reboot) for systems that employ a read-only or
> > dm-verity-protected rootfs. These systems prevent non-legit binaries
> > from running after reboot. However, legit code typically still reads
> > from and writes to a writable file system previously under full
> > control of the attacker, who can place symlinks to trick file writes
> > after reboot to target a file of their choice. "nolinks" fundamentally
> > prevents this.
> 
> Which parts of the tree would be on that "protected" rootfs and which would
> you mount with that option?  Description above is rather vague and I'm
> not convinced that it actually buys you anything.  Details, please...

PS: what the hell do restrictions on _following_ symlinks have to do with
_creating_ hardlinks?  I'm trying to imagine a threat model where both would
apply or anything else beyond the word "link" they would have in common...

The one you've described above might have something to do with the first
one (modulo missing description of the setup you have in mind), but it
clearly has nothing to do with the second - attackers could've created
whatever they wanted while the fs had been under their control, after all.
Doesn't make sense...
Mattias Nissler Oct. 14, 2016, 3:50 p.m. UTC | #3
On Fri, Oct 14, 2016 at 5:00 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Oct 14, 2016 at 03:55:15PM +0100, Al Viro wrote:
> > > Setting the "nolinks" mount option helps prevent privileged writers
> > > from modifying files unintentionally in case there is an unexpected
> > > link along the accessed path. The "nolinks" option is thus useful as a
> > > defensive measure against persistent exploits (i.e. a system getting
> > > re-exploited after a reboot) for systems that employ a read-only or
> > > dm-verity-protected rootfs. These systems prevent non-legit binaries
> > > from running after reboot. However, legit code typically still reads
> > > from and writes to a writable file system previously under full
> > > control of the attacker, who can place symlinks to trick file writes
> > > after reboot to target a file of their choice. "nolinks" fundamentally
> > > prevents this.
> >
> > Which parts of the tree would be on that "protected" rootfs and which would
> > you mount with that option?  Description above is rather vague and I'm
> > not convinced that it actually buys you anything.  Details, please...

Apologies for the vague description; I'm happy to explain in detail.

In case of Chrome OS, we have all binaries on a dm-verity rootfs, so
an attacker can't modify any binaries. After reboot, everything except
the rootfs is mounted noexec, so there's no way to re-gain code
execution after reboot by modifying existing binaries or dropping new
ones.

We've seen multiple exploits now where the attacker worked around
these limitations in two steps:

1. Before reboot, the attacker sets up symlinks on the writeable file
system (called "stateful" file system), which are later accessed by
legit boot code (such as init scripts) after reboot. For example, an
init script that copies file A to B can be abused by an attacker by
symlinking or hardlinking B to a location C of their choice, and
placing desired data to be written to C in A. That gives the attacker
a primitive to write data of their choice to a path of their choice
after reboot. Note that this primitive may target locations _outside_
the stateful file system the attacker previously had control of.
Particularly of interest are targets on /sys, but also tmpfs on /run
etc.

2. The second step for a successful attack is finding some legit code
invoked in the boot flow that has a vulnerability exploitable by
feeding it unexpected data. As an example, there are Linux userspace
utilities that read config from /run which may contain shell commands
that the utility executes, through which the attacker can gain code
execution again.

The purpose of the proposed patch is to raise the bar for the first
step of the attack: Writing arbitrary files after reboot. I'm
intending to mount the stateful file system with the nolinks option
(or otherwise prevent symlink traversal). This will help make sure
that any legit writes taking place during boot in init scripts etc. go
to the files intended by the developer, and can't be redirected by an
attacker.

Does this make more sense to you?

>
>
> PS: what the hell do restrictions on _following_ symlinks have to do with
> _creating_ hardlinks?  I'm trying to imagine a threat model where both would
> apply or anything else beyond the word "link" they would have in common...

The restriction is not on _creating_ hard links, but _opening_
hardlinks. The commonality is in the confusion between the file you're
meaning to write vs. the file you actually end up writing to, which
stems from the fact that as things stand a file can be accessible on
other paths than its canonical one. For Chrome OS, I'd like to get to
a point where most privileged code can only access a file via its
canonical name (bind mounts are an OK exception as they're not
persistent, so out of reach for manipulation).
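
For illustration, the open-time restriction described above can be
approximated in userspace. This is a minimal sketch, assuming Python on
Linux; the helper name open_unaliased() is hypothetical and this is not
the code from the patch, which performs the check inside the kernel.

```python
import os
import stat


def open_unaliased(path, flags=os.O_RDONLY):
    """Open path, refusing a symlink in the final component and
    regular files that have more than one hard link.

    Hypothetical helper for illustration only.
    """
    # O_NOFOLLOW makes open() fail if the last component is a symlink.
    fd = os.open(path, flags | os.O_NOFOLLOW)
    try:
        st = os.fstat(fd)
        # A link count above one means the same inode is reachable
        # under at least one other name.
        if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
            raise PermissionError(
                f"{path}: hard link count is {st.st_nlink}")
    except BaseException:
        os.close(fd)
        raise
    return fd
```

Note that the check runs after open(), so flags with side effects such
as O_TRUNC would already have taken effect by the time the file is
rejected; a caller would want to open without them and truncate
afterwards.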

>
> The one you've described above might have something to do with the first
> one (modulo missing description of the setup you have in mind), but it
> clearly has nothing to do with the second - attackers could've created
> whatever they wanted while the fs had been under their control, after all.
> Doesn't make sense...
Mattias Nissler Oct. 14, 2016, 4:22 p.m. UTC | #4
Forgot to mention: I realize my motivation is very specific to Chrome
OS, however the nolinks option seemed useful also as a mitigation to
generic privilege escalation symlink attacks, for cases where
disabling symlinks/hardlinks is acceptable.

Mattias Nissler Oct. 17, 2016, 1:02 p.m. UTC | #5
OK, no more feedback thus far. Is there generally any interest in a
mount option to avoid path name aliasing resulting in target file
confusion? Perhaps a version that only disables symlinks instead of
also hard-disabling files hard-linked to multiple locations (those are
much lower risk for the situation I care about)?

If there is interest, I'm happy to iterate the patch until it's
accepted. If there's no interest, that's fine too - I'll then likely
resort to moving the restrictions desired for Chrome OS into an LSM we
compile into our kernels.

Austin S. Hemmelgarn Oct. 17, 2016, 2:14 p.m. UTC | #6
On 2016-10-17 09:02, Mattias Nissler wrote:
> OK, no more feedback thus far. Is there generally any interest in a
> mount option to avoid path name aliasing resulting in target file
> confusion? Perhaps a version that only disables symlinks instead of
> also hard-disabling files hard-linked to multiple locations (those are
> much lower risk for the situation I care about)?
>
> If there is interest, I'm happy to iterate the patch until it's
> accepted. If there's no interest, that's fine too - I'll then likely
> resort to moving the restrictions desired for Chrome OS into an LSM we
> compile into our kernels.
I can see the symlink related part potentially being useful in other 
cases, although if you do get rid of the hardlink portion, I'd suggest 
renaming the mount option to 'nosymlinks'.

One use that comes to mind is securing multi-protocol file servers (for 
example, something serving both NFS and SMB) where at least one protocol 
doesn't directly handle symlinks (or there is inconsistency among the 
protocols in how they're handled).
Colin Walters Oct. 18, 2016, 3:14 p.m. UTC | #7
On Mon, Oct 17, 2016, at 09:02 AM, Mattias Nissler wrote:
> OK, no more feedback thus far. Is there generally any interest in a
> mount option to avoid path name aliasing resulting in target file
> confusion? Perhaps a version that only disables symlinks instead of
> also hard-disabling files hard-linked to multiple locations (those are
> much lower risk for the situation I care about)?

So the situation here is a (privileged) process that is trying to read/write
to a filesystem tree writable by other processes that are in a separate
security domain?

That's a classic situation that requires extreme care, and I am doubtful
that symlinks are the only issue you're facing.  For example, if this
process is also *parsing* any data there, there's another whole source
of risk.

I suspect for you it wouldn't be too hard to have a "follow untrusted
path" helper function, it's possible to implement in userspace safely
with O_NOFOLLOW etc.

Regardless too, it sounds like what you want more is a
"same filesystem" traversal (stat and compare devices). 

Or does it even need to handle full traversal?  Would it have mitigated
the security issue to fstat() any files you opened and verified they
were from the writable partition you expected?
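
A minimal sketch of that "fstat and compare devices" check, assuming
Python on Linux. The helper name open_on_expected_fs() and its
mount_point parameter are hypothetical and only serve the illustration.

```python
import os


def open_on_expected_fs(path, mount_point, flags=os.O_RDONLY):
    """Open path and verify that the opened file lives on the same
    filesystem as mount_point.

    Hypothetical helper for illustration only.
    """
    expected_dev = os.stat(mount_point).st_dev
    fd = os.open(path, flags)
    try:
        # fstat() describes the file that was actually opened, after
        # any symlinks along the path have been followed.
        actual_dev = os.fstat(fd).st_dev
        if actual_dev != expected_dev:
            raise PermissionError(
                f"{path}: resolved to a different filesystem")
    except BaseException:
        os.close(fd)
        raise
    return fd
```

This catches a symlink that redirects to /sys or a tmpfs, but not one
that points at another file on the same writable partition.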



Mattias Nissler Oct. 19, 2016, 11:28 a.m. UTC | #8
On Tue, Oct 18, 2016 at 5:14 PM, Colin Walters <walters@verbum.org> wrote:
>
> On Mon, Oct 17, 2016, at 09:02 AM, Mattias Nissler wrote:
> > OK, no more feedback thus far. Is there generally any interest in a
> > mount option to avoid path name aliasing resulting in target file
> > confusion? Perhaps a version that only disables symlinks instead of
> > also hard-disabling files hard-linked to multiple locations (those are
> > much lower risk for the situation I care about)?
>
> So the situation here is a (privileged) process that is trying to read/write
> to a filesystem tree writable by other processes that are in a separate
> security domain?

More or less. The scenario is that of an attacker gaining root access,
followed by a reboot. The dm-verity-protected root FS raises the bar
for re-exploiting the system after reboot, but the fact that we'll
consume data from the writable file system during boot can be (and has
been) used to regain code execution.

You may argue that "all is lost" after the initial root exploit, but
we think there's value in hardening the system to the point where
getting a *persistent* exploit (i.e. retaining control across reboots)
is significantly harder. This increases our chances to successfully
auto-update devices even after a root exploit and thus retroactively
fix them.

>
> That's a classic situation that requires extreme care, and I am doubtful
> that symlinks are the only issue you're facing.  For example, if this
> process is also *parsing* any data there, there's another whole source
> of risk.

Agreed. Essentially, the entire writable file system has to be
regarded as untrusted input. Every process that stores or reads data
needs to do so cautiously. I'm aware this is quite a rabbit hole and
difficult to harden. In particular it'll require attention and changes
across a number of areas. The fact that we've seen multiple exploits
that rely on target file confusion by creating symlinks suggests that
there's value in blocking this as an attack vector though (in
particular given that we have only few legit uses of symlinks that are
relatively easy to clean up).

>
> I suspect for you it wouldn't be too hard to have a "follow untrusted
> path" helper function, it's possible to implement in userspace safely
> with O_NOFOLLOW etc.

Note that O_NOFOLLOW only affects the final path component. If there's
a symlink in any of the parent directories, that'll still be traversed
even with O_NOFOLLOW. This situation is less risky as an attacker will
have to deal with the restriction of a fixed filename in the last
component, but might still be exploitable.

>
> Regardless too, it sounds like what you want more is a
> "same filesystem" traversal (stat and compare devices).
>
> Or does it even need to handle full traversal?  Would it have mitigated
> the security issue to fstat() any files you opened and verified they
> were from the writable partition you expected?

These are all good ideas, and in fact I'm already looking into making
safe helpers to use when dealing with files from the writable file
system. There is an idea to even go a step further and mount the
writable file system in a separate mount namespace (to avoid
accidental access from the rest of the system). Then, the helper tools
would take care of performing file access in said mount namespace and
can prevent malicious symlinks by canonicalizing the requested path
and bailing out if it doesn't match the requested path.
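
The canonicalization check could look roughly like the following
sketch, assuming Python; the function name is_canonical() is
hypothetical.

```python
import os


def is_canonical(path):
    """Return True if no component of path is a symlink.

    Hypothetical helper for illustration only.
    """
    # abspath() only normalizes the string (".", "..", "//");
    # realpath() additionally resolves every symlink on the way.
    absolute = os.path.abspath(path)
    return os.path.realpath(absolute) == absolute
```

The check and the later open are separate steps, so this is only sound
when nothing else can modify the tree in between, which is what the
separate mount namespace is meant to help with.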

The difficulty lies in applying these measures of precaution
system-wide. This affects most init scripts and daemons, and
everything else that keeps state on the writable file system. Some of
the affected code we have written ourselves so is relatively easy to
fix, but we're also running a number of third party packages for which
things are more complicated. Going through with a fine-tooth comb and
auditing all file access is possible in theory, but hardly in
practice. So we'll have to make do with partial auditing and
implementing mitigations that reduce risk for the rest of the code,
which led to the idea of killing the symlink attack vector
systematically.

FWIW, I plan to send an updated patch for consideration later that
only disables symlinks.
Colin Walters Oct. 19, 2016, 2:35 p.m. UTC | #9
On Wed, Oct 19, 2016, at 07:28 AM, Mattias Nissler wrote:
> 
> Note that O_NOFOLLOW only affects the final path component. If there's
> a symlink in any of the parent directories, that'll still be traversed
> even with O_NOFOLLOW. This situation is less risky as an attacker will
> have to deal with the restriction of a fixed filename in the last
> component, but might still be exploitable.

Yeah, I meant that you'd walk the path string in userspace one
component at a time. That said, the "fstat at the end and check
device" approach seems a lot better, or perhaps the mount namespaces
could help.
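
For illustration, a minimal sketch that combines both ideas: walk the
path one component at a time with openat() and O_NOFOLLOW, then
fstat() the result and compare devices. The name
open_beneath_nofollow() and its exact error codes are made up for
this example; it assumes the caller already holds a directory fd for
the root of the writable file system and does not pass O_CREAT:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open `relpath` relative to the directory `rootfd`, refusing symlinks
 * in every component and refusing ".." outright. Leading slashes are
 * ignored, so the path is always interpreted beneath `rootfd`.
 * Returns a new fd, or -1 with errno set.
 */
static int open_beneath_nofollow(int rootfd, const char *relpath, int flags)
{
	struct stat root_st, st;
	char *buf, *save = NULL, *comp, *next;
	int dirfd, fd, oflags, err;

	assert(relpath != NULL);
	assert(!(flags & O_CREAT));	/* no mode argument is passed below */

	if (fstat(rootfd, &root_st) < 0)
		return -1;
	buf = strdup(relpath);		/* strtok_r() modifies its input */
	if (!buf)
		return -1;
	dirfd = fcntl(rootfd, F_DUPFD_CLOEXEC, 0);
	if (dirfd < 0) {
		err = errno;
		goto fail;
	}

	comp = strtok_r(buf, "/", &save);
	if (!comp) {
		err = EINVAL;		/* empty path */
		goto fail;
	}
	while (comp) {
		next = strtok_r(NULL, "/", &save);
		if (strcmp(comp, "..") == 0) {
			err = EXDEV;	/* never step out of the tree */
			goto fail;
		}
		/* Intermediate components must be real directories. */
		oflags = next ? (O_RDONLY | O_DIRECTORY) : flags;
		fd = openat(dirfd, comp, oflags | O_NOFOLLOW | O_CLOEXEC);
		if (fd < 0) {
			err = errno;	/* a symlink fails here */
			goto fail;
		}
		close(dirfd);
		dirfd = fd;
		comp = next;
	}

	/* "fstat at the end and check device" */
	if (fstat(dirfd, &st) < 0) {
		err = errno;
		goto fail;
	}
	if (st.st_dev != root_st.st_dev) {
		err = EXDEV;
		goto fail;
	}
	free(buf);
	return dirfd;

fail:
	if (dirfd >= 0)
		close(dirfd);
	free(buf);
	errno = err;
	return -1;
}
```

The device comparison will not notice a bind mount of the same file
system, so it complements the per-component walk rather than
replacing it.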

Also, don't forget about `setfsuid()`.

> The difficulty lies in applying these measures of precaution
> system-wide. This affects most init scripts and daemons, and
> everything else that keeps state on the writable file system. 

One thing to note is that, at least on the freedesktop.org/GNOME
side of things, we basically never have privileged processes
accessing user home directories anymore.

A good example is that GDM used to read ~username/.config/face.png
(or something like that) to show the user's picture on the login
screen, and that was subject to many of the same risks.

But we've migrated basically across the board to a model where
the unprivileged user session talks to privileged daemons via
a DBus (or other) API. In this case, the picture data is stored
in accountsservice. NetworkManager is another big example of
this: WiFi credentials can be per user, and the session passes
them to the privileged daemon over DBus, rather than having the
privileged process try to parse config files in the user's
homedir. It's a lot easier to secure.

Patch

diff --git a/fs/namei.c b/fs/namei.c
index a7f601c..f152687 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1021,6 +1021,10 @@  const char *get_link(struct nameidata *nd)
 		touch_atime(&last->link);
 	}
 
+	error = -EACCES;
+	if (nd->path.mnt->mnt_flags & MNT_NOLINKS)
+		return ERR_PTR(error);
+
 	error = security_inode_follow_link(dentry, inode,
 					   nd->flags & LOOKUP_RCU);
 	if (unlikely(error))
@@ -2919,7 +2923,10 @@  static int may_open(struct path *path, int acc_mode, int flag)
 	case S_IFIFO:
 	case S_IFSOCK:
 		flag &= ~O_TRUNC;
-		break;
+		/*FALLTHRU*/
+	default:
+		if ((path->mnt->mnt_flags & MNT_NOLINKS) && inode->i_nlink > 1)
+			return -EACCES;
 	}
 
 	error = inode_permission(inode, MAY_OPEN | acc_mode);
diff --git a/fs/namespace.c b/fs/namespace.c
index 58aca9c..c421fbb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2732,6 +2732,8 @@  long do_mount(const char *dev_name, const char __user *dir_name,
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
+	if (flags & MS_NOLINKS)
+		mnt_flags |= MNT_NOLINKS;
 
 	/* The default atime for remount is preservation */
 	if ((flags & MS_REMOUNT) &&
@@ -2741,9 +2743,9 @@  long do_mount(const char *dev_name, const char __user *dir_name,
 		mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
 	}
 
-	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
-		   MS_STRICTATIME | MS_NOREMOTELOCK);
+	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_NOLINKS | MS_ACTIVE |
+		   MS_BORN | MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+		   MS_KERNMOUNT | MS_STRICTATIME | MS_NOREMOTELOCK);
 
 	if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3f1190d..b5d9d35 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -67,6 +67,7 @@  static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_NOLINKS, ",nolinks" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..df4eb6a 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -28,6 +28,7 @@  struct mnt_namespace;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_NOLINKS	0x80
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
@@ -44,7 +45,7 @@  struct mnt_namespace;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_NOLINKS)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 2473272..6624ece 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -130,6 +130,7 @@  struct inodes_stat_t {
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
+#define MS_NOLINKS	(1<<26) /* Ignore symbolic and hard links */
 
 /* These sb flags are internal to the kernel */
 #define MS_NOREMOTELOCK	(1<<27)