secretmem: Prevent secretmem_users from wrapping to zero

Message ID	20211025181634.3889666-1-willy@infradead.org (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=AeVH=PN=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org A0F2560720 From: "Matthew Wilcox (Oracle)" <willy@infradead.org> To: Linus Torvalds <torvalds@linux-foundation.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>, Jordy Zomer <jordy@pwning.systems>, Kees Cook <keescook@chromium.org>, James Bottomley <James.Bottomley@HansenPartnership.com>, Mike Rapoport <rppt@kernel.org>, Andrew Morton <akpm@linux-foundation.org> Subject: [PATCH] secretmem: Prevent secretmem_users from wrapping to zero Date: Mon, 25 Oct 2021 19:16:34 +0100 Message-Id: <20211025181634.3889666-1-willy@infradead.org> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	secretmem: Prevent secretmem_users from wrapping to zero \| expand secretmem: Prevent secretmem_users from wrapping to zero

Matthew Wilcox Oct. 25, 2021, 6:16 p.m. UTC

Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t")
attempted to fix the problem of secretmem_users wrapping to zero and
allowing suspend once again.  Prevent secretmem_users from wrapping to
zero by forbidding new users if the number of users has wrapped from
positive to negative.  This stops a long way short of reaching the
necessary 4 billion users, so there's no need to be clever with special
anti-wrap types or checking the return value from atomic_inc().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jordy Zomer <jordy@pwning.systems>
Cc: Kees Cook <keescook@chromium.org>,
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/secretmem.c | 2 ++
 1 file changed, 2 insertions(+)

Kees Cook Oct. 25, 2021, 7:29 p.m. UTC | #1

On Mon, Oct 25, 2021 at 07:16:34PM +0100, Matthew Wilcox (Oracle) wrote:
> Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t")
> attempted to fix the problem of secretmem_users wrapping to zero and
> allowing suspend once again.  Prevent secretmem_users from wrapping to
> zero by forbidding new users if the number of users has wrapped from
> positive to negative.  This stops a long way short of reaching the
> necessary 4 billion users, so there's no need to be clever with special
> anti-wrap types or checking the return value from atomic_inc().

I still prefer refcount_t here because it provides deterministic
saturation, but the risk right now is so narrow ("don't hibernate"),
I'm not going to fight for it. I think it'd be fine to use it initialized
to 1, and have the removal check for == 0 as a failure state, which would
deterministically cover the underflow case too.

-Kees

> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jordy Zomer <jordy@pwning.systems>
> Cc: Kees Cook <keescook@chromium.org>,
> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/secretmem.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 030f02ddc7c1..c2dda408bb36 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -203,6 +203,8 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
>  
>  	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
>  		return -EINVAL;
> +	if (atomic_read(&secretmem_users) < 0)
> +		return -ENFILE;
>  
>  	fd = get_unused_fd_flags(flags & O_CLOEXEC);
>  	if (fd < 0)
> -- 
> 2.33.0
>

Matthew Wilcox Oct. 25, 2021, 7:51 p.m. UTC | #2

On Mon, Oct 25, 2021 at 12:29:46PM -0700, Kees Cook wrote:
> On Mon, Oct 25, 2021 at 07:16:34PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t")
> > attempted to fix the problem of secretmem_users wrapping to zero and
> > allowing suspend once again.  Prevent secretmem_users from wrapping to
> > zero by forbidding new users if the number of users has wrapped from
> > positive to negative.  This stops a long way short of reaching the
> > necessary 4 billion users, so there's no need to be clever with special
> > anti-wrap types or checking the return value from atomic_inc().
> 
> I still prefer refcount_t here because it provides deterministic
> saturation, but the risk right now is so narrow ("don't hibernate"),
> I'm not going to fight for it. I think it'd be fine to use it initialized
> to 1, and have the removal check for == 0 as a failure state, which would
> deterministically cover the underflow case too.

I still think that's abusing the refcount_t pattern.  refcount_t should
be for ... reference counts.  Not these other things.

Kees Cook Oct. 25, 2021, 9:04 p.m. UTC | #3

On Mon, Oct 25, 2021 at 08:51:40PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 25, 2021 at 12:29:46PM -0700, Kees Cook wrote:
> > On Mon, Oct 25, 2021 at 07:16:34PM +0100, Matthew Wilcox (Oracle) wrote:
> > > Commit 110860541f44 ("mm/secretmem: use refcount_t instead of atomic_t")
> > > attempted to fix the problem of secretmem_users wrapping to zero and
> > > allowing suspend once again.  Prevent secretmem_users from wrapping to
> > > zero by forbidding new users if the number of users has wrapped from
> > > positive to negative.  This stops a long way short of reaching the
> > > necessary 4 billion users, so there's no need to be clever with special
> > > anti-wrap types or checking the return value from atomic_inc().
> > 
> > I still prefer refcount_t here because it provides deterministic
> > saturation, but the risk right now is so narrow ("don't hibernate"),
> > I'm not going to fight for it. I think it'd be fine to use it initialized
> > to 1, and have the removal check for == 0 as a failure state, which would
> > deterministically cover the underflow case too.
> 
> I still think that's abusing the refcount_t pattern.  refcount_t should
> be for ... reference counts.  Not these other things.

Is secretmem different? We're trying to count how many of these we have,
this is a common pattern in, for example, the network code which does
this kind of thing a lot.

It's just that for a "global" counter like here, we're not tied to a
specific object's lifetime, but usually some global state that depends
on having _any_ of the objects alive.

Linus Torvalds Oct. 25, 2021, 9:17 p.m. UTC | #4

On Mon, Oct 25, 2021 at 2:04 PM Kees Cook <keescook@chromium.org> wrote:
>
> Is secretmem different? We're trying to count how many of these we have,
> this is a common pattern in, for example, the network code which does
> this kind of thing a lot.

Yes, secretmem is different.

A refcount being zero means that the data it referenced no longer exists.

That's not what the secretmem counter meant at all.

Making it a refcount was simply WRONG. It's not a refcount, it's a
usage count, and the fact that syzbot caught the warning just shows
how wrong it was.

Stop arguing for garbage. It was wrong, just admit it. The semantics
for "refcount" is something else than what that code had. As a result,
it got reverted. I've applied Willy's patch that actually makes sense.

Arguing for garbage in the name of "security" is still garbage. In
fact, it only causes confusion, and as such is likely to result in
problems - including security problems - later.

Because confusion about semantics is bad. And that was what that patch was.

And I want to state - again - how dangerous this "refcounts are always
prefereable to atomics" mental model is. Refcounts are _not_
fundamentally preferable to atomics. They are slower, bigger, and have
completely different semantics.

So if something isn't a refcount, it damn well shouldn't use "refcount_t".

             Linus

Kees Cook Oct. 25, 2021, 10:30 p.m. UTC | #5

On Mon, Oct 25, 2021 at 02:17:58PM -0700, Linus Torvalds wrote:
> On Mon, Oct 25, 2021 at 2:04 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > Is secretmem different? We're trying to count how many of these we have,
> > this is a common pattern in, for example, the network code which does
> > this kind of thing a lot.
> 
> Yes, secretmem is different.

Prefix: I'm fine with this being whatever; it's a corner case, so this
reply is mainly about nailing down the rationales for future decisions.

> A refcount being zero means that the data it referenced no longer exists.

I don't disagree with this definition, but I would like to understand how
some other use-cases fit into this. The case of basic allocated object memory
lifetime reference counting we all agree on. What about the case of what
I see that is more like a "shared resource usage count" where the shared
resource doesn't necessarily disappear when we reach "no users"?

i.e. there is some resource, and it starts its life with no one using it
(count = 1). Then we have users declaring their use (count = 2) and later
release (count = 1) of the resource. It's not really ever unallocated when
users == 0 (count == 1), but we might use the usage counter for other
things and don't want to let it go negative or ever increment from zero
(in case zero is used for marking the resource unavailable forever). For
example, protocols knowing if there are any sockets left open, crypto
API usage counts, kernel module usage counts, etc.

I don't see as clear a distinction between secretmem and the above
examples. The question being answered is "how many users do I have?" and
we want to make sure we don't end up with overflow or underflow given the
(unfortunately) common case of reference counting kernel bugs. The fact
that secretmem doesn't have its own allocation to free when it hits 0
seems like just an implementation detail of this particular resource
usage counter.

But, ignoring secretmem for a moment, what about a specific example
from above: the module loader's refcnt atomic_t, which is actually
maintaining an allocation (the module), and uses usage counts of 0 to mean
"I am removing this module", and 1 is "I have no users", 2 is "1 user",
etc. It seems like this should use refcount_t to me?

> Stop arguing for garbage. It was wrong, just admit it. The semantics
> for "refcount" is something else than what that code had. As a result,
> it got reverted. I've applied Willy's patch that actually makes sense.
> 
> Arguing for garbage in the name of "security" is still garbage. In
> fact, it only causes confusion, and as such is likely to result in
> problems - including security problems - later.

Sure, and reasonable people can disagree about what is garbage. :) I see
using a refcount_t here as a not unreasonable way to protect against
potential future security problems if the scope of secretmem ever grows
and more than just hibernation starts to care about this usage counter.

> Because confusion about semantics is bad. And that was what that patch was.
> 
> And I want to state - again - how dangerous this "refcounts are always
> prefereable to atomics" mental model is. Refcounts are _not_
> fundamentally preferable to atomics. They are slower, bigger, and have
> completely different semantics.

I will push back on this; I don't think that's a fair assessment at all.
For storage size, refcount_t is identical to atomic_t (refcount_t _is_
an atomic_t). But perhaps you meant code size? Yes, the refcount_t
helpers are technically larger. But like speed, where refcount_t is also
technically slower, neither size nor speed were so much changed that
we kept around the routines that made it exactly the same speed and
grew the instruction count by 1 (the original x86-specific refcount_t
implementation was just as fast as atomic_t). I don't see speed nor size
alone to be a good reason to say "don't use refcount_t".

But yes, I agree about the different semantics: there are very specific
memory ordering assumptions that tend to be more strict than atomic_t
(which IMO actually makes it more suitable than atomic_t for shared
usage counts).

> So if something isn't a refcount, it damn well shouldn't use "refcount_t".

Again, I don't disagree, but since it looked like a refcount to me, I'd
like to understand what why we don't see this case the same way. Since I
agree that secretmem is currently pretty iffy (nothing actually allocated
to track, only system state), I'll ask a slightly different question:
should the module loader use refcount_t? If not, why?

Linus Torvalds Oct. 25, 2021, 11:37 p.m. UTC | #6

On Mon, Oct 25, 2021 at 3:30 PM Kees Cook <keescook@chromium.org> wrote:
>
> > A refcount being zero means that the data it referenced no longer exists.
>
> I don't disagree with this definition, but I would like to understand how
> some other use-cases fit into this.

I certainly hope that there are no other use-cases for 'recount_t',
because that "zero is invalid" is very much part of the semantics.

If we want other semantics, it should be a new type.

>      What about the case of what
> I see that is more like a "shared resource usage count" where the shared
> resource doesn't necessarily disappear when we reach "no users"?

So I think that's really "atomic_t".

And instead of saturating, people should always check such shared
resources for limits.

> i.e. there is some resource, and it starts its life with no one using it
> (count = 1).

You are already going off into the weeds.

That's not a natural thing to do. It's already confusing. Really. Read
that sentence yourself, and read it like an outsider.

"No one is using it, so count == 1" is a nonsensican statement on the
face of it.

You are thinking of a refcount_t trick, not some sane semantics.

Yes, we have played off-by-one games in the kernel before. We've done
it for various subtle reasons.

For example, traditionally, on x86, with atomic counting there are
three special situations: negative, 0 and positive. So if you use the
traditional x86 counting atomics (just add/sub/inc/dec, no xadd) then
there are situations where you can get more information about the
result in %eflags if you don't use zero as the initial value, but -1.

Because then you can do "inc", and if ZF is set, you know you were the
_first_ person to increment it. And when you use "dec", and SF is set
afterwards, you know you are the _last_ person to decrement it.

That was useful when things like "xadd" weren't available, and cmpxchg
loops are expensive. So we used to have counters where -1 was that
"zero point". Very similar to your "1 is the zero point".

But was it _logical_? No. It was an implementation trick. I think
we've removed all those cases because it was so subtle and confusing
(but maybe we still have it somewhere - I did not check).

So we've certainly played those kinds of games. But it had better be
for a really good reason.

> I don't see as clear a distinction between secretmem and the above
> examples.

I really don't see what's wrong with 'atomic_t', and just checking for limits.

Saturating counters are EVIL AND BAD. They are a DoS waiting to
happen. Once you saturate, the machine is basically dead. You may have
"protected" against some attack, but you did so by killing the machine
and making the resource accounting no longer work.

So if a user can ever trigger a saturating counter, that's a big big
problem in itself.

In contrast, an 'atomic_t' with a simple limit? It just works.

And it doesn't need illogical tricks to work.

Stop thinking that refcount_t is a good type. Start realizing the
downsides. Start understanding that saturation is a HORRENDOUSLY BAD
solution, and horrible QoI.

              Linus

Linus Torvalds Oct. 26, 2021, 12:02 a.m. UTC | #7

On Mon, Oct 25, 2021 at 4:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Stop thinking that refcount_t is a good type. Start realizing the
> downsides. Start understanding that saturation is a HORRENDOUSLY BAD
> solution, and horrible QoI.

Basically, refcount_t should be used purely for internal kernel data
structures - it makes perfect sense for things like the 'struct
device' resource handling, for example.

In fact, there it is good for two reasons:

 - it's not counting some user resource, so users shouldn't have any
way to trigger overflow and saturation which causes problems

 - it's used by random driver stuff, which is often where kernel bugs
happen and testing is fundamentally limited by hw availability etc

but in general, anything that is user-accountable needs to have
_limits_, not saturation. It's why the page count is a "atomic_t" even
if the name of the field is "_refcount". Because refcount_t is the
INFERIOR TYPE.

Using an atomic_t properly is actually the much better option. It's
just that "properly" might be a bit more code, involving actual limit
checking.

'refcount_t' is basically a shorthand for "I didn't bother doing this
right, so I'm using this type that adds debugging, warns and stops
working and might DoS the kernel".

It's a crutch. It's not the alpha and the omega of counting types. It
has its place, but I really want to stress how people should
ABSOLUTELY not think "oh, refcount_t is better than atomic_t".

              Linus

Kees Cook Oct. 26, 2021, 12:17 a.m. UTC | #8

On Mon, Oct 25, 2021 at 04:37:01PM -0700, Linus Torvalds wrote:
> If we want other semantics, it should be a new type.

Okay, that's reasonable.

> > I see that is more like a "shared resource usage count" where the shared
> > resource doesn't necessarily disappear when we reach "no users"?
> 
> So I think that's really "atomic_t".
> 
> And instead of saturating, people should always check such shared
> resources for limits.

Right, but people make mistakes, etc. I agree about the limit being much
more sane than saturating (though in the cases of "missed decrement"),
we get to the same place: an open-coded check for the limit that never
goes down doesn't matter if it's refcount_t nor atomic_t. :)

> > i.e. there is some resource, and it starts its life with no one using it
> > (count = 1).
> 
> You are already going off into the weeds.
> 
> That's not a natural thing to do. It's already confusing. Really. Read
> that sentence yourself, and read it like an outsider.
> 
> "No one is using it, so count == 1" is a nonsensican statement on the
> face of it.
> 
> You are thinking of a refcount_t trick, not some sane semantics.
> 
> Yes, we have played off-by-one games in the kernel before. We've done
> it for various subtle reasons.

Right, sure, but it's not a rare pattern. Given that it exists, and that
it _does_ get used for allocation management (e.g. module loader), it
seems worth constructing a proper type for it so that all the open coded
stuff around these instances can be consolidated, and the API can be
defined in a way that will behave sanely.

> I really don't see what's wrong with 'atomic_t', and just checking for limits.

It's that last part. :) If we go through atomic_dec() see a zero and do
something, okay, fine. But these places need to check for insane
conditions too ("we got a -1 back -- this means there's a bug but what
do we do?"). Same for atomic_inc(): "oh, we're at our limit, do
something", but what above discovering ourselves above the limit?
There's nothing about using the atomic_t primitives that enforces these
kinds of checks. (And there likely shouldn't be for atomic_t -- it's a
plain type.) But we likely need something that fills in this API gap
between atomic_t and refcount_t.

> So if a user can ever trigger a saturating counter, that's a big big
> problem in itself.

Yes! It is. :) But they don't get to gain control over a Use-after-Free.
The risk to the system is DoS instead of loss of execution control.
That's a meaningful risk downgrade.

So, what's the right semantics for an atomic type that could be used in
the module loader, that would catch kernel counting bugs in a safe manner?

The "refcount_t but 1-based" is close, but clearly not the right name. :)

Linus Torvalds Oct. 26, 2021, 1:10 a.m. UTC | #9

On Mon, Oct 25, 2021 at 5:18 PM Kees Cook <keescook@chromium.org> wrote:
>
> Right, sure, but it's not a rare pattern.

Well, for an actual reference count it certainly isn't a rare pattern,
and zero _is_ special, because at zero, you are now in use-after-free
territory.

But that's kind of the issue here: that really isn't what
'secretmem_users' was ever about.

Zero isn't some "now we're use-after-free" situation. Quite the
reverse. Zero ends up being the safe thing.

So with that kind of "just count number of existing users", where zero
isn't special, then refcount_t doesn't make sense.

And refcount_t is for non-core stuff that has a lot of random kernel
users that you can't easily verify.

In contrast, 'secretmem_users' had exactly two sites that modified it,
and one that tested it.

> But these places need to check for insane
> conditions too ("we got a -1 back -- this means there's a bug but what
> do we do?"). Same for atomic_inc(): "oh, we're at our limit, do
> something", but what above discovering ourselves above the limit?

So honestly, "above the limit" is often perfectly fine too.

It can be fine for two very different reasons:

 (a) racy checks are often much simpler and faster, and perfectly safe
when the limit is "far away from overflow".

 (b) limits can change

And (a) isn't just about "avoid special atomics". It's about doing the
limit check optimistically outside locking etc.

And (b) wasn't an issue here (where the only real use was ltierally
"are there any users at all"), but in most _proper_ use cases you will
want to have some resource limit that might be set by MIS. And might
be changed dynamically.

So it's entirely possible that somebody sets the limit to something
smaller than the current user (prep for shutdown, whatever), without
it being an error at all.

The limit is for future work, not for past work. Easily happens with
things like rlimits etc.

> There's nothing about using the atomic_t primitives that enforces these
> kinds of checks. (And there likely shouldn't be for atomic_t -- it's a
> plain type.) But we likely need something that fills in this API gap
> between atomic_t and refcount_t.

I dispute the "need". This isn't as common as you claim. Most resource
counting _is_ for "free when no longer used".

And on the other end, you have the users that don't want refcount_t
because they can't live with the limitations of that interface, like
the page counts etc, that do it properly.

So I think in 99% of all situations, the proper fix is to embed an
"atomic_t" in the type it protects, and then have the helper functions
that actually do it properly. Like we do for "get_page()" and friends.
The "new type" isn't about the reference counting, it's about the data
itself, and the atomic_t is just a part of it.

Could we do something new type that warns on the "decrement past zero"
and "overflow on increment"? Sure. But since they by _definition_
aren't about data lifetimes, they probably don't need saturation - you
want the _warning_, but they aren't protecting data, since they aren't
refcounts.

Or could we have something even fancier, that is an actual defined
range, and "overflow" is not "overflow in 32 bits", but "becomes
bigger than X")? That gets more complex because now you'd have to
encode the range in the type somehow.

You could do it with actual static types (generate typedef names and
code), or you could do it with types that have a more dynamic pointer
to ranges (kind of like the sysfs interfaces do) or have the ranges
embedded in the data structure itself.

But honestly, the complexity downside seems to just dwarf the upside.

                  Linus

Matthew Wilcox Oct. 26, 2021, 1:34 a.m. UTC | #10

On Mon, Oct 25, 2021 at 04:37:01PM -0700, Linus Torvalds wrote:
> For example, traditionally, on x86, with atomic counting there are
> three special situations: negative, 0 and positive. So if you use the
> traditional x86 counting atomics (just add/sub/inc/dec, no xadd) then
> there are situations where you can get more information about the
> result in %eflags if you don't use zero as the initial value, but -1.
> 
> Because then you can do "inc", and if ZF is set, you know you were the
> _first_ person to increment it. And when you use "dec", and SF is set
> afterwards, you know you are the _last_ person to decrement it.
> 
> That was useful when things like "xadd" weren't available, and cmpxchg
> loops are expensive. So we used to have counters where -1 was that
> "zero point". Very similar to your "1 is the zero point".
> 
> But was it _logical_? No. It was an implementation trick. I think
> we've removed all those cases because it was so subtle and confusing
> (but maybe we still have it somewhere - I did not check).

We still do it for page->_mapcount; it's biased to -1 so that both
"the page has exactly one mapping" and "the page has no mapping" are
cheaply checkable conditions.

But, as you say, that's invisible to the users.  page_mapcount() is
return atomic_read(&page->_mapcount) + 1;

secretmem: Prevent secretmem_users from wrapping to zero

Commit Message

Comments

Patch