diff mbox series

[RFC,3/8] mm: Avoid using set_page_count() in set_page_recounted()

Message ID 20211026173822.502506-4-pasha.tatashin@soleen.com (mailing list archive)
State New
Headers show
Series Hardening page _refcount | expand

Commit Message

Pasha Tatashin Oct. 26, 2021, 5:38 p.m. UTC
set_page_refcounted() converts a non-refcounted page that has
(page->_refcount == 0) into a refcounted page by setting _refcount to 1,

Use page_ref_inc_return() instead to avoid unconditionally overwriting
the _refcount value.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 mm/internal.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

John Hubbard Oct. 26, 2021, 5:53 p.m. UTC | #1
On 10/26/21 10:38, Pasha Tatashin wrote:
> set_page_refcounted() converts a non-refcounted page that has
> (page->_refcount == 0) into a refcounted page by setting _refcount to 1,
> 
> Use page_ref_inc_return() instead to avoid unconditionally overwriting
> the _refcount value.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>   mm/internal.h | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index cf3cb933eba3..cf345fac6894 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page)
>    */
>   static inline void set_page_refcounted(struct page *page)
>   {
> +	int refcnt;
> +
>   	VM_BUG_ON_PAGE(PageTail(page), page);
>   	VM_BUG_ON_PAGE(page_ref_count(page), page);
> -	set_page_count(page, 1);
> +	refcnt = page_ref_inc_return(page);
> +	VM_BUG_ON_PAGE(refcnt != 1, page);

Hi Pavel,

I am acutely uncomfortable with this change, because it changes the
meaning and behavior of the function to something completely different,
while leaving the function name unchanged. Furthermore, in relies upon
debug assertions, rather than a return value (for example) to verify
that all is well.

I understand where this patchset is going, but this intermediate step is
not a good move.

Also, for the overall series, if you want to change from
"set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
do that is *not* to depend solely on VM_BUG*() to verify. Instead,
return something like -EBUSY if incrementing the value results in a
surprise, and let the caller decide how to handle it.

thanks,
John Hubbard Oct. 26, 2021, 6:01 p.m. UTC | #2
On 10/26/21 10:53, John Hubbard wrote:
> On 10/26/21 10:38, Pasha Tatashin wrote:
>> set_page_refcounted() converts a non-refcounted page that has
>> (page->_refcount == 0) into a refcounted page by setting _refcount to 1,
>>
>> Use page_ref_inc_return() instead to avoid unconditionally overwriting
>> the _refcount value.
>>
>> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> ---
>>   mm/internal.h | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index cf3cb933eba3..cf345fac6894 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page)
>>    */
>>   static inline void set_page_refcounted(struct page *page)
>>   {
>> +    int refcnt;
>> +
>>       VM_BUG_ON_PAGE(PageTail(page), page);
>>       VM_BUG_ON_PAGE(page_ref_count(page), page);
>> -    set_page_count(page, 1);
>> +    refcnt = page_ref_inc_return(page);
>> +    VM_BUG_ON_PAGE(refcnt != 1, page);
> 
> Hi Pavel,

ohhh, s/Pavel/Pasha/ !

Huge apologies for the name mixup, I had just seen another email...
very sorry about that mistake.


thanks,
Pasha Tatashin Oct. 26, 2021, 6:14 p.m. UTC | #3
> > Hi Pavel,
>
> ohhh, s/Pavel/Pasha/ !
>
> Huge apologies for the name mixup, I had just seen another email...
> very sorry about that mistake.

Hi John,

Do not need to be sorry, Pavel == Pasha :) In Russian it is the same
name; I've been trying to reduce the confusion, by converting
everything to Pasha.

Pasha
Pasha Tatashin Oct. 26, 2021, 6:21 p.m. UTC | #4
Hi John,

Thank you for looking at this series.

> >   static inline void set_page_refcounted(struct page *page)
> >   {
> > +     int refcnt;
> > +
> >       VM_BUG_ON_PAGE(PageTail(page), page);
> >       VM_BUG_ON_PAGE(page_ref_count(page), page);
> > -     set_page_count(page, 1);
> > +     refcnt = page_ref_inc_return(page);
> > +     VM_BUG_ON_PAGE(refcnt != 1, page);


> I am acutely uncomfortable with this change, because it changes the
> meaning and behavior of the function to something completely different,
> while leaving the function name unchanged. Furthermore, in relies upon
> debug assertions, rather than a return value (for example) to verify
> that all is well.


It must return the same thing, if it does not we have a bug in our
kernel which may lead to memory corruptions and security holes.

So today we have this:
   VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
   < What if something modified here? Hmm..>
   set_page_count(page, 1); -> Yet we reset it to 1.

With my proposed change:
   VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
   refcnt = page_ref_inc_return(page);  -> ref_count better be 1.
   VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.

>
> I understand where this patchset is going, but this intermediate step is
> not a good move.
>
> Also, for the overall series, if you want to change from
> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> return something like -EBUSY if incrementing the value results in a
> surprise, and let the caller decide how to handle it.

Actually, -EBUSY would be OK if the problems were because we failed to
modify refcount for some reason, but if we modified refcount and got
an unexpected value (i.e underflow/overflow) we better report it right
away instead of waiting for memory corruption to happen.

Thanks,
Pasha
John Hubbard Oct. 27, 2021, 5:12 a.m. UTC | #5
On 10/26/21 11:21, Pasha Tatashin wrote:
> It must return the same thing, if it does not we have a bug in our
> kernel which may lead to memory corruptions and security holes.
> 
> So today we have this:
>     VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
>     < What if something modified here? Hmm..>
>     set_page_count(page, 1); -> Yet we reset it to 1.
> 
> With my proposed change:
>     VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
>     refcnt = page_ref_inc_return(page);  -> ref_count better be 1.
>     VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.
> 

Yes, you are just repeating what the diffs say.

But it's still not good to have this function name doing something completely
different than its name indicates.

>>
>> I understand where this patchset is going, but this intermediate step is
>> not a good move.
>>
>> Also, for the overall series, if you want to change from
>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
>> return something like -EBUSY if incrementing the value results in a
>> surprise, and let the caller decide how to handle it.
> 
> Actually, -EBUSY would be OK if the problems were because we failed to
> modify refcount for some reason, but if we modified refcount and got
> an unexpected value (i.e underflow/overflow) we better report it right
> away instead of waiting for memory corruption to happen.
> 

Having the caller do the BUG() or VM_BUG*() is not a significant delay.


thanks,
Pasha Tatashin Oct. 27, 2021, 6:27 p.m. UTC | #6
On Wed, Oct 27, 2021 at 1:12 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/26/21 11:21, Pasha Tatashin wrote:
> > It must return the same thing, if it does not we have a bug in our
> > kernel which may lead to memory corruptions and security holes.
> >
> > So today we have this:
> >     VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
> >     < What if something modified here? Hmm..>
> >     set_page_count(page, 1); -> Yet we reset it to 1.
> >
> > With my proposed change:
> >     VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
> >     refcnt = page_ref_inc_return(page);  -> ref_count better be 1.
> >     VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.
> >
>
> Yes, you are just repeating what the diffs say.
>
> But it's still not good to have this function name doing something completely
> different than its name indicates.

I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?

>
> >>
> >> I understand where this patchset is going, but this intermediate step is
> >> not a good move.
> >>
> >> Also, for the overall series, if you want to change from
> >> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> >> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> >> return something like -EBUSY if incrementing the value results in a
> >> surprise, and let the caller decide how to handle it.
> >
> > Actually, -EBUSY would be OK if the problems were because we failed to
> > modify refcount for some reason, but if we modified refcount and got
> > an unexpected value (i.e underflow/overflow) we better report it right
> > away instead of waiting for memory corruption to happen.
> >
>
> Having the caller do the BUG() or VM_BUG*() is not a significant delay.

We cannot guarantee that new callers in the future will check return
values, the idea behind this work is to ensure that we are always
protected from refcount underflow/overflow and invalid refcount
modifications by set_refcount.

Pasha
John Hubbard Oct. 28, 2021, 1:20 a.m. UTC | #7
On 10/27/21 11:27, Pasha Tatashin wrote:
> On Wed, Oct 27, 2021 at 1:12 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 10/26/21 11:21, Pasha Tatashin wrote:
>>> It must return the same thing, if it does not we have a bug in our
>>> kernel which may lead to memory corruptions and security holes.
>>>
>>> So today we have this:
>>>      VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
>>>      < What if something modified here? Hmm..>
>>>      set_page_count(page, 1); -> Yet we reset it to 1.
>>>
>>> With my proposed change:
>>>      VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
>>>      refcnt = page_ref_inc_return(page);  -> ref_count better be 1.
>>>      VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.
>>>
>>
>> Yes, you are just repeating what the diffs say.
>>
>> But it's still not good to have this function name doing something completely
>> different than its name indicates.
> 
> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
> 

What? No, that's not where I was going at all. The function is already
named set_page_refcounted(), and one of the problems I see is that your
changes turn it into something that most certainly does not
set_page_refounted(). Instead, this patch *increments* the refcount.
That is not the same thing.

And then it uses a .config-sensitive assertion to "prevent" problems.
And by that I mean, the wording throughout this series seems to equate
VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
however, in CONFIG_DEBUG_VM configurations, and provide no protection at
all for normal (most distros) users. That's something that the wording,
comments, and even design should be tweaked to account for.


>>
>>>>
>>>> I understand where this patchset is going, but this intermediate step is
>>>> not a good move.
>>>>
>>>> Also, for the overall series, if you want to change from
>>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
>>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
>>>> return something like -EBUSY if incrementing the value results in a
>>>> surprise, and let the caller decide how to handle it.
>>>
>>> Actually, -EBUSY would be OK if the problems were because we failed to
>>> modify refcount for some reason, but if we modified refcount and got
>>> an unexpected value (i.e underflow/overflow) we better report it right
>>> away instead of waiting for memory corruption to happen.
>>>
>>
>> Having the caller do the BUG() or VM_BUG*() is not a significant delay.
> 
> We cannot guarantee that new callers in the future will check return
> values, the idea behind this work is to ensure that we are always
> protected from refcount underflow/overflow and invalid refcount
> modifications by set_refcount.
> 

I don't have a problem with putting assertions closest to where they should
fire. That's a good thing. I'm looking here for ways to fix up the problems
listed in the points above, though.

And I do want to point out another thing, though, and that is: generally, we
don't have to program to quite the level of defensiveness you seem to be at.
If return values must be checked, they usually are in the kernel--and we even
have tooling to enforce it:

/*
  *   gcc: 
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-warn_005funused_005fresult-function-attribute
  * clang: https://clang.llvm.org/docs/AttributeReference.html#nodiscard-warn-unused-result
  */
#define __must_check                    __attribute__((__warn_unused_result__))


Please take that into consideration when weighing tradeoffs, just sort of in
general.



thanks,
John Hubbard Oct. 28, 2021, 1:35 a.m. UTC | #8
On 10/27/21 18:20, John Hubbard wrote:
>>> But it's still not good to have this function name doing something completely
>>> different than its name indicates.
>>
>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>
> 
> What? No, that's not where I was going at all. The function is already
> named set_page_refcounted(), and one of the problems I see is that your
> changes turn it into something that most certainly does not
> set_page_refounted(). Instead, this patch *increments* the refcount.
> That is not the same thing.
> 
> And then it uses a .config-sensitive assertion to "prevent" problems.
> And by that I mean, the wording throughout this series seems to equate
> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
> all for normal (most distros) users. That's something that the wording,
> comments, and even design should be tweaked to account for.

...and to clarify a bit more, maybe this also helps:

These patches are attempting to improve debugging, and that is fine, as
far as debugging goes. However, a point that seems to be slightly
misunderstood is: incrementing a bad refcount value is not actually any
better than overwriting it, from a recovery point of view. Maybe (?)
it's better from a debugging point of view.

That's because the problem occurred before this code, and its debug-only
assertions, ran. Once here, the code cannot actually recover: there is
no automatic way to recover from a refcount that it 1, -1, 2, or 706,
when it was supposed to be zero. Incrementing it is, again, not really
necessarily better than setting: setting it might actually make the
broken system appear to run--and in some cases, even avoid symptoms.
Whereas incrementing doesn't cover anything up. The only thing you can
really does is just panic() or BUG(), really.

Don't get me wrong, I don't want bugs covered up. But the claim that
incrementing is somehow better deserves some actual thinking about it.

Overall, I'm inclined to *not* switch anything over to incrementing the
refcounts. Instead, go ahead and:

a) Add assertions up to a "reasonable" point (some others have pointed
out that you don't need quite all of the assertions you've added).

b) Remove set_page_count() calls where possible--maybe everywhere.

c) Fix any bugs found along the way.

d) ...but, leave the basic logic as-is: no changing over to
page_ref_inc_return().

Anyway, that's my take on it.

  thanks,
Pasha Tatashin Nov. 1, 2021, 2:22 p.m. UTC | #9
> >> Yes, you are just repeating what the diffs say.
> >>
> >> But it's still not good to have this function name doing something completely
> >> different than its name indicates.
> >
> > I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
> >
>
> What? No, that's not where I was going at all. The function is already
> named set_page_refcounted(), and one of the problems I see is that your
> changes turn it into something that most certainly does not
> set_page_refounted(). Instead, this patch *increments* the refcount.
> That is not the same thing.
>
> And then it uses a .config-sensitive assertion to "prevent" problems.
> And by that I mean, the wording throughout this series seems to equate
> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
> all for normal (most distros) users. That's something that the wording,
> comments, and even design should be tweaked to account for.

VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
sensitive, but in both cases *BUG_ON() means that there is an
unrecoverable problem that occured. The only difference between the
two is that VM_BUG_ON() is not enabled when distros decide to reduce
the size of their kernel and improve runtime performance by skipping
some extra checking.

There is no logical separation between VM_BUG_ON and BUG_ON, there is
been a lengthy discussion about this:

https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com/
"so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
just allows some people to build smaller kernels, but apparently
distro people would rather have debugging than save a few kB of RAM."

Losing control of ref_count is an unrecoverable problem because it
leads to security sensitive memory corruptions. It is better to crash
the kernel when that happens instead of ending up with some pages
mapped into the wrong address space.

The races are tricky to spot, but set_page_count() is inherently
dangerous, so I am removing it entirely and replacing it with safer
operations which do the same thing.

One example is this:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29

> >>>> I understand where this patchset is going, but this intermediate step is
> >>>> not a good move.
> >>>>
> >>>> Also, for the overall series, if you want to change from
> >>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> >>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> >>>> return something like -EBUSY if incrementing the value results in a
> >>>> surprise, and let the caller decide how to handle it.

In set_page_refcounted() we already have:

VM_BUG_ON_PAGE(page_ref_count(page), page);
set_page_count(page, 1);

I am pointing out that above code is racy:

Between the check VM_BUG_ON_PAGE() check and unconditional set to 1
the value of page->_refcount can change.

I am replacing it with an identical version of code that is not racy.
There is no need to complicate the code by introducing new -EBUSY
returns here, as it would reduce the fragility of this could even
farther.

> >>> Actually, -EBUSY would be OK if the problems were because we failed to

I am not sure -EBUSY would be OK here, it means we had a race which we
were not aware about, and which could have led to memory corruptions.

> >>> modify refcount for some reason, but if we modified refcount and got
> >>> an unexpected value (i.e underflow/overflow) we better report it right
> >>> away instead of waiting for memory corruption to happen.
> >>>
> >>
> >> Having the caller do the BUG() or VM_BUG*() is not a significant delay.

I agree, however, helper functions exist to remove code duplications.
If we must verify the assumption of set_page_refcounted() that non
counted page is turned into a counted page, it is better to do it in
one place than at every call site. We do it today in thus helper
function, I do not see why we would change that.
Pasha Tatashin Nov. 1, 2021, 2:30 p.m. UTC | #10
On Wed, Oct 27, 2021 at 9:35 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/27/21 18:20, John Hubbard wrote:
> >>> But it's still not good to have this function name doing something completely
> >>> different than its name indicates.
> >>
> >> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
> >>
> >
> > What? No, that's not where I was going at all. The function is already
> > named set_page_refcounted(), and one of the problems I see is that your
> > changes turn it into something that most certainly does not
> > set_page_refounted(). Instead, this patch *increments* the refcount.
> > That is not the same thing.
> >
> > And then it uses a .config-sensitive assertion to "prevent" problems.
> > And by that I mean, the wording throughout this series seems to equate
> > VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
> > however, in CONFIG_DEBUG_VM configurations, and provide no protection at
> > all for normal (most distros) users. That's something that the wording,
> > comments, and even design should be tweaked to account for.
>
> ...and to clarify a bit more, maybe this also helps:
>
> These patches are attempting to improve debugging, and that is fine, as

They are attempting to catch potentioal race conditions where
_refcount is changed between the time we verified what it was and we
set it to something else.

They also attempt to prevent overflows and underflows bugs which are
not all tested today, but can be tested with this patch set at least
on kernels where DEBUG_VM is enabled.

> far as debugging goes. However, a point that seems to be slightly
> misunderstood is: incrementing a bad refcount value is not actually any
> better than overwriting it, from a recovery point of view. Maybe (?)
> it's better from a debugging point of view.

It is better for debugging as well: if one is tracing the page
_refcount history, knowing that the _refcount can only be
incremented/decremented/frozen/unfrozen provides a contiguous history
of refcount that can be tracked. In case when we set refcount in some
places as we do today, the contigous history is lost, as we do not
know the actual _refcount value at the time of the set operation.

>
> That's because the problem occurred before this code, and its debug-only
> assertions, ran. Once here, the code cannot actually recover: there is
> no automatic way to recover from a refcount that it 1, -1, 2, or 706,
> when it was supposed to be zero. Incrementing it is, again, not really
> necessarily better than setting: setting it might actually make the
> broken system appear to run--and in some cases, even avoid symptoms.
> Whereas incrementing doesn't cover anything up. The only thing you can
> really does is just panic() or BUG(), really.

This is what my patch series attempt to do, I chose to use VM_BUG()
instead of BUG() because this is VM code, and avoid potential
performance regressions for those who chose performance over possible
security implications.

>
> Don't get me wrong, I don't want bugs covered up. But the claim that
> incrementing is somehow better deserves some actual thinking about it.

I think it does, I described my points above, if you still disagree
please let me know.

Thank you for providing your thoughts on this RFC, I will send out a
new version, and we can continue discussion in the new thread.

Pasha
John Hubbard Nov. 1, 2021, 7:31 p.m. UTC | #11
On 11/1/21 07:22, Pasha Tatashin wrote:
>>>> Yes, you are just repeating what the diffs say.
>>>>
>>>> But it's still not good to have this function name doing something completely
>>>> different than its name indicates.
>>>
>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>>
>>
>> What? No, that's not where I was going at all. The function is already
>> named set_page_refcounted(), and one of the problems I see is that your
>> changes turn it into something that most certainly does not
>> set_page_refounted(). Instead, this patch *increments* the refcount.
>> That is not the same thing.
>>
>> And then it uses a .config-sensitive assertion to "prevent" problems.
>> And by that I mean, the wording throughout this series seems to equate
>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
>> all for normal (most distros) users. That's something that the wording,
>> comments, and even design should be tweaked to account for.
> 
> VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
> sensitive, but in both cases *BUG_ON() means that there is an
> unrecoverable problem that occured. The only difference between the
> two is that VM_BUG_ON() is not enabled when distros decide to reduce
> the size of their kernel and improve runtime performance by skipping
> some extra checking.
> 
> There is no logical separation between VM_BUG_ON and BUG_ON, there is
> been a lengthy discussion about this:
> 
> https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com/
> "so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
> just allows some people to build smaller kernels, but apparently
> distro people would rather have debugging than save a few kB of RAM."
> 
> Losing control of ref_count is an unrecoverable problem because it
> leads to security sensitive memory corruptions. It is better to crash
> the kernel when that happens instead of ending up with some pages
> mapped into the wrong address space.
> 
> The races are tricky to spot, but set_page_count() is inherently
> dangerous, so I am removing it entirely and replacing it with safer
> operations which do the same thing.
> 
> One example is this:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29

I don't think we have any disagreement about the landscape here. But it's
much easier to describe the problem than it is to fix it--as always. And
repeating the problem doesn't make a proposed fix more (or less) appropriate. :)

> 
>>>>>> I understand where this patchset is going, but this intermediate step is
>>>>>> not a good move.
>>>>>>
>>>>>> Also, for the overall series, if you want to change from
>>>>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
>>>>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
>>>>>> return something like -EBUSY if incrementing the value results in a
>>>>>> surprise, and let the caller decide how to handle it.
> 
> In set_page_refcounted() we already have:
> 
> VM_BUG_ON_PAGE(page_ref_count(page), page);
> set_page_count(page, 1);
> 
> I am pointing out that above code is racy:
> 
> Between the check VM_BUG_ON_PAGE() check and unconditional set to 1
> the value of page->_refcount can change.
> 
> I am replacing it with an identical version of code that is not racy.

And I'm pointing out that raciness is not the real bug, or at least, not
the only bug. "Fixing" the race does not fix the code, but the patch
series seems to imply that it does.

> There is no need to complicate the code by introducing new -EBUSY
> returns here, as it would reduce the fragility of this could even
> farther.
> 
>>>>> Actually, -EBUSY would be OK if the problems were because we failed to
> 
> I am not sure -EBUSY would be OK here, it means we had a race which we
> were not aware about, and which could have led to memory corruptions.
> 
>>>>> modify refcount for some reason, but if we modified refcount and got
>>>>> an unexpected value (i.e underflow/overflow) we better report it right
>>>>> away instead of waiting for memory corruption to happen.
>>>>>
>>>>
>>>> Having the caller do the BUG() or VM_BUG*() is not a significant delay.
> 
> I agree, however, helper functions exist to remove code duplications.
> If we must verify the assumption of set_page_refcounted() that non
> counted page is turned into a counted page, it is better to do it in
> one place than at every call site. We do it today in thus helper
> function, I do not see why we would change that.
> 

Let's ignore this -EBUSY idea for now, because I'm not sure where you are
going with your next version, and maybe it won't even come up.


thanks,
John Hubbard Nov. 1, 2021, 7:35 p.m. UTC | #12
On 11/1/21 07:30, Pasha Tatashin wrote:
> On Wed, Oct 27, 2021 at 9:35 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 10/27/21 18:20, John Hubbard wrote:
>>>>> But it's still not good to have this function name doing something completely
>>>>> different than its name indicates.
>>>>
>>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>>>
>>>
>>> What? No, that's not where I was going at all. The function is already
>>> named set_page_refcounted(), and one of the problems I see is that your
>>> changes turn it into something that most certainly does not
>>> set_page_refounted(). Instead, this patch *increments* the refcount.
>>> That is not the same thing.
>>>
>>> And then it uses a .config-sensitive assertion to "prevent" problems.
>>> And by that I mean, the wording throughout this series seems to equate
>>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
>>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
>>> all for normal (most distros) users. That's something that the wording,
>>> comments, and even design should be tweaked to account for.
>>
>> ...and to clarify a bit more, maybe this also helps:
>>
>> These patches are attempting to improve debugging, and that is fine, as
> 
> They are attempting to catch potentioal race conditions where
> _refcount is changed between the time we verified what it was and we
> set it to something else.
> 
> They also attempt to prevent overflows and underflows bugs which are
> not all tested today, but can be tested with this patch set at least
> on kernels where DEBUG_VM is enabled.

OK, but did you get my point about the naming problem?

> 
>> far as debugging goes. However, a point that seems to be slightly
>> misunderstood is: incrementing a bad refcount value is not actually any
>> better than overwriting it, from a recovery point of view. Maybe (?)
>> it's better from a debugging point of view.
> 
> It is better for debugging as well: if one is tracing the page
> _refcount history, knowing that the _refcount can only be
> incremented/decremented/frozen/unfrozen provides a contiguous history
> of refcount that can be tracked. In case when we set refcount in some
> places as we do today, the contigous history is lost, as we do not
> know the actual _refcount value at the time of the set operation.
> 

OK, that is a reasonable argument. Let's put it somewhere, maybe in a
comment block, if it's not already there.

>>
>> That's because the problem occurred before this code, and its debug-only
>> assertions, ran. Once here, the code cannot actually recover: there is
>> no automatic way to recover from a refcount that it 1, -1, 2, or 706,
>> when it was supposed to be zero. Incrementing it is, again, not really
>> necessarily better than setting: setting it might actually make the
>> broken system appear to run--and in some cases, even avoid symptoms.
>> Whereas incrementing doesn't cover anything up. The only thing you can
>> really does is just panic() or BUG(), really.
> 
> This is what my patch series attempt to do, I chose to use VM_BUG()
> instead of BUG() because this is VM code, and avoid potential
> performance regressions for those who chose performance over possible
> security implications.

Yes, the VM_BUG() vs. BUG() is awkward. But you cannot rely on VM_BUG()
to stop the system, even if Fedora does turn it on.

> 
>>
>> Don't get me wrong, I don't want bugs covered up. But the claim that
>> incrementing is somehow better deserves some actual thinking about it.
> 
> I think it does, I described my points above, if you still disagree
> please let me know.
> 
> Thank you for providing your thoughts on this RFC, I will send out a
> new version, and we can continue discussion in the new thread.
> 
> Pasha
> 

Yes, let's see what it looks like.

thanks,
John Hubbard Nov. 1, 2021, 7:42 p.m. UTC | #13
On 11/1/21 07:22, Pasha Tatashin wrote:
>>>> Yes, you are just repeating what the diffs say.
>>>>
>>>> But it's still not good to have this function name doing something completely
>>>> different than its name indicates.
>>>
>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>>
>>
>> What? No, that's not where I was going at all. The function is already
>> named set_page_refcounted(), and one of the problems I see is that your
>> changes turn it into something that most certainly does not
>> set_page_refounted(). Instead, this patch *increments* the refcount.
>> That is not the same thing.
>>
>> And then it uses a .config-sensitive assertion to "prevent" problems.
>> And by that I mean, the wording throughout this series seems to equate
>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
>> all for normal (most distros) users. That's something that the wording,
>> comments, and even design should be tweaked to account for.
> 
> VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
> sensitive, but in both cases *BUG_ON() means that there is an
> unrecoverable problem that occured. The only difference between the
> two is that VM_BUG_ON() is not enabled when distros decide to reduce
> the size of their kernel and improve runtime performance by skipping
> some extra checking.
> 
> There is no logical separation between VM_BUG_ON and BUG_ON, there is
> been a lengthy discussion about this:

Actually I do want to mention one more thing about this, before we move
on to the next version of the patchset. The above is inaccurate. The
intent of VM_BUG_ON() and BUG_ON() is similar, but there is *definitely*
a logical separation between them: they do not behave the same at runtime.

Just because some distros enable VM_BUG_ON(), does not mean that we can
treat it the same as BUG_ON() in "both directions". From a "don't BUG()
crash the kernel unless really warranted", they are about the same, as
Linus keeps repeating. From the other direction, though ("I need to BUG()-
crash the kernel"), they are NOT the same. BUG_ON() is more reliably
available.

And that's the essence of my object to treating this as if you have
reliably stopped the kernel with a VM_BUG_ON(). It's not really the same!


thanks,
diff mbox series

Patch

diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..cf345fac6894 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -91,9 +91,12 @@  static inline bool page_evictable(struct page *page)
  */
 static inline void set_page_refcounted(struct page *page)
 {
+	int refcnt;
+
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	VM_BUG_ON_PAGE(page_ref_count(page), page);
-	set_page_count(page, 1);
+	refcnt = page_ref_inc_return(page);
+	VM_BUG_ON_PAGE(refcnt != 1, page);
 }
 
 extern unsigned long highest_memmap_pfn;