Message ID | 20211026173822.502506-4-pasha.tatashin@soleen.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Hardening page _refcount | expand |
On 10/26/21 10:38, Pasha Tatashin wrote: > set_page_refcounted() converts a non-refcounted page that has > (page->_refcount == 0) into a refcounted page by setting _refcount to 1, > > Use page_ref_inc_return() instead to avoid unconditionally overwriting > the _refcount value. > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> > --- > mm/internal.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/mm/internal.h b/mm/internal.h > index cf3cb933eba3..cf345fac6894 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page) > */ > static inline void set_page_refcounted(struct page *page) > { > + int refcnt; > + > VM_BUG_ON_PAGE(PageTail(page), page); > VM_BUG_ON_PAGE(page_ref_count(page), page); > - set_page_count(page, 1); > + refcnt = page_ref_inc_return(page); > + VM_BUG_ON_PAGE(refcnt != 1, page); Hi Pavel, I am acutely uncomfortable with this change, because it changes the meaning and behavior of the function to something completely different, while leaving the function name unchanged. Furthermore, in relies upon debug assertions, rather than a return value (for example) to verify that all is well. I understand where this patchset is going, but this intermediate step is not a good move. Also, for the overall series, if you want to change from "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to do that is *not* to depend solely on VM_BUG*() to verify. Instead, return something like -EBUSY if incrementing the value results in a surprise, and let the caller decide how to handle it. thanks,
On 10/26/21 10:53, John Hubbard wrote: > On 10/26/21 10:38, Pasha Tatashin wrote: >> set_page_refcounted() converts a non-refcounted page that has >> (page->_refcount == 0) into a refcounted page by setting _refcount to 1, >> >> Use page_ref_inc_return() instead to avoid unconditionally overwriting >> the _refcount value. >> >> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> >> --- >> mm/internal.h | 5 ++++- >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/mm/internal.h b/mm/internal.h >> index cf3cb933eba3..cf345fac6894 100644 >> --- a/mm/internal.h >> +++ b/mm/internal.h >> @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page) >> */ >> static inline void set_page_refcounted(struct page *page) >> { >> + int refcnt; >> + >> VM_BUG_ON_PAGE(PageTail(page), page); >> VM_BUG_ON_PAGE(page_ref_count(page), page); >> - set_page_count(page, 1); >> + refcnt = page_ref_inc_return(page); >> + VM_BUG_ON_PAGE(refcnt != 1, page); > > Hi Pavel, ohhh, s/Pavel/Pasha/ ! Huge apologies for the name mixup, I had just seen another email... very sorry about that mistake. thanks,
> > Hi Pavel, > > ohhh, s/Pavel/Pasha/ ! > > Huge apologies for the name mixup, I had just seen another email... > very sorry about that mistake. Hi John, Do not need to be sorry, Pavel == Pasha :) In Russian it is the same name; I've been trying to reduce the confusion, by converting everything to Pasha. Pasha
Hi John, Thank you for looking at this series. > > static inline void set_page_refcounted(struct page *page) > > { > > + int refcnt; > > + > > VM_BUG_ON_PAGE(PageTail(page), page); > > VM_BUG_ON_PAGE(page_ref_count(page), page); > > - set_page_count(page, 1); > > + refcnt = page_ref_inc_return(page); > > + VM_BUG_ON_PAGE(refcnt != 1, page); > I am acutely uncomfortable with this change, because it changes the > meaning and behavior of the function to something completely different, > while leaving the function name unchanged. Furthermore, in relies upon > debug assertions, rather than a return value (for example) to verify > that all is well. It must return the same thing, if it does not we have a bug in our kernel which may lead to memory corruptions and security holes. So today we have this: VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 < What if something modified here? Hmm..> set_page_count(page, 1); -> Yet we reset it to 1. With my proposed change: VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 refcnt = page_ref_inc_return(page); -> ref_count better be 1. VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1. > > I understand where this patchset is going, but this intermediate step is > not a good move. > > Also, for the overall series, if you want to change from > "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to > do that is *not* to depend solely on VM_BUG*() to verify. Instead, > return something like -EBUSY if incrementing the value results in a > surprise, and let the caller decide how to handle it. Actually, -EBUSY would be OK if the problems were because we failed to modify refcount for some reason, but if we modified refcount and got an unexpected value (i.e underflow/overflow) we better report it right away instead of waiting for memory corruption to happen. Thanks, Pasha
On 10/26/21 11:21, Pasha Tatashin wrote: > It must return the same thing, if it does not we have a bug in our > kernel which may lead to memory corruptions and security holes. > > So today we have this: > VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 > < What if something modified here? Hmm..> > set_page_count(page, 1); -> Yet we reset it to 1. > > With my proposed change: > VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 > refcnt = page_ref_inc_return(page); -> ref_count better be 1. > VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1. > Yes, you are just repeating what the diffs say. But it's still not good to have this function name doing something completely different than its name indicates. >> >> I understand where this patchset is going, but this intermediate step is >> not a good move. >> >> Also, for the overall series, if you want to change from >> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to >> do that is *not* to depend solely on VM_BUG*() to verify. Instead, >> return something like -EBUSY if incrementing the value results in a >> surprise, and let the caller decide how to handle it. > > Actually, -EBUSY would be OK if the problems were because we failed to > modify refcount for some reason, but if we modified refcount and got > an unexpected value (i.e underflow/overflow) we better report it right > away instead of waiting for memory corruption to happen. > Having the caller do the BUG() or VM_BUG*() is not a significant delay. thanks,
On Wed, Oct 27, 2021 at 1:12 AM John Hubbard <jhubbard@nvidia.com> wrote: > > On 10/26/21 11:21, Pasha Tatashin wrote: > > It must return the same thing, if it does not we have a bug in our > > kernel which may lead to memory corruptions and security holes. > > > > So today we have this: > > VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 > > < What if something modified here? Hmm..> > > set_page_count(page, 1); -> Yet we reset it to 1. > > > > With my proposed change: > > VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 > > refcnt = page_ref_inc_return(page); -> ref_count better be 1. > > VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1. > > > > Yes, you are just repeating what the diffs say. > > But it's still not good to have this function name doing something completely > different than its name indicates. I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? > > >> > >> I understand where this patchset is going, but this intermediate step is > >> not a good move. > >> > >> Also, for the overall series, if you want to change from > >> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to > >> do that is *not* to depend solely on VM_BUG*() to verify. Instead, > >> return something like -EBUSY if incrementing the value results in a > >> surprise, and let the caller decide how to handle it. > > > > Actually, -EBUSY would be OK if the problems were because we failed to > > modify refcount for some reason, but if we modified refcount and got > > an unexpected value (i.e underflow/overflow) we better report it right > > away instead of waiting for memory corruption to happen. > > > > Having the caller do the BUG() or VM_BUG*() is not a significant delay. We cannot guarantee that new callers in the future will check return values, the idea behind this work is to ensure that we are always protected from refcount underflow/overflow and invalid refcount modifications by set_refcount. Pasha
On 10/27/21 11:27, Pasha Tatashin wrote: > On Wed, Oct 27, 2021 at 1:12 AM John Hubbard <jhubbard@nvidia.com> wrote: >> >> On 10/26/21 11:21, Pasha Tatashin wrote: >>> It must return the same thing, if it does not we have a bug in our >>> kernel which may lead to memory corruptions and security holes. >>> >>> So today we have this: >>> VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 >>> < What if something modified here? Hmm..> >>> set_page_count(page, 1); -> Yet we reset it to 1. >>> >>> With my proposed change: >>> VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0 >>> refcnt = page_ref_inc_return(page); -> ref_count better be 1. >>> VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1. >>> >> >> Yes, you are just repeating what the diffs say. >> >> But it's still not good to have this function name doing something completely >> different than its name indicates. > > I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? > What? No, that's not where I was going at all. The function is already named set_page_refcounted(), and one of the problems I see is that your changes turn it into something that most certainly does not set_page_refounted(). Instead, this patch *increments* the refcount. That is not the same thing. And then it uses a .config-sensitive assertion to "prevent" problems. And by that I mean, the wording throughout this series seems to equate VM_BUG_ON_PAGE() assertions with real assertions. They are only active, however, in CONFIG_DEBUG_VM configurations, and provide no protection at all for normal (most distros) users. That's something that the wording, comments, and even design should be tweaked to account for. >> >>>> >>>> I understand where this patchset is going, but this intermediate step is >>>> not a good move. >>>> >>>> Also, for the overall series, if you want to change from >>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to >>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead, >>>> return something like -EBUSY if incrementing the value results in a >>>> surprise, and let the caller decide how to handle it. >>> >>> Actually, -EBUSY would be OK if the problems were because we failed to >>> modify refcount for some reason, but if we modified refcount and got >>> an unexpected value (i.e underflow/overflow) we better report it right >>> away instead of waiting for memory corruption to happen. >>> >> >> Having the caller do the BUG() or VM_BUG*() is not a significant delay. > > We cannot guarantee that new callers in the future will check return > values, the idea behind this work is to ensure that we are always > protected from refcount underflow/overflow and invalid refcount > modifications by set_refcount. > I don't have a problem with putting assertions closest to where they should fire. That's a good thing. I'm looking here for ways to fix up the problems listed in the points above, though. And I do want to point out another thing, though, and that is: generally, we don't have to program to quite the level of defensiveness you seem to be at. If return values must be checked, they usually are in the kernel--and we even have tooling to enforce it: /* * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-warn_005funused_005fresult-function-attribute * clang: https://clang.llvm.org/docs/AttributeReference.html#nodiscard-warn-unused-result */ #define __must_check __attribute__((__warn_unused_result__)) Please take that into consideration when weighing tradeoffs, just sort of in general. thanks,
On 10/27/21 18:20, John Hubbard wrote: >>> But it's still not good to have this function name doing something completely >>> different than its name indicates. >> >> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? >> > > What? No, that's not where I was going at all. The function is already > named set_page_refcounted(), and one of the problems I see is that your > changes turn it into something that most certainly does not > set_page_refounted(). Instead, this patch *increments* the refcount. > That is not the same thing. > > And then it uses a .config-sensitive assertion to "prevent" problems. > And by that I mean, the wording throughout this series seems to equate > VM_BUG_ON_PAGE() assertions with real assertions. They are only active, > however, in CONFIG_DEBUG_VM configurations, and provide no protection at > all for normal (most distros) users. That's something that the wording, > comments, and even design should be tweaked to account for. ...and to clarify a bit more, maybe this also helps: These patches are attempting to improve debugging, and that is fine, as far as debugging goes. However, a point that seems to be slightly misunderstood is: incrementing a bad refcount value is not actually any better than overwriting it, from a recovery point of view. Maybe (?) it's better from a debugging point of view. That's because the problem occurred before this code, and its debug-only assertions, ran. Once here, the code cannot actually recover: there is no automatic way to recover from a refcount that it 1, -1, 2, or 706, when it was supposed to be zero. Incrementing it is, again, not really necessarily better than setting: setting it might actually make the broken system appear to run--and in some cases, even avoid symptoms. Whereas incrementing doesn't cover anything up. The only thing you can really does is just panic() or BUG(), really. Don't get me wrong, I don't want bugs covered up. But the claim that incrementing is somehow better deserves some actual thinking about it. Overall, I'm inclined to *not* switch anything over to incrementing the refcounts. Instead, go ahead and: a) Add assertions up to a "reasonable" point (some others have pointed out that you don't need quite all of the assertions you've added). b) Remove set_page_count() calls where possible--maybe everywhere. c) Fix any bugs found along the way. d) ...but, leave the basic logic as-is: no changing over to page_ref_inc_return(). Anyway, that's my take on it. thanks,
> >> Yes, you are just repeating what the diffs say. > >> > >> But it's still not good to have this function name doing something completely > >> different than its name indicates. > > > > I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? > > > > What? No, that's not where I was going at all. The function is already > named set_page_refcounted(), and one of the problems I see is that your > changes turn it into something that most certainly does not > set_page_refounted(). Instead, this patch *increments* the refcount. > That is not the same thing. > > And then it uses a .config-sensitive assertion to "prevent" problems. > And by that I mean, the wording throughout this series seems to equate > VM_BUG_ON_PAGE() assertions with real assertions. They are only active, > however, in CONFIG_DEBUG_VM configurations, and provide no protection at > all for normal (most distros) users. That's something that the wording, > comments, and even design should be tweaked to account for. VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config sensitive, but in both cases *BUG_ON() means that there is an unrecoverable problem that occured. The only difference between the two is that VM_BUG_ON() is not enabled when distros decide to reduce the size of their kernel and improve runtime performance by skipping some extra checking. There is no logical separation between VM_BUG_ON and BUG_ON, there is been a lengthy discussion about this: https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com/ "so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It just allows some people to build smaller kernels, but apparently distro people would rather have debugging than save a few kB of RAM." Losing control of ref_count is an unrecoverable problem because it leads to security sensitive memory corruptions. It is better to crash the kernel when that happens instead of ending up with some pages mapped into the wrong address space. The races are tricky to spot, but set_page_count() is inherently dangerous, so I am removing it entirely and replacing it with safer operations which do the same thing. One example is this: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29 > >>>> I understand where this patchset is going, but this intermediate step is > >>>> not a good move. > >>>> > >>>> Also, for the overall series, if you want to change from > >>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to > >>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead, > >>>> return something like -EBUSY if incrementing the value results in a > >>>> surprise, and let the caller decide how to handle it. In set_page_refcounted() we already have: VM_BUG_ON_PAGE(page_ref_count(page), page); set_page_count(page, 1); I am pointing out that above code is racy: Between the check VM_BUG_ON_PAGE() check and unconditional set to 1 the value of page->_refcount can change. I am replacing it with an identical version of code that is not racy. There is no need to complicate the code by introducing new -EBUSY returns here, as it would reduce the fragility of this could even farther. > >>> Actually, -EBUSY would be OK if the problems were because we failed to I am not sure -EBUSY would be OK here, it means we had a race which we were not aware about, and which could have led to memory corruptions. > >>> modify refcount for some reason, but if we modified refcount and got > >>> an unexpected value (i.e underflow/overflow) we better report it right > >>> away instead of waiting for memory corruption to happen. > >>> > >> > >> Having the caller do the BUG() or VM_BUG*() is not a significant delay. I agree, however, helper functions exist to remove code duplications. If we must verify the assumption of set_page_refcounted() that non counted page is turned into a counted page, it is better to do it in one place than at every call site. We do it today in thus helper function, I do not see why we would change that.
On Wed, Oct 27, 2021 at 9:35 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 10/27/21 18:20, John Hubbard wrote: > >>> But it's still not good to have this function name doing something completely > >>> different than its name indicates. > >> > >> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? > >> > > > > What? No, that's not where I was going at all. The function is already > > named set_page_refcounted(), and one of the problems I see is that your > > changes turn it into something that most certainly does not > > set_page_refounted(). Instead, this patch *increments* the refcount. > > That is not the same thing. > > > > And then it uses a .config-sensitive assertion to "prevent" problems. > > And by that I mean, the wording throughout this series seems to equate > > VM_BUG_ON_PAGE() assertions with real assertions. They are only active, > > however, in CONFIG_DEBUG_VM configurations, and provide no protection at > > all for normal (most distros) users. That's something that the wording, > > comments, and even design should be tweaked to account for. > > ...and to clarify a bit more, maybe this also helps: > > These patches are attempting to improve debugging, and that is fine, as They are attempting to catch potentioal race conditions where _refcount is changed between the time we verified what it was and we set it to something else. They also attempt to prevent overflows and underflows bugs which are not all tested today, but can be tested with this patch set at least on kernels where DEBUG_VM is enabled. > far as debugging goes. However, a point that seems to be slightly > misunderstood is: incrementing a bad refcount value is not actually any > better than overwriting it, from a recovery point of view. Maybe (?) > it's better from a debugging point of view. It is better for debugging as well: if one is tracing the page _refcount history, knowing that the _refcount can only be incremented/decremented/frozen/unfrozen provides a contiguous history of refcount that can be tracked. In case when we set refcount in some places as we do today, the contigous history is lost, as we do not know the actual _refcount value at the time of the set operation. > > That's because the problem occurred before this code, and its debug-only > assertions, ran. Once here, the code cannot actually recover: there is > no automatic way to recover from a refcount that it 1, -1, 2, or 706, > when it was supposed to be zero. Incrementing it is, again, not really > necessarily better than setting: setting it might actually make the > broken system appear to run--and in some cases, even avoid symptoms. > Whereas incrementing doesn't cover anything up. The only thing you can > really does is just panic() or BUG(), really. This is what my patch series attempt to do, I chose to use VM_BUG() instead of BUG() because this is VM code, and avoid potential performance regressions for those who chose performance over possible security implications. > > Don't get me wrong, I don't want bugs covered up. But the claim that > incrementing is somehow better deserves some actual thinking about it. I think it does, I described my points above, if you still disagree please let me know. Thank you for providing your thoughts on this RFC, I will send out a new version, and we can continue discussion in the new thread. Pasha
On 11/1/21 07:22, Pasha Tatashin wrote: >>>> Yes, you are just repeating what the diffs say. >>>> >>>> But it's still not good to have this function name doing something completely >>>> different than its name indicates. >>> >>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? >>> >> >> What? No, that's not where I was going at all. The function is already >> named set_page_refcounted(), and one of the problems I see is that your >> changes turn it into something that most certainly does not >> set_page_refounted(). Instead, this patch *increments* the refcount. >> That is not the same thing. >> >> And then it uses a .config-sensitive assertion to "prevent" problems. >> And by that I mean, the wording throughout this series seems to equate >> VM_BUG_ON_PAGE() assertions with real assertions. They are only active, >> however, in CONFIG_DEBUG_VM configurations, and provide no protection at >> all for normal (most distros) users. That's something that the wording, >> comments, and even design should be tweaked to account for. > > VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config > sensitive, but in both cases *BUG_ON() means that there is an > unrecoverable problem that occured. The only difference between the > two is that VM_BUG_ON() is not enabled when distros decide to reduce > the size of their kernel and improve runtime performance by skipping > some extra checking. > > There is no logical separation between VM_BUG_ON and BUG_ON, there is > been a lengthy discussion about this: > > https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com/ > "so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It > just allows some people to build smaller kernels, but apparently > distro people would rather have debugging than save a few kB of RAM." > > Losing control of ref_count is an unrecoverable problem because it > leads to security sensitive memory corruptions. It is better to crash > the kernel when that happens instead of ending up with some pages > mapped into the wrong address space. > > The races are tricky to spot, but set_page_count() is inherently > dangerous, so I am removing it entirely and replacing it with safer > operations which do the same thing. > > One example is this: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29 I don't think we have any disagreement about the landscape here. But it's much easier to describe the problem than it is to fix it--as always. And repeating the problem doesn't make a proposed fix more (or less) appropriate. :) > >>>>>> I understand where this patchset is going, but this intermediate step is >>>>>> not a good move. >>>>>> >>>>>> Also, for the overall series, if you want to change from >>>>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to >>>>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead, >>>>>> return something like -EBUSY if incrementing the value results in a >>>>>> surprise, and let the caller decide how to handle it. > > In set_page_refcounted() we already have: > > VM_BUG_ON_PAGE(page_ref_count(page), page); > set_page_count(page, 1); > > I am pointing out that above code is racy: > > Between the check VM_BUG_ON_PAGE() check and unconditional set to 1 > the value of page->_refcount can change. > > I am replacing it with an identical version of code that is not racy. And I'm pointing out that raciness is not the real bug, or at least, not the only bug. "Fixing" the race does not fix the code, but the patch series seems to imply that it does. > There is no need to complicate the code by introducing new -EBUSY > returns here, as it would reduce the fragility of this could even > farther. > >>>>> Actually, -EBUSY would be OK if the problems were because we failed to > > I am not sure -EBUSY would be OK here, it means we had a race which we > were not aware about, and which could have led to memory corruptions. > >>>>> modify refcount for some reason, but if we modified refcount and got >>>>> an unexpected value (i.e underflow/overflow) we better report it right >>>>> away instead of waiting for memory corruption to happen. >>>>> >>>> >>>> Having the caller do the BUG() or VM_BUG*() is not a significant delay. > > I agree, however, helper functions exist to remove code duplications. > If we must verify the assumption of set_page_refcounted() that non > counted page is turned into a counted page, it is better to do it in > one place than at every call site. We do it today in thus helper > function, I do not see why we would change that. > Let's ignore this -EBUSY idea for now, because I'm not sure where you are going with your next version, and maybe it won't even come up. thanks,
On 11/1/21 07:30, Pasha Tatashin wrote: > On Wed, Oct 27, 2021 at 9:35 PM John Hubbard <jhubbard@nvidia.com> wrote: >> >> On 10/27/21 18:20, John Hubbard wrote: >>>>> But it's still not good to have this function name doing something completely >>>>> different than its name indicates. >>>> >>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? >>>> >>> >>> What? No, that's not where I was going at all. The function is already >>> named set_page_refcounted(), and one of the problems I see is that your >>> changes turn it into something that most certainly does not >>> set_page_refounted(). Instead, this patch *increments* the refcount. >>> That is not the same thing. >>> >>> And then it uses a .config-sensitive assertion to "prevent" problems. >>> And by that I mean, the wording throughout this series seems to equate >>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active, >>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at >>> all for normal (most distros) users. That's something that the wording, >>> comments, and even design should be tweaked to account for. >> >> ...and to clarify a bit more, maybe this also helps: >> >> These patches are attempting to improve debugging, and that is fine, as > > They are attempting to catch potentioal race conditions where > _refcount is changed between the time we verified what it was and we > set it to something else. > > They also attempt to prevent overflows and underflows bugs which are > not all tested today, but can be tested with this patch set at least > on kernels where DEBUG_VM is enabled. OK, but did you get my point about the naming problem? > >> far as debugging goes. However, a point that seems to be slightly >> misunderstood is: incrementing a bad refcount value is not actually any >> better than overwriting it, from a recovery point of view. Maybe (?) >> it's better from a debugging point of view. > > It is better for debugging as well: if one is tracing the page > _refcount history, knowing that the _refcount can only be > incremented/decremented/frozen/unfrozen provides a contiguous history > of refcount that can be tracked. In case when we set refcount in some > places as we do today, the contigous history is lost, as we do not > know the actual _refcount value at the time of the set operation. > OK, that is a reasonable argument. Let's put it somewhere, maybe in a comment block, if it's not already there. >> >> That's because the problem occurred before this code, and its debug-only >> assertions, ran. Once here, the code cannot actually recover: there is >> no automatic way to recover from a refcount that it 1, -1, 2, or 706, >> when it was supposed to be zero. Incrementing it is, again, not really >> necessarily better than setting: setting it might actually make the >> broken system appear to run--and in some cases, even avoid symptoms. >> Whereas incrementing doesn't cover anything up. The only thing you can >> really does is just panic() or BUG(), really. > > This is what my patch series attempt to do, I chose to use VM_BUG() > instead of BUG() because this is VM code, and avoid potential > performance regressions for those who chose performance over possible > security implications. Yes, the VM_BUG() vs. BUG() is awkward. But you cannot rely on VM_BUG() to stop the system, even if Fedora does turn it on. > >> >> Don't get me wrong, I don't want bugs covered up. But the claim that >> incrementing is somehow better deserves some actual thinking about it. > > I think it does, I described my points above, if you still disagree > please let me know. > > Thank you for providing your thoughts on this RFC, I will send out a > new version, and we can continue discussion in the new thread. > > Pasha > Yes, let's see what it looks like. thanks,
On 11/1/21 07:22, Pasha Tatashin wrote: >>>> Yes, you are just repeating what the diffs say. >>>> >>>> But it's still not good to have this function name doing something completely >>>> different than its name indicates. >>> >>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ? >>> >> >> What? No, that's not where I was going at all. The function is already >> named set_page_refcounted(), and one of the problems I see is that your >> changes turn it into something that most certainly does not >> set_page_refounted(). Instead, this patch *increments* the refcount. >> That is not the same thing. >> >> And then it uses a .config-sensitive assertion to "prevent" problems. >> And by that I mean, the wording throughout this series seems to equate >> VM_BUG_ON_PAGE() assertions with real assertions. They are only active, >> however, in CONFIG_DEBUG_VM configurations, and provide no protection at >> all for normal (most distros) users. That's something that the wording, >> comments, and even design should be tweaked to account for. > > VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config > sensitive, but in both cases *BUG_ON() means that there is an > unrecoverable problem that occured. The only difference between the > two is that VM_BUG_ON() is not enabled when distros decide to reduce > the size of their kernel and improve runtime performance by skipping > some extra checking. > > There is no logical separation between VM_BUG_ON and BUG_ON, there is > been a lengthy discussion about this: Actually I do want to mention one more thing about this, before we move on to the next version of the patchset. The above is inaccurate. The intent of VM_BUG_ON() and BUG_ON() is similar, but there is *definitely* a logical separation between them: they do not behave the same at runtime. Just because some distros enable VM_BUG_ON(), does not mean that we can treat it the same as BUG_ON() in "both directions". From a "don't BUG() crash the kernel unless really warranted", they are about the same, as Linus keeps repeating. From the other direction, though ("I need to BUG()- crash the kernel"), they are NOT the same. BUG_ON() is more reliably available. And that's the essence of my object to treating this as if you have reliably stopped the kernel with a VM_BUG_ON(). It's not really the same! thanks,
diff --git a/mm/internal.h b/mm/internal.h index cf3cb933eba3..cf345fac6894 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page) */ static inline void set_page_refcounted(struct page *page) { + int refcnt; + VM_BUG_ON_PAGE(PageTail(page), page); VM_BUG_ON_PAGE(page_ref_count(page), page); - set_page_count(page, 1); + refcnt = page_ref_inc_return(page); + VM_BUG_ON_PAGE(refcnt != 1, page); } extern unsigned long highest_memmap_pfn;
set_page_refcounted() converts a non-refcounted page that has (page->_refcount == 0) into a refcounted page by setting _refcount to 1, Use page_ref_inc_return() instead to avoid unconditionally overwriting the _refcount value. Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> --- mm/internal.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)