mbox series

[RFC,0/8] Hardening page _refcount

Message ID 20211026173822.502506-1-pasha.tatashin@soleen.com (mailing list archive)
Headers show
Series Hardening page _refcount | expand

Message

Pasha Tatashin Oct. 26, 2021, 5:38 p.m. UTC
It is hard to root cause _refcount problems, because they usually
manifest after the damage has occurred.  Yet, they can lead to
catastrophic failures such memory corruptions.

Improve debugability by adding more checks that ensure that
page->_refcount never turns negative (i.e. double free does not
happen, or free after freeze etc).

- Check for overflow and underflow right from the functions that
  modify _refcount
- Remove set_page_count(), so we do not unconditionally overwrite
  _refcount with an unrestrained value
- Trace return values in all functions that modify _refcount

Applies against v5.15-rc7. Boot tested in QEMU.

Pasha Tatashin (8):
  mm: add overflow and underflow checks for page->_refcount
  mm/hugetlb: remove useless set_page_count()
  mm: Avoid using set_page_count() in set_page_recounted()
  mm: remove set_page_count() from page_frag_alloc_align
  mm: avoid using set_page_count() when pages are freed into allocator
  mm: rename init_page_count() -> page_ref_init()
  mm: remove set_page_count()
  mm: simplify page_ref_* functions

 arch/m68k/mm/motorola.c         |   2 +-
 include/linux/mm.h              |   2 +-
 include/linux/page_ref.h        | 116 ++++++++++++++++----------------
 include/trace/events/page_ref.h |  66 +++++++++++-------
 mm/debug_page_ref.c             |  22 ++----
 mm/hugetlb.c                    |   2 +-
 mm/internal.h                   |   5 +-
 mm/page_alloc.c                 |  19 ++++--
 8 files changed, 125 insertions(+), 109 deletions(-)

Comments

Matthew Wilcox Oct. 26, 2021, 6:23 p.m. UTC | #1
On Tue, Oct 26, 2021 at 05:38:14PM +0000, Pasha Tatashin wrote:
> It is hard to root cause _refcount problems, because they usually
> manifest after the damage has occurred.  Yet, they can lead to
> catastrophic failures such memory corruptions.
> 
> Improve debugability by adding more checks that ensure that
> page->_refcount never turns negative (i.e. double free does not
> happen, or free after freeze etc).
> 
> - Check for overflow and underflow right from the functions that
>   modify _refcount
> - Remove set_page_count(), so we do not unconditionally overwrite
>   _refcount with an unrestrained value
> - Trace return values in all functions that modify _refcount

I think this is overkill.  Won't we get exactly the same protection
by simply testing that page->_refcount == 0 in set_page_count()?
Anything which triggers that BUG_ON would already be buggy because
it can race with speculative gets.
Pasha Tatashin Oct. 26, 2021, 6:30 p.m. UTC | #2
On Tue, Oct 26, 2021 at 2:24 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 26, 2021 at 05:38:14PM +0000, Pasha Tatashin wrote:
> > It is hard to root cause _refcount problems, because they usually
> > manifest after the damage has occurred.  Yet, they can lead to
> > catastrophic failures such memory corruptions.
> >
> > Improve debugability by adding more checks that ensure that
> > page->_refcount never turns negative (i.e. double free does not
> > happen, or free after freeze etc).
> >
> > - Check for overflow and underflow right from the functions that
> >   modify _refcount
> > - Remove set_page_count(), so we do not unconditionally overwrite
> >   _refcount with an unrestrained value
> > - Trace return values in all functions that modify _refcount
>
> I think this is overkill.  Won't we get exactly the same protection
> by simply testing that page->_refcount == 0 in set_page_count()?
> Anything which triggers that BUG_ON would already be buggy because
> it can race with speculative gets.

We can't because set_page_count(v) is used for
1. changing _refcount form a current value to unconstrained v
2.  initialize _refcount from undefined state to v.

In this work we forbid the first case, and reduce the second case to
initialize only to 1.

Pasha
Matthew Wilcox Oct. 26, 2021, 8:13 p.m. UTC | #3
On Tue, Oct 26, 2021 at 02:30:25PM -0400, Pasha Tatashin wrote:
> On Tue, Oct 26, 2021 at 2:24 PM Matthew Wilcox <willy@infradead.org> wrote:
> > I think this is overkill.  Won't we get exactly the same protection
> > by simply testing that page->_refcount == 0 in set_page_count()?
> > Anything which triggers that BUG_ON would already be buggy because
> > it can race with speculative gets.
> 
> We can't because set_page_count(v) is used for
> 1. changing _refcount form a current value to unconstrained v
> 2.  initialize _refcount from undefined state to v.
> 
> In this work we forbid the first case, and reduce the second case to
> initialize only to 1.

Anything that is calling set_page_refcount() on something which is
not 0 is buggy today.  There are several ways to increment the page
refcount speculatively if it is not 0.  eg lockless GUP and page cache
reads.  So we could have:

CPU 0: alloc_page() (refcount now 1)
CPU 1: get_page_unless_zero() (refcount now 2)
CPU 0: set_page_refcount(5) (refcount now 5)
CPU 1: put_page() (refcount now 4)

Now the refcount is wrong.  So it is *only* safe to call
set_page_refcount() if the refcount is 0.  If you can find somewhere
that's calling set_page_refcount() on a non-0 refcount, that's a bug
that needs to be fixed.
Pasha Tatashin Oct. 26, 2021, 9:24 p.m. UTC | #4
On Tue, Oct 26, 2021 at 4:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Oct 26, 2021 at 02:30:25PM -0400, Pasha Tatashin wrote:
> > On Tue, Oct 26, 2021 at 2:24 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > I think this is overkill.  Won't we get exactly the same protection
> > > by simply testing that page->_refcount == 0 in set_page_count()?
> > > Anything which triggers that BUG_ON would already be buggy because
> > > it can race with speculative gets.
> >
> > We can't because set_page_count(v) is used for
> > 1. changing _refcount form a current value to unconstrained v
> > 2.  initialize _refcount from undefined state to v.
> >
> > In this work we forbid the first case, and reduce the second case to
> > initialize only to 1.
>
> Anything that is calling set_page_refcount() on something which is
> not 0 is buggy today

For performance reasons the memblock allocator does not zero struct
page memory, we initialize every field in struct page individually,
that includes page->_refcount. Most of the time the boot memory is
zeroed by firmware, but it is not guaranteed, non-zero pages can come
from bootloader, or from freed firmware pages. Also, after kexec
memory state is not zeroed as well, and struct pages can contain
garbage until fields are initialized.

This is a valid reason to do set_page_recount() with non-zero
refcounts, but the function is too generic, as in this case we really
need to set it only to 1: so instead rename it to page_ref_init() and
set it only to 1.

Another example:
In __free_pages_core() and in init_cma_reserved_pageblock() we do
set_page_refcount() when _ref_count is 1 and we change it to 0.

In this case doing dec_return check makes more sense in order to
verify that the ref_count was indeed 1, and we won't have a double
free.

>  There are several ways to increment the page
> refcount speculatively if it is not 0.  eg lockless GUP and page cache
> reads.  So we could have:
>
> CPU 0: alloc_page() (refcount now 1)
> CPU 1: get_page_unless_zero() (refcount now 2)
> CPU 0: set_page_refcount(5) (refcount now 5)
> CPU 1: put_page() (refcount now 4)
>
> Now the refcount is wrong.  So it is *only* safe to call
> set_page_refcount() if the refcount is 0.  If you can find somewhere
> that's calling set_page_refcount() on a non-0 refcount, that's a bug
> that needs to be fixed.

Right, add_return/sub_return() with check after operation ensures that
there are no races where we could have some writes to refcount between
knowing it is 0 and calling set_page_refcount().

Thanks,
Pasha