Message ID | 20220126095557.32392-6-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: COW fixes part 1: fix the COW security issue for THP and swap | expand |
On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote: > > We currently have a different COW logic for anon THP than we have for > ordinary anon pages in do_wp_page(): the effect is that the issue reported > in CVE-2020-29374 is currently still possible for anon THP: an unintended > information leak from the parent to the child. > > Let's apply the same logic (page_count() == 1), with similar > optimizations to remove additional references first as we really want to > avoid PTE-mapping the THP and copying individual pages best we can. > > If we end up with a page that has page_count() != 1, we'll have to PTE-map > the THP and fallback to do_wp_page(), which will always copy the page. > > Note that KSM does not apply to THP. > > I. Interaction with the swapcache and writeback > > While a THP is in the swapcache, the swapcache holds one reference on each > subpage of the THP. So with PageSwapCache() set, we expect as many > additional references as we have subpages. If we manage to remove the > THP from the swapcache, all these references will be gone. > > Usually, a THP is not split when entered into the swapcache and stays a > compound page. However, try_to_unmap() will PTE-map the THP and use PTE > swap entries. There are no PMD swap entries for that purpose, consequently, > we always only swapin subpages into PTEs. > > Removing a page from the swapcache can fail either when there are remaining > swap entries (in which case COW is the right thing to do) or if the page is > currently under writeback. > > Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be > possible only in corner cases, for example, if try_to_unmap() failed > after adding the page to the swapcache. However, it's comparatively easy to > handle. > > As we have to fully unmap a THP before starting writeback, and swapin is > always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the > swapcache that is under writeback. This should at least leave writeback > out of the picture. > > II. Interaction with GUP references > > Having a R/O PMD-mapped THP with GUP references (i.e., R/O references) > will result in PTE-mapping the THP on a write fault. Similar to ordinary > anon pages, do_wp_page() will have to copy sub-pages and result in a > disconnect between the GUP references and the pages actually mapped into > the page tables. To improve the situation in the future, we'll need > additional handling to mark anonymous pages as definitely exclusive to a > single process, only allow GUP pins on exclusive anon pages, and > disallow sharing of exclusive anon pages with GUP pins e.g., during > fork(). > > III. Interaction with references from LRU pagevecs > > Similar to ordinary anon pages, we can have LRU pagevecs referencing our > THP. Reliably removing such references requires draining LRU pagevecs on > all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can > sleep. For now, similar do do_wp_page(), let's conditionally drain the > local LRU pagevecs only if we detect !PageLRU(). > > IV. Interaction with speculative/temporary references > > Similar to ordinary anon pages, other speculative/temporary references on > the THP, for example, from the pagecache or page migration code, will > disallow exclusive reuse of the page. We'll have to PTE-map the THP. > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > mm/huge_memory.c | 19 +++++++++++++++---- > 1 file changed, 15 insertions(+), 4 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 406a3c28c026..b6ba88a98266 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > struct page *page; > unsigned long haddr = vmf->address & HPAGE_PMD_MASK; > pmd_t orig_pmd = vmf->orig_pmd; > + int swapcache_refs = 0; > > vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); > VM_BUG_ON_VMA(!vma->anon_vma, vma); > @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > page = pmd_page(orig_pmd); > VM_BUG_ON_PAGE(!PageHead(page), page); > > - /* Lock page for reuse_swap_page() */ > if (!trylock_page(page)) { > get_page(page); > spin_unlock(vmf->ptl); > @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > } > > /* > - * We can only reuse the page if nobody else maps the huge page or it's > - * part. > + * See do_wp_page(): we can only map the page writable if there are > + * no additional references. > */ > - if (reuse_swap_page(page)) { > + if (PageSwapCache(page)) > + swapcache_refs = thp_nr_pages(page); > + if (page_count(page) > 1 + swapcache_refs + !PageLRU(page)) > + goto unlock_fallback; > + if (!PageLRU(page)) > + lru_add_drain(); IMHO, draining lru doesn't help out too much for THP since THP will be drained to LRU immediately once it is added into pagevec. > + if (page_count(page) > 1 + swapcache_refs) > + goto unlock_fallback; > + if (swapcache_refs) > + try_to_free_swap(page); > + if (page_count(page) == 1) { > pmd_t entry; > entry = pmd_mkyoung(orig_pmd); > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > @@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) > return VM_FAULT_WRITE; > } > > +unlock_fallback: > unlock_page(page); > spin_unlock(vmf->ptl); > fallback: > -- > 2.34.1 >
On 26.01.22 21:36, Yang Shi wrote: > On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote: >> >> We currently have a different COW logic for anon THP than we have for >> ordinary anon pages in do_wp_page(): the effect is that the issue reported >> in CVE-2020-29374 is currently still possible for anon THP: an unintended >> information leak from the parent to the child. >> >> Let's apply the same logic (page_count() == 1), with similar >> optimizations to remove additional references first as we really want to >> avoid PTE-mapping the THP and copying individual pages best we can. >> >> If we end up with a page that has page_count() != 1, we'll have to PTE-map >> the THP and fallback to do_wp_page(), which will always copy the page. >> >> Note that KSM does not apply to THP. >> >> I. Interaction with the swapcache and writeback >> >> While a THP is in the swapcache, the swapcache holds one reference on each >> subpage of the THP. So with PageSwapCache() set, we expect as many >> additional references as we have subpages. If we manage to remove the >> THP from the swapcache, all these references will be gone. >> >> Usually, a THP is not split when entered into the swapcache and stays a >> compound page. However, try_to_unmap() will PTE-map the THP and use PTE >> swap entries. There are no PMD swap entries for that purpose, consequently, >> we always only swapin subpages into PTEs. >> >> Removing a page from the swapcache can fail either when there are remaining >> swap entries (in which case COW is the right thing to do) or if the page is >> currently under writeback. >> >> Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be >> possible only in corner cases, for example, if try_to_unmap() failed >> after adding the page to the swapcache. However, it's comparatively easy to >> handle. >> >> As we have to fully unmap a THP before starting writeback, and swapin is >> always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the >> swapcache that is under writeback. This should at least leave writeback >> out of the picture. >> >> II. Interaction with GUP references >> >> Having a R/O PMD-mapped THP with GUP references (i.e., R/O references) >> will result in PTE-mapping the THP on a write fault. Similar to ordinary >> anon pages, do_wp_page() will have to copy sub-pages and result in a >> disconnect between the GUP references and the pages actually mapped into >> the page tables. To improve the situation in the future, we'll need >> additional handling to mark anonymous pages as definitely exclusive to a >> single process, only allow GUP pins on exclusive anon pages, and >> disallow sharing of exclusive anon pages with GUP pins e.g., during >> fork(). >> >> III. Interaction with references from LRU pagevecs >> >> Similar to ordinary anon pages, we can have LRU pagevecs referencing our >> THP. Reliably removing such references requires draining LRU pagevecs on >> all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can >> sleep. For now, similar do do_wp_page(), let's conditionally drain the >> local LRU pagevecs only if we detect !PageLRU(). >> >> IV. Interaction with speculative/temporary references >> >> Similar to ordinary anon pages, other speculative/temporary references on >> the THP, for example, from the pagecache or page migration code, will >> disallow exclusive reuse of the page. We'll have to PTE-map the THP. >> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> mm/huge_memory.c | 19 +++++++++++++++---- >> 1 file changed, 15 insertions(+), 4 deletions(-) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 406a3c28c026..b6ba88a98266 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) >> struct page *page; >> unsigned long haddr = vmf->address & HPAGE_PMD_MASK; >> pmd_t orig_pmd = vmf->orig_pmd; >> + int swapcache_refs = 0; >> >> vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); >> VM_BUG_ON_VMA(!vma->anon_vma, vma); >> @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) >> page = pmd_page(orig_pmd); >> VM_BUG_ON_PAGE(!PageHead(page), page); >> >> - /* Lock page for reuse_swap_page() */ >> if (!trylock_page(page)) { >> get_page(page); >> spin_unlock(vmf->ptl); >> @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) >> } >> >> /* >> - * We can only reuse the page if nobody else maps the huge page or it's >> - * part. >> + * See do_wp_page(): we can only map the page writable if there are >> + * no additional references. >> */ >> - if (reuse_swap_page(page)) { >> + if (PageSwapCache(page)) >> + swapcache_refs = thp_nr_pages(page); >> + if (page_count(page) > 1 + swapcache_refs + !PageLRU(page)) >> + goto unlock_fallback; >> + if (!PageLRU(page)) >> + lru_add_drain(); > > IMHO, draining lru doesn't help out too much for THP since THP will be > drained to LRU immediately once it is added into pagevec. Oh, thanks, I think you're right. The interesting bit is static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page) { bool ret = false; if (!pagevec_add(pvec, page) || PageCompound(page) || lru_cache_disabled()) ret = true; return ret; } Which indeed drains after adding it to the pagevec. Will adjust the patch and update the description/comment accordingly, thanks!
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..b6ba88a98266 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; pmd_t orig_pmd = vmf->orig_pmd; + int swapcache_refs = 0; vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page = pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); - /* Lock page for reuse_swap_page() */ if (!trylock_page(page)) { get_page(page); spin_unlock(vmf->ptl); @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) } /* - * We can only reuse the page if nobody else maps the huge page or it's - * part. + * See do_wp_page(): we can only map the page writable if there are + * no additional references. */ - if (reuse_swap_page(page)) { + if (PageSwapCache(page)) + swapcache_refs = thp_nr_pages(page); + if (page_count(page) > 1 + swapcache_refs + !PageLRU(page)) + goto unlock_fallback; + if (!PageLRU(page)) + lru_add_drain(); + if (page_count(page) > 1 + swapcache_refs) + goto unlock_fallback; + if (swapcache_refs) + try_to_free_swap(page); + if (page_count(page) == 1) { pmd_t entry; entry = pmd_mkyoung(orig_pmd); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); @@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) return VM_FAULT_WRITE; } +unlock_fallback: unlock_page(page); spin_unlock(vmf->ptl); fallback:
We currently have a different COW logic for anon THP than we have for ordinary anon pages in do_wp_page(): the effect is that the issue reported in CVE-2020-29374 is currently still possible for anon THP: an unintended information leak from the parent to the child. Let's apply the same logic (page_count() == 1), with similar optimizations to remove additional references first as we really want to avoid PTE-mapping the THP and copying individual pages best we can. If we end up with a page that has page_count() != 1, we'll have to PTE-map the THP and fallback to do_wp_page(), which will always copy the page. Note that KSM does not apply to THP. I. Interaction with the swapcache and writeback While a THP is in the swapcache, the swapcache holds one reference on each subpage of the THP. So with PageSwapCache() set, we expect as many additional references as we have subpages. If we manage to remove the THP from the swapcache, all these references will be gone. Usually, a THP is not split when entered into the swapcache and stays a compound page. However, try_to_unmap() will PTE-map the THP and use PTE swap entries. There are no PMD swap entries for that purpose, consequently, we always only swapin subpages into PTEs. Removing a page from the swapcache can fail either when there are remaining swap entries (in which case COW is the right thing to do) or if the page is currently under writeback. Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be possible only in corner cases, for example, if try_to_unmap() failed after adding the page to the swapcache. However, it's comparatively easy to handle. As we have to fully unmap a THP before starting writeback, and swapin is always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the swapcache that is under writeback. This should at least leave writeback out of the picture. II. Interaction with GUP references Having a R/O PMD-mapped THP with GUP references (i.e., R/O references) will result in PTE-mapping the THP on a write fault. Similar to ordinary anon pages, do_wp_page() will have to copy sub-pages and result in a disconnect between the GUP references and the pages actually mapped into the page tables. To improve the situation in the future, we'll need additional handling to mark anonymous pages as definitely exclusive to a single process, only allow GUP pins on exclusive anon pages, and disallow sharing of exclusive anon pages with GUP pins e.g., during fork(). III. Interaction with references from LRU pagevecs Similar to ordinary anon pages, we can have LRU pagevecs referencing our THP. Reliably removing such references requires draining LRU pagevecs on all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can sleep. For now, similar do do_wp_page(), let's conditionally drain the local LRU pagevecs only if we detect !PageLRU(). IV. Interaction with speculative/temporary references Similar to ordinary anon pages, other speculative/temporary references on the THP, for example, from the pagecache or page migration code, will disallow exclusive reuse of the page. We'll have to PTE-map the THP. Signed-off-by: David Hildenbrand <david@redhat.com> --- mm/huge_memory.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-)