[v1,0/5] s390: page_mapcount(), page_has_private() and PG_arch_1

Message ID	20240404163642.1125529-1-david@redhat.com (mailing list archive)
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 715BE12BEAB for <kvm@vger.kernel.org>; Thu, 4 Apr 2024 16:37:14 +0000 (UTC) From: David Hildenbrand <david@redhat.com> To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, linux-s390@vger.kernel.org, kvm@vger.kernel.org, David Hildenbrand <david@redhat.com>, Matthew Wilcox <willy@infradead.org>, Heiko Carstens <hca@linux.ibm.com>, Vasily Gorbik <gor@linux.ibm.com>, Alexander Gordeev <agordeev@linux.ibm.com>, Christian Borntraeger <borntraeger@linux.ibm.com>, Sven Schnelle <svens@linux.ibm.com>, Janosch Frank <frankja@linux.ibm.com>, Claudio Imbrenda <imbrenda@linux.ibm.com>, Gerald Schaefer <gerald.schaefer@linux.ibm.com>, Thomas Huth <thuth@redhat.com> Subject: [PATCH v1 0/5] s390: page_mapcount(), page_has_private() and PG_arch_1 Date: Thu, 4 Apr 2024 18:36:37 +0200 Message-ID: <20240404163642.1125529-1-david@redhat.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	s390: page_mapcount(), page_has_private() and PG_arch_1 \| expand [v1,0/5] s390: page_mapcount(), page_has_private() and PG_arch_1 [v1,1/5] s390/uv: don't call wait_on_page_writeback() without a reference [v1,2/5] s390/uv: convert gmap_make_secure() to work on folios [v1,3/5] s390/uv: convert PG_arch_1 users to only work on small folios [v1,4/5] s390/uv: update PG_arch_1 comment [v1,5/5] s390/hugetlb: convert PG_arch_1 code to work on folio->flags

Message ID

20240404163642.1125529-1-david@redhat.com (mailing list archive)

Headers

From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org,
	linux-s390@vger.kernel.org,
	kvm@vger.kernel.org,
	David Hildenbrand <david@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Thomas Huth <thuth@redhat.com>
Subject: [PATCH v1 0/5] s390: page_mapcount(),
 page_has_private() and PG_arch_1
Date: Thu,  4 Apr 2024 18:36:37 +0200
Message-ID: <20240404163642.1125529-1-david@redhat.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

s390: page_mapcount(), page_has_private() and PG_arch_1 | expand

Message

David Hildenbrand April 4, 2024, 4:36 p.m. UTC

On my journey to remove page_mapcount(), I got hooked up on other folio
cleanups that Willy most certainly will enjoy.

This series removes the s390x usage of:
* page_mapcount() [patches WIP]
* page_has_private() [have patches to remove it]

... and makes PG_arch_1 only be set on folio->flags (i.e., never on tail
pages of large folios).

Further, one "easy" fix upfront.

... unfortunately there is one other issue I spotted that I am not
tackling in this series, because I am not 100% sure what we want to
do: the usage of page_ref_freeze()/folio_ref_freeze() in
make_folio_secure() is unsafe. :(

In make_folio_secure(), we're holding the folio lock, the mmap lock and
the PT lock. So we are protected against concurrent fork(), zap, GUP,
swapin, migration ... The page_ref_freeze()/ folio_ref_freeze() should
also block concurrent GUP-fast very reliably.

But if the folio is mapped into multiple page tables, we could see
concurrent zapping of the folio, a pagecache folios could get mapped/
accessed concurrent, we could see fork() sharing the page in another
process, GUP ... trying to adjust the folio refcount while we froze it.
Very bad.

For anonymous folios, it would likely be sufficient to check that
folio_mapcount() == 1. For pagecache folios, that's insufficient, likely
we would have to lock the pagecache. To handle folios mapped into
multiple page tables, we would have to do what
split_huge_page_to_list_to_order() does (temporary migration entries).

So it's a bit more involved, and I'll have to leave that to s390x folks to
figure out. There are othe reasonable cleanups I think, but I'll have to
focus on other stuff.

Compile tested, but not runtime tested, I'll appreiate some testing help
from people with UV access and experience.

Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Huth <thuth@redhat.com>

David Hildenbrand (5):
  s390/uv: don't call wait_on_page_writeback() without a reference
  s390/uv: convert gmap_make_secure() to work on folios
  s390/uv: convert PG_arch_1 users to only work on small folios
  s390/uv: update PG_arch_1 comment
  s390/hugetlb: convert PG_arch_1 code to work on folio->flags

 arch/s390/include/asm/page.h |   2 +
 arch/s390/kernel/uv.c        | 112 ++++++++++++++++++++++-------------
 arch/s390/mm/gmap.c          |   4 +-
 arch/s390/mm/hugetlbpage.c   |   8 +--
 4 files changed, 79 insertions(+), 47 deletions(-)

Comments

Matthew Wilcox April 5, 2024, 3:42 a.m. UTC | #1

On Thu, Apr 04, 2024 at 06:36:37PM +0200, David Hildenbrand wrote:
> On my journey to remove page_mapcount(), I got hooked up on other folio
> cleanups that Willy most certainly will enjoy.
> 
> This series removes the s390x usage of:
> * page_mapcount() [patches WIP]
> * page_has_private() [have patches to remove it]
> 
> ... and makes PG_arch_1 only be set on folio->flags (i.e., never on tail
> pages of large folios).
> 
> Further, one "easy" fix upfront.

Looks like you didn't see:

https://lore.kernel.org/linux-s390/20240322161149.2327518-1-willy@infradead.org/

> ... unfortunately there is one other issue I spotted that I am not
> tackling in this series, because I am not 100% sure what we want to
> do: the usage of page_ref_freeze()/folio_ref_freeze() in
> make_folio_secure() is unsafe. :(
> 
> In make_folio_secure(), we're holding the folio lock, the mmap lock and
> the PT lock. So we are protected against concurrent fork(), zap, GUP,
> swapin, migration ... The page_ref_freeze()/ folio_ref_freeze() should
> also block concurrent GUP-fast very reliably.
> 
> But if the folio is mapped into multiple page tables, we could see
> concurrent zapping of the folio, a pagecache folios could get mapped/
> accessed concurrent, we could see fork() sharing the page in another
> process, GUP ... trying to adjust the folio refcount while we froze it.
> Very bad.

Hmmm.  Why is that not then a problem for, eg, splitting or migrating?
Is it because they unmap first and then try to freeze?

> For anonymous folios, it would likely be sufficient to check that
> folio_mapcount() == 1. For pagecache folios, that's insufficient, likely
> we would have to lock the pagecache. To handle folios mapped into
> multiple page tables, we would have to do what
> split_huge_page_to_list_to_order() does (temporary migration entries).
> 
> So it's a bit more involved, and I'll have to leave that to s390x folks to
> figure out. There are othe reasonable cleanups I think, but I'll have to
> focus on other stuff.
> 
> Compile tested, but not runtime tested, I'll appreiate some testing help
> from people with UV access and experience.
> 
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Thomas Huth <thuth@redhat.com>
> 
> David Hildenbrand (5):
>   s390/uv: don't call wait_on_page_writeback() without a reference
>   s390/uv: convert gmap_make_secure() to work on folios
>   s390/uv: convert PG_arch_1 users to only work on small folios
>   s390/uv: update PG_arch_1 comment
>   s390/hugetlb: convert PG_arch_1 code to work on folio->flags
> 
>  arch/s390/include/asm/page.h |   2 +
>  arch/s390/kernel/uv.c        | 112 ++++++++++++++++++++++-------------
>  arch/s390/mm/gmap.c          |   4 +-
>  arch/s390/mm/hugetlbpage.c   |   8 +--
>  4 files changed, 79 insertions(+), 47 deletions(-)
> 
> -- 
> 2.44.0
>

David Hildenbrand April 5, 2024, 7:07 a.m. UTC | #2

On 05.04.24 05:42, Matthew Wilcox wrote:
> On Thu, Apr 04, 2024 at 06:36:37PM +0200, David Hildenbrand wrote:
>> On my journey to remove page_mapcount(), I got hooked up on other folio
>> cleanups that Willy most certainly will enjoy.
>>
>> This series removes the s390x usage of:
>> * page_mapcount() [patches WIP]
>> * page_has_private() [have patches to remove it]
>>
>> ... and makes PG_arch_1 only be set on folio->flags (i.e., never on tail
>> pages of large folios).
>>
>> Further, one "easy" fix upfront.
> 
> Looks like you didn't see:
> 
> https://lore.kernel.org/linux-s390/20240322161149.2327518-1-willy@infradead.org/

Yes, I only skimmed linux-mm.

I think s390x certainly wants to handle PTE-mapped THP in that code, I 
think there are ways to trigger that, we're just mostly lucky that it 
doesn't happen in the common case.

But thinking about it, the current page_mapcount() based check could not 
possibly have worked for them and rejected any PTE-mapped THP.

So I can just base my changes on top of yours (we might want to get the 
first fix in ahead of time).

> 
>> ... unfortunately there is one other issue I spotted that I am not
>> tackling in this series, because I am not 100% sure what we want to
>> do: the usage of page_ref_freeze()/folio_ref_freeze() in
>> make_folio_secure() is unsafe. :(
>>
>> In make_folio_secure(), we're holding the folio lock, the mmap lock and
>> the PT lock. So we are protected against concurrent fork(), zap, GUP,
>> swapin, migration ... The page_ref_freeze()/ folio_ref_freeze() should
>> also block concurrent GUP-fast very reliably.
>>
>> But if the folio is mapped into multiple page tables, we could see
>> concurrent zapping of the folio, a pagecache folios could get mapped/
>> accessed concurrent, we could see fork() sharing the page in another
>> process, GUP ... trying to adjust the folio refcount while we froze it.
>> Very bad.
> 
> Hmmm.  Why is that not then a problem for, eg, splitting or migrating?
> Is it because they unmap first and then try to freeze?

Yes, exactly. Using mapcount in combination with ref freezing is 
problematic. Except maybe for anonymous folios with mapcount=1, while 
holding a bunch of locks to stop anybody from stumbling over that.