mbox series

[0/2] hugetlbfs: use i_mmap_rwsem for more synchronization

Message ID 20200305002650.160855-1-mike.kravetz@oracle.com (mailing list archive)
Headers show
Series hugetlbfs: use i_mmap_rwsem for more synchronization | expand

Message

Mike Kravetz March 5, 2020, 12:26 a.m. UTC
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races.  These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as described
at [2].  However, those patches were reverted starting with [3] due to
locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to
be held (in read mode) during page fault processing.  However, during
fault processing we need to lock the page we will be adding.  Lock
ordering requires we take page lock before i_mmap_rwsem.  Waiting until
after taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the
lock ordering for hugetlb pages.  This is not too invasive as hugetlbfs
processing is done separate from core mm in many places.  However, I
don't really like this idea.  Much ugliness is contained in the new
routine hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races.  After catching a race, cleanup, backout, retry ... etc,
as needed.  This can get really ugly, especially for huge page reservations.
At one time, I started writing some of the reservation backout code for
page faults and it got so ugly and complicated I went down the path of
adding synchronization to avoid the races.  Any other suggestions would
be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com

Mike Kravetz (2):
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race

 fs/hugetlbfs/inode.c    |  34 +++++---
 include/linux/fs.h      |   5 ++
 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c            | 175 +++++++++++++++++++++++++++++++++++-----
 mm/memory-failure.c     |  29 ++++++-
 mm/migrate.c            |  24 +++++-
 mm/rmap.c               |  17 +++-
 mm/userfaultfd.c        |  11 ++-
 8 files changed, 264 insertions(+), 39 deletions(-)

Comments

Qian Cai March 12, 2020, 3:57 p.m. UTC | #1
On Wed, 2020-03-04 at 16:26 -0800, Mike Kravetz wrote:
> While discussing the issue with huge_pte_offset [1], I remembered that
> there were more outstanding hugetlb races.  These issues are:
> 
> 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
>    invalid via a call to huge_pmd_unshare by another thread.
> 2) hugetlbfs page faults can race with truncation causing invalid global
>    reserve counts and state.
> 
> A previous attempt was made to use i_mmap_rwsem in this manner as described
> at [2].  However, those patches were reverted starting with [3] due to
> locking issues.
> 
> To effectively use i_mmap_rwsem to address the above issues it needs to
> be held (in read mode) during page fault processing.  However, during
> fault processing we need to lock the page we will be adding.  Lock
> ordering requires we take page lock before i_mmap_rwsem.  Waiting until
> after taking the page lock is too late in the fault process for the
> synchronization we want to do.
> 
> To address this lock ordering issue, the following patches change the
> lock ordering for hugetlb pages.  This is not too invasive as hugetlbfs
> processing is done separate from core mm in many places.  However, I
> don't really like this idea.  Much ugliness is contained in the new
> routine hugetlb_page_mapping_lock_write() of patch 1.
> 
> The only other way I can think of to address these issues is by catching
> all the races.  After catching a race, cleanup, backout, retry ... etc,
> as needed.  This can get really ugly, especially for huge page reservations.
> At one time, I started writing some of the reservation backout code for
> page faults and it got so ugly and complicated I went down the path of
> adding synchronization to avoid the races.  Any other suggestions would
> be welcome.

Reverted this series on the top of today's linux-next fixed the hang with LTP
move_pages12 on both powerpc and arm64,

# /opt/ltp/testcases/bin/move_pages12
tst_test.c:1217: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:263: INFO: Free RAM 260577280 kB
move_pages12.c:281: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:291: INFO: Increasing 2048kB hugepages pool on node 8 to 4
move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 8
<hang>

[ 3948.791155][  T688] INFO: task move_pages12:32930 can't die for more than
3072 seconds.
[ 3948.791181][  T688] move_pages12    D26224 32930      1 0x00040002
[ 3948.791199][  T688] Call Trace:
[ 3948.791210][  T688] [c000200816b4f680] [c0000000010b7a68]
cpufreq_update_util_data+0x0/0x8 (unreliable)
[ 3948.791234][  T688] [c000200816b4f860] [c00000000002615c]
__switch_to+0x38c/0x520
[ 3948.791247][  T688] [c000200816b4f8d0] [c0000000009a1c94]
__schedule+0x4b4/0xba0
[ 3948.791268][  T688] [c000200816b4f9a0] [c0000000009a2428] schedule+0xa8/0x170
[ 3948.791288][  T688] [c000200816b4f9d0] [c0000000009a2d0c]
io_schedule+0x2c/0x50
[ 3948.791300][  T688] [c000200816b4fa00] [c000000000331020]
__lock_page+0x150/0x3c0
[ 3948.791322][  T688] [c000200816b4fac0] [c000000000420164]
hugetlb_no_page+0xb04/0xd40
lock_page at include/linux/pagemap.h:480
(inlined by) hugetlb_no_page at mm/hugetlb.c:4286
[ 3948.791342][  T688] [c000200816b4fc10] [c000000000420bd8]
hugetlb_fault+0x738/0xc00
[ 3948.791363][  T688] [c000200816b4fcd0] [c0000000003b9c44]
handle_mm_fault+0x444/0x450
[ 3948.791384][  T688] [c000200816b4fd20] [c000000000070b98]
__do_page_fault+0x2b8/0xf90
[ 3948.791406][  T688] [c000200816b4fe20] [c00000000000aa88]
handle_page_fault+0x10/0x30

> 
> [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
> [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
> [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
> 
> Mike Kravetz (2):
>   hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
>   hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
> 
>  fs/hugetlbfs/inode.c    |  34 +++++---
>  include/linux/fs.h      |   5 ++
>  include/linux/hugetlb.h |   8 ++
>  mm/hugetlb.c            | 175 +++++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c     |  29 ++++++-
>  mm/migrate.c            |  24 +++++-
>  mm/rmap.c               |  17 +++-
>  mm/userfaultfd.c        |  11 ++-
>  8 files changed, 264 insertions(+), 39 deletions(-)
>
Mike Kravetz March 12, 2020, 4:34 p.m. UTC | #2
On 3/12/20 8:57 AM, Qian Cai wrote:
> On Wed, 2020-03-04 at 16:26 -0800, Mike Kravetz wrote:
>> While discussing the issue with huge_pte_offset [1], I remembered that
>> there were more outstanding hugetlb races.  These issues are:
> 
> Reverted this series on the top of today's linux-next fixed the hang with LTP
> move_pages12 on both powerpc and arm64,
> 
> # /opt/ltp/testcases/bin/move_pages12
> tst_test.c:1217: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:263: INFO: Free RAM 260577280 kB
> move_pages12.c:281: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:291: INFO: Increasing 2048kB hugepages pool on node 8 to 4
> move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:207: INFO: Allocating and freeing 4 hugepages on node 8
> <hang>

Thank you for finding this.
I'll dig into it.  It is timing related as it takes a few test runs for
me to reproduce.

Sorry for the issues.  Feel free to revert upstream and mm tree until
there is a resolution.