[0/5] hugetlbfs: Disable PMD sharing for large systems

Message ID

20190911150537.19527-1-longman@redhat.com (mailing list archive)

Headers

From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>,
        Will Deacon <will.deacon@arm.com>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-mm@kvack.org, Davidlohr Bueso <dave@stgolabs.net>,
        Waiman Long <longman@redhat.com>
Subject: [PATCH 0/5] hugetlbfs: Disable PMD sharing for large systems
Date: Wed, 11 Sep 2019 16:05:32 +0100
Message-Id: <20190911150537.19527-1-longman@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Series

hugetlbfs: Disable PMD sharing for large systems

Message

Waiman Long Sept. 11, 2019, 3:05 p.m. UTC

A customer with large SMP systems (up to 16 sockets) running an application
that uses a large amount of static hugepages (~500-1500GB) is experiencing
random multisecond delays. These delays were caused by the long time it
took to scan the VMA interval tree with mmap_sem held.

To fix this problem while preserving existing behavior as much as
possible, we need to allow a timeout in down_write() and disable PMD
sharing when it is taking too long to do so. Since a transaction can
involve touching multiple huge pages, timing out for each of the huge
page interactions does not completely solve the problem. So a threshold
is set to completely disable PMD sharing if too many timeouts happen.
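
The threshold idea can be illustrated with a small userspace sketch. It
is not taken from the patches: the structure, the function names and the
threshold value PMD_SHARE_TIMEOUT_MAX are all hypothetical, and the
lock_acquired argument merely stands in for the result of a timed lock
attempt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold; the cover letter does not state the real value. */
#define PMD_SHARE_TIMEOUT_MAX 10

/* Hypothetical per-mapping state: number of lock timeouts seen so far. */
struct pmd_share_state {
	atomic_int timeouts;
};

/* Sharing stays enabled until the timeout count reaches the threshold. */
static bool pmd_share_enabled(struct pmd_share_state *s)
{
	assert(s != NULL);
	return atomic_load(&s->timeouts) < PMD_SHARE_TIMEOUT_MAX;
}

/*
 * Decide whether to share a PMD for one huge page.
 * lock_acquired is the (assumed) outcome of the timed lock attempt.
 * Returns true if sharing may proceed, false if the caller should
 * fall back to a private PMD.
 */
static bool pmd_share_try(struct pmd_share_state *s, bool lock_acquired)
{
	if (!pmd_share_enabled(s))
		return false;	/* disabled for good: do not even try */

	if (!lock_acquired) {
		/* One more timeout; enough of these disables sharing. */
		atomic_fetch_add(&s->timeouts, 1);
		return false;
	}

	return true;
}
```

Once the counter reaches the threshold, later callers skip the lock
attempt altogether, so a transaction touching many huge pages pays for
at most a bounded number of timeouts.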

The first 4 patches of this 5-patch series add a new
down_write_timedlock() API which accepts a timeout argument and returns
true if locking is successful or false otherwise. It works more or less
like down_write_trylock(), but the calling thread may sleep.
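
For readers unfamiliar with timed lock acquisition, the semantics can be
approximated in userspace with POSIX pthread_rwlock_timedwrlock(). This
is only an analogy: write_timedlock() below is a hypothetical helper,
and neither its signature nor its nanosecond timeout unit is claimed to
match the proposed kernel API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical userspace analogue of down_write_timedlock(): try to
 * take the write lock, sleeping for at most timeout_ns nanoseconds.
 * Returns true if the lock was acquired, false on timeout or error.
 */
static bool write_timedlock(pthread_rwlock_t *lock, long timeout_ns)
{
	struct timespec deadline;

	assert(lock != NULL);

	/* The POSIX call takes an absolute CLOCK_REALTIME deadline. */
	if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
		return false;

	deadline.tv_sec += timeout_ns / NSEC_PER_SEC;
	deadline.tv_nsec += timeout_ns % NSEC_PER_SEC;
	if (deadline.tv_nsec >= NSEC_PER_SEC) {
		deadline.tv_sec++;
		deadline.tv_nsec -= NSEC_PER_SEC;
	}

	return pthread_rwlock_timedwrlock(lock, &deadline) == 0;
}
```

As with a trylock, the caller must check the return value and take a
fallback path on failure; unlike a trylock, the caller may block until
the deadline passes.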

The last patch implements the timeout mechanism as described above. With
the patched kernel installed, the customer confirmed that the problem
was gone.

Waiman Long (5):
  locking/rwsem: Add down_write_timedlock()
  locking/rwsem: Enable timeout check when spinning on owner
  locking/osq: Allow early break from OSQ
  locking/rwsem: Enable timeout check when staying in the OSQ
  hugetlbfs: Limit wait time when trying to share huge PMD

 include/linux/fs.h                |   7 ++
 include/linux/osq_lock.h          |  13 +--
 include/linux/rwsem.h             |   4 +-
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/mutex.c            |   2 +-
 kernel/locking/osq_lock.c         |  12 +-
 kernel/locking/rwsem.c            | 183 +++++++++++++++++++++++++-----
 mm/hugetlb.c                      |  24 +++-
 8 files changed, 201 insertions(+), 45 deletions(-)

Comments

Dave Chinner Sept. 13, 2019, 1:50 a.m. UTC | #1

On Wed, Sep 11, 2019 at 04:05:32PM +0100, Waiman Long wrote:
> A customer with large SMP systems (up to 16 sockets) running an application
> that uses a large amount of static hugepages (~500-1500GB) is experiencing
> random multisecond delays. These delays were caused by the long time it
> took to scan the VMA interval tree with mmap_sem held.
> 
> To fix this problem while preserving existing behavior as much as
> possible, we need to allow a timeout in down_write() and disable PMD
> sharing when it is taking too long to do so. Since a transaction can
> involve touching multiple huge pages, timing out for each of the huge
> page interactions does not completely solve the problem. So a threshold
> is set to completely disable PMD sharing if too many timeouts happen.
> 
> The first 4 patches of this 5-patch series add a new
> down_write_timedlock() API which accepts a timeout argument and returns
> true if locking is successful or false otherwise. It works more or less
> like down_write_trylock(), but the calling thread may sleep.

Just on general principle, this is a non-starter. If a lock is being
held too long, then whatever the lock is protecting needs fixing.
Adding timeouts to locks and sysctls to tune them is not a viable
solution to address latencies caused by algorithm scalability
issues.

Cheers,

Dave.

Peter Zijlstra Sept. 25, 2019, 8:35 a.m. UTC | #2

On Fri, Sep 13, 2019 at 11:50:43AM +1000, Dave Chinner wrote:
> On Wed, Sep 11, 2019 at 04:05:32PM +0100, Waiman Long wrote:
> > A customer with large SMP systems (up to 16 sockets) running an application
> > that uses a large amount of static hugepages (~500-1500GB) is experiencing
> > random multisecond delays. These delays were caused by the long time it
> > took to scan the VMA interval tree with mmap_sem held.
> > 
> > To fix this problem while preserving existing behavior as much as
> > possible, we need to allow a timeout in down_write() and disable PMD
> > sharing when it is taking too long to do so. Since a transaction can
> > involve touching multiple huge pages, timing out for each of the huge
> > page interactions does not completely solve the problem. So a threshold
> > is set to completely disable PMD sharing if too many timeouts happen.
> > 
> > The first 4 patches of this 5-patch series add a new
> > down_write_timedlock() API which accepts a timeout argument and returns
> > true if locking is successful or false otherwise. It works more or less
> > like down_write_trylock(), but the calling thread may sleep.
> 
> Just on general principle, this is a non-starter. If a lock is being
> held too long, then whatever the lock is protecting needs fixing.
> Adding timeouts to locks and sysctls to tune them is not a viable
> solution to address latencies caused by algorithm scalability
> issues.

I'm very much agreeing here. Lock functions with timeouts are a sign of
horrific design.