[0/4,v2] fs/dcache: Resolve the last RT woes.

Message ID	20220727114904.130761-1-bigeasy@linutronix.de (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> To: linux-fsdevel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Matthew Wilcox <willy@infradead.org>, Thomas Gleixner <tglx@linutronix.de> Subject: [PATCH 0/4 v2] fs/dcache: Resolve the last RT woes. Date: Wed, 27 Jul 2022 13:49:00 +0200 Message-Id: <20220727114904.130761-1-bigeasy@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk
Series	fs/dcache: Resolve the last RT woes. \| expand [0/4,v2] fs/dcache: Resolve the last RT woes. [1/4,v2] fs/dcache: d_add_ci() needs to complete parallel lookup. [2/4,v2] fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT [3/4,v2] fs/dcache: Move the wakeup from __d_lookup_done() to the caller. [4/4,v2] fs/dcache: Move wakeup out of i_seq_dir write held region.

Message ID

20220727114904.130761-1-bigeasy@linutronix.de (mailing list archive)

Headers

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: linux-fsdevel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
        Matthew Wilcox <willy@infradead.org>,
        Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH 0/4 v2] fs/dcache: Resolve the last RT woes.
Date: Wed, 27 Jul 2022 13:49:00 +0200
Message-Id: <20220727114904.130761-1-bigeasy@linutronix.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

Series

fs/dcache: Resolve the last RT woes. | expand

Message

Sebastian Andrzej Siewior July 27, 2022, 11:49 a.m. UTC

This is v2 of the series, v1 is available at
https://https://lkml.kernel.org/r/.kernel.org/all/20220613140712.77932-1-bigeasy@linutronix.de/

v1…v2:
- Make patch around Al's description of a bug in d_add_ci(). I took
the liberty to make him Author and added his signed-off-by since I
sinmply added a patch-body around his words.

- The reasoning of why delaying the wakeup is reasonable has been
replaced with Al's analysis of the code.

- The split of wake up has been done differently (and I hope this is
what Al meant). First the wake up has been pushed to the caller and
then delayed to end_dir_add() after preemption has been enabled.

- There is still __d_lookup_unhash(), __d_lookup_unhash_wake() and
__d_lookup_done() is removed.
__d_lookup_done() is removed because it is exported and the return
value changes which will affect OOT users which are not aware of
it.
There is still d_lookup_done() which invokes
__d_lookup_unhash_wake(). This can't remain in the header file due
to cyclic depencies which in turn can't resolve wake_up_all()
within the inline function.

The original cover letter:

PREEMPT_RT has two issues with the dcache code:

1) i_dir_seq is a special cased sequence counter which uses the lowest
bit as writer lock. This requires that preemption is disabled.

On !RT kernels preemption is implictly disabled by spin_lock(), but
that's not the case on RT kernels.

Replacing i_dir_seq on RT with a seqlock_t comes with its own
problems due to arbitrary lock nesting. Using a seqcount with an
associated lock is not possible either because the locks held by the
callers are not necessarily related.

Explicitly disabling preemption on RT kernels across the i_dir_seq
write held critical section is the obvious and simplest solution. The
critical section is small, so latency is not really a concern.

2) The wake up of dentry::d_wait waiters is in a preemption disabled
section, which violates the RT constraints as wake_up_all() has
to acquire the wait queue head lock which is a 'sleeping' spinlock
on RT.

There are two reasons for the non-preemtible section:

A) The wake up happens under the hash list bit lock

B) The wake up happens inside the i_dir_seq write side
critical section

#A is solvable by moving it outside of the hash list bit lock held
section.

Making #B preemptible on RT is hard or even impossible due to lock
nesting constraints.

A possible solution would be to replace the waitqueue by a simple
waitqueue which can be woken up inside atomic sections on RT.

But aside of Linus not being a fan of simple waitqueues, there is
another observation vs. this wake up. It's likely for the woken up
waiter to immediately contend on dentry::lock.

It turns out that there is no reason to do the wake up within the
i_dir_seq write held region. The only requirement is to do the wake
up within the dentry::lock held region. Further details in the
individual patches.

That allows to move the wake up out of the non-preemptible section
on RT, which also reduces the dentry::lock held time after wake up.

Thanks,

Sebastian

Comments

Al Viro July 30, 2022, 4:41 a.m. UTC | #1

On Wed, Jul 27, 2022 at 01:49:00PM +0200, Sebastian Andrzej Siewior wrote:
> This is v2 of the series, v1 is available at
>    https://https://lkml.kernel.org/r/.kernel.org/all/20220613140712.77932-1-bigeasy@linutronix.de/
> 
> v1…v2:
>    - Make patch around Al's description of a bug in d_add_ci(). I took
>      the liberty to make him Author and added his signed-off-by since I
>      sinmply added a patch-body around his words.
> 
>    - The reasoning of why delaying the wakeup is reasonable has been
>      replaced with Al's analysis of the code.
> 
>    - The split of wake up has been done differently (and I hope this is
>      what Al meant). First the wake up has been pushed to the caller and
>      then delayed to end_dir_add() after preemption has been enabled.
> 
>    - There is still __d_lookup_unhash(), __d_lookup_unhash_wake() and
>      __d_lookup_done() is removed.
>      __d_lookup_done() is removed because it is exported and the return
>      value changes which will affect OOT users which are not aware of
>      it.
>      There is still d_lookup_done() which invokes
>      __d_lookup_unhash_wake(). This can't remain in the header file due
>      to cyclic depencies which in turn can't resolve wake_up_all()
>      within the inline function.

Applied, with slightly different first commit (we only care about
d_lookup_done() if d_splice_alias() returns non-NULL).

See vfs.git#work.dcache.