mbox series

[v2,0/2] Add a seqcount between gup_fast and copy_page_range()

Message ID 0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com (mailing list archive)
Headers show
Series Add a seqcount between gup_fast and copy_page_range() | expand

Message

Jason Gunthorpe Oct. 30, 2020, 2:46 p.m. UTC
As discussed and suggested by Linus use a seqcount to close the small race
between gup_fast and copy_page_range().

Unfortunately the good suggestion to just use write_seqcount_begin() blows
up lockdep immediately due to the (new?) requirement that the write side
of seqcount be in a preempt disabled region. For this application it does
not seem like a good idea, nor is it necessary as we don't spin on retry.
This is solved by being the first place to use raw_write_seqcount_t_begin()

This can go after the merge window. I was table to test it using two
threads, one forking and the other using ibv_reg_mr() to trigger GUP
fast. Modifying copy_page_range() to sleep made the window large enough to
reliably hit to test the logic.

v2:
 - Use start not addr in lockless_pages_from_mm
 - Replace unsigned long casts with using the proper variable type
 - Update comments
 - Use raw_write_seqcount_t_begin() instead of open coding
 - Update commit messages
v1: https://lore.kernel.org/r/0-v1-281e425c752f+2df-gup_fork_jgg@nvidia.com

To: linux-kernel@vger.kernel.org
To: Peter Xu <peterx@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>

Jason Gunthorpe (2):
  mm: reorganize internal_get_user_pages_fast()
  mm: prevent gup_fast from racing with COW during fork

 arch/x86/kernel/tboot.c    |   1 +
 drivers/firmware/efi/efi.c |   1 +
 include/linux/mm_types.h   |   7 +++
 kernel/fork.c              |   1 +
 mm/gup.c                   | 118 +++++++++++++++++++++++--------------
 mm/init-mm.c               |   1 +
 mm/memory.c                |  10 +++-
 7 files changed, 93 insertions(+), 46 deletions(-)

Comments

Ahmed S. Darwish Nov. 2, 2020, 10:19 p.m. UTC | #1
Hello Jason,

Thanks for keeping me in the loop on this.

I've also added the locking maintainers in Cc. IMHO there are some
seqlock.h API violations in this series, and they should have the final
say on this.

On Fri, Oct 30, 2020 at 11:46:19AM -0300, Jason Gunthorpe wrote:
>
> As discussed and suggested by Linus use a seqcount to close the small race
> between gup_fast and copy_page_range().
>
> Unfortunately the good suggestion to just use write_seqcount_begin() blows
> up lockdep immediately due to the (new?) requirement that the write side
> of seqcount be in a preempt disabled region.

Disabling preemption for seqcount_t write-side critical sections was
never a new requirement. It has always been this way, for the reasons
explained at Documentation/locking/seqlock.rst, "Introduction" section.

The recent seqcount_t changes did not mandate any new rules. This was
done explicitly, and on-purpose, not to break any of the *large* set of
existing seqcount_t call sites. It added multiple lockdep asserts
though, to catch a number of (already) buggy users, and they were fixed
beforehand.

It seems you have a special case here, so I'll continue discussing this
at patch #2 where the code resides. Just wanted to answer the "(new?)"
part above.

>                                               For this application it does
> not seem like a good idea, nor is it necessary as we don't spin on retry.
> This is solved by being the first place to use raw_write_seqcount_t_begin()
>

Regardless of this series write side preemptibility requirements, the
"_write_seqcount_*t*_()" interfaces are internal to seqlock.h and should
_never_ be used outside of it.

For plain seqcount_t, raw_write_seqcount_begin() is equivalent to
raw_write_seqcount_*t*_begin() anyway, and should already satisfy your
needs.

/me jumps to patch #2 now...

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH
Linus Torvalds Nov. 2, 2020, 10:39 p.m. UTC | #2
On Mon, Nov 2, 2020 at 2:19 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote:
>
> Disabling preemption for seqcount_t write-side critical sections was
> never a new requirement. It has always been this way, for the reasons
> explained at Documentation/locking/seqlock.rst, "Introduction" section.

Note that that is only true if you spin on the reading side (either of
the two kinds of spinning: (a) spinning to wait for it to become even,
or (b) repeating if they don't match)

Which this code doesn't do, it just fails.

I'm not sure how to perhaps document that.

             Linus
Ahmed S. Darwish Nov. 2, 2020, 11:18 p.m. UTC | #3
On Mon, Nov 02, 2020 at 02:39:49PM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2020 at 2:19 PM Ahmed S. Darwish <a.darwish@linutronix.de> wrote:
> >
> > Disabling preemption for seqcount_t write-side critical sections was
> > never a new requirement. It has always been this way, for the reasons
> > explained at Documentation/locking/seqlock.rst, "Introduction" section.
>
> Note that that is only true if you spin on the reading side (either of
> the two kinds of spinning: (a) spinning to wait for it to become even,
> or (b) repeating if they don't match)
>
> Which this code doesn't do, it just fails.
>
> I'm not sure how to perhaps document that.
>

Sure, and this is one of the reasons the lockdep non-preemptibility
check is only added to the non-raw variants of the seqcount write APIs.

Presumably, users of the raw_*() part of the API know what they're
doing, and they don't need to read seqlock.rst :)

(I'm in progress of replying to patch #2, which touches a bit on this
 and other points)..

>              Linus

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH