
[4/4] mm, notifier: Catch sleeping/blocking for !blockable

Message ID 20190820081902.24815-5-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Series mmu notifier debug annotations/checks

Commit Message

Daniel Vetter Aug. 20, 2019, 8:19 a.m. UTC
We need to make sure implementations don't cheat and don't have a
possible schedule/blocking point deeply buried where review can't
catch it.

I'm not sure whether this is the best way to make sure all the
might_sleep() callsites trigger, and it's a bit ugly in the code flow.
But it gets the job done.

Inspired by an i915 patch series which did exactly that, because the
rules haven't been entirely clear to us.

v2: Use the shiny new non_block_start/end annotations instead of
abusing preempt_disable/enable.

v3: Rebase on top of Glisse's arg rework.

v4: Rebase on top of more Glisse rework.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
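
For context, the non_block_start()/non_block_end() annotations used in this
patch come from an earlier patch in the series; under CONFIG_DEBUG_ATOMIC_SLEEP
they boil down to roughly the following (a sketch of the eventual mainline
definitions, not part of this patch):

	/* include/linux/kernel.h (sketch, CONFIG_DEBUG_ATOMIC_SLEEP) */
	# define non_block_start() (current->non_block_count++)
	# define non_block_end()   (WARN_ON(current->non_block_count-- == 0))

The scheduler's debug checks then warn if anything attempts to sleep while
non_block_count is non-zero, which is exactly what this patch wants to catch
for !blockable invalidations.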

Comments

Jason Gunthorpe Aug. 20, 2019, 1:34 p.m. UTC | #1
On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply buried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
> 
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.
> 
> v2: Use the shiny new non_block_start/end annotations instead of
> abusing preempt_disable/enable.
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: Rebase on top of more Glisse rework.
> 
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Reviewed-by: Christian König <christian.koenig@amd.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>  mm/mmu_notifier.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 538d3bb87f9b..856636d06ee0 100644
> +++ b/mm/mmu_notifier.c
> @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start) {
> -			int _ret = mn->ops->invalidate_range_start(mn, range);
> +			int _ret;
> +
> +			if (!mmu_notifier_range_blockable(range))
> +				non_block_start();
> +			_ret = mn->ops->invalidate_range_start(mn, range);
> +			if (!mmu_notifier_range_blockable(range))
> +				non_block_end();

If someone Acks all the sched changes then I can pick this for
hmm.git, but I still think the existing pre-emption debugging is fine
for this use case.

Also, same comment as for the lockdep map, this needs to apply to the
non-blocking range_end also.

Anyhow, since this series has conflicts with hmm.git it would be best
to flow the whole thing through that tree. If there are no
remarks on the first two patches I'll grab them in a few days.

Regards,
Jason
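
For illustration, the matching annotation on the end side that Jason asks for
could look roughly like this (a sketch against __mmu_notifier_invalidate_range_end()
as it stood at the time; not part of the posted patch):

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
		if (mn->ops->invalidate_range_end) {
			/* sketch: mirror the start-side annotation */
			if (!mmu_notifier_range_blockable(range))
				non_block_start();
			mn->ops->invalidate_range_end(mn, range);
			if (!mmu_notifier_range_blockable(range))
				non_block_end();
		}
	}
	srcu_read_unlock(&srcu, id);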
Daniel Vetter Aug. 20, 2019, 3:18 p.m. UTC | #2
On Tue, Aug 20, 2019 at 10:34:18AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
> > We need to make sure implementations don't cheat and don't have a
> > possible schedule/blocking point deeply buried where review can't
> > catch it.
> > 
> > I'm not sure whether this is the best way to make sure all the
> > might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> > But it gets the job done.
> > 
> > Inspired by an i915 patch series which did exactly that, because the
> > rules haven't been entirely clear to us.
> > 
> > v2: Use the shiny new non_block_start/end annotations instead of
> > abusing preempt_disable/enable.
> > 
> > v3: Rebase on top of Glisse's arg rework.
> > 
> > v4: Rebase on top of more Glisse rework.
> > 
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: "Christian König" <christian.koenig@amd.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > Cc: linux-mm@kvack.org
> > Reviewed-by: Christian König <christian.koenig@amd.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >  mm/mmu_notifier.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 538d3bb87f9b..856636d06ee0 100644
> > +++ b/mm/mmu_notifier.c
> > @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_start) {
> > -			int _ret = mn->ops->invalidate_range_start(mn, range);
> > +			int _ret;
> > +
> > +			if (!mmu_notifier_range_blockable(range))
> > +				non_block_start();
> > +			_ret = mn->ops->invalidate_range_start(mn, range);
> > +			if (!mmu_notifier_range_blockable(range))
> > +				non_block_end();
> 
> If someone Acks all the sched changes then I can pick this for
> hmm.git, but I still think the existing pre-emption debugging is fine
> for this use case.

Ok, I'll ping Peter Z. for an ack, iirc he was involved.

> Also, same comment as for the lockdep map, this needs to apply to the
> non-blocking range_end also.

Hm, I thought the page table locks we're holding there already prevent any
sleeping, so would be redundant? But reading through code I think that's
not guaranteed, so yeah makes sense to add it for invalidate_range_end
too. I'll respin once I have the ack/nack from scheduler people.

> Anyhow, since this series has conflicts with hmm.git it would be best
> to flow the whole thing through that tree. If there are no
> remarks on the first two patches I'll grab them in a few days.

Thanks, Daniel
Jason Gunthorpe Aug. 20, 2019, 3:27 p.m. UTC | #3
On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
> > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > > index 538d3bb87f9b..856636d06ee0 100644
> > > +++ b/mm/mmu_notifier.c
> > > @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> > >  	id = srcu_read_lock(&srcu);
> > >  	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
> > >  		if (mn->ops->invalidate_range_start) {
> > > -			int _ret = mn->ops->invalidate_range_start(mn, range);
> > > +			int _ret;
> > > +
> > > +			if (!mmu_notifier_range_blockable(range))
> > > +				non_block_start();
> > > +			_ret = mn->ops->invalidate_range_start(mn, range);
> > > +			if (!mmu_notifier_range_blockable(range))
> > > +				non_block_end();
> > 
> > If someone Acks all the sched changes then I can pick this for
> > hmm.git, but I still think the existing pre-emption debugging is fine
> > for this use case.
> 
> Ok, I'll ping Peter Z. for an ack, iirc he was involved.
> 
> > Also, same comment as for the lockdep map, this needs to apply to the
> > non-blocking range_end also.
> 
> Hm, I thought the page table locks we're holding there already prevent any
> sleeping, so would be redundant?

AFAIK no. All callers of invalidate_range_start/end pairs do so a few
lines apart and don't change their locking in between - thus since
start can block so can end.

Would love to know if that is not true??

Similarly I've also been idly wondering if we should add a
'might_sleep()' to invalidate_range_start/end() to make this constraint
clear & tested to the mm side?

Jason
Daniel Vetter Aug. 21, 2019, 9:34 a.m. UTC | #4
On Wed, Aug 21, 2019 at 9:33 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
> > > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > > > index 538d3bb87f9b..856636d06ee0 100644
> > > > +++ b/mm/mmu_notifier.c
> > > > @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> > > >   id = srcu_read_lock(&srcu);
> > > >   hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
> > > >           if (mn->ops->invalidate_range_start) {
> > > > -                 int _ret = mn->ops->invalidate_range_start(mn, range);
> > > > +                 int _ret;
> > > > +
> > > > +                 if (!mmu_notifier_range_blockable(range))
> > > > +                         non_block_start();
> > > > +                 _ret = mn->ops->invalidate_range_start(mn, range);
> > > > +                 if (!mmu_notifier_range_blockable(range))
> > > > +                         non_block_end();
> > >
> > > If someone Acks all the sched changes then I can pick this for
> > > hmm.git, but I still think the existing pre-emption debugging is fine
> > > for this use case.
> >
> > Ok, I'll ping Peter Z. for an ack, iirc he was involved.
> >
> > > Also, same comment as for the lockdep map, this needs to apply to the
> > > non-blocking range_end also.
> >
> > Hm, I thought the page table locks we're holding there already prevent any
> > sleeping, so would be redundant?
>
> AFAIK no. All callers of invalidate_range_start/end pairs do so a few
> lines apart and don't change their locking in between - thus since
> start can block so can end.
>
> Would love to know if that is not true??

Yeah I reviewed them, I think I mixed up a discussion I had a while
ago with Jerome. It's a bit tricky to follow in the code since in some
places ->invalidate_range and ->invalidate_range_end seem to be called
from the same place, in others not at all.

> Similarly I've also been idly wondering if we should add a
> 'might_sleep()' to invalidate_range_start/end() to make this constraint
> clear & tested to the mm side?

Hm, sounds like a useful idea. In general you won't test with mmu
notifiers, but they could happen, and then they will usually block on
at least some mutex. I'll throw that as an idea on top for the
next round.
-Daniel
Daniel Vetter Aug. 21, 2019, 3:41 p.m. UTC | #5
On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
> On Tue, Aug 20, 2019 at 10:34:18AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
> > > We need to make sure implementations don't cheat and don't have a
> > > possible schedule/blocking point deeply buried where review can't
> > > catch it.
> > > 
> > > I'm not sure whether this is the best way to make sure all the
> > > might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> > > But it gets the job done.
> > > 
> > > Inspired by an i915 patch series which did exactly that, because the
> > > rules haven't been entirely clear to us.
> > > 
> > > v2: Use the shiny new non_block_start/end annotations instead of
> > > abusing preempt_disable/enable.
> > > 
> > > v3: Rebase on top of Glisse's arg rework.
> > > 
> > > v4: Rebase on top of more Glisse rework.
> > > 
> > > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: David Rientjes <rientjes@google.com>
> > > Cc: "Christian König" <christian.koenig@amd.com>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > Cc: linux-mm@kvack.org
> > > Reviewed-by: Christian König <christian.koenig@amd.com>
> > > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > >  mm/mmu_notifier.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > > index 538d3bb87f9b..856636d06ee0 100644
> > > +++ b/mm/mmu_notifier.c
> > > @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> > >  	id = srcu_read_lock(&srcu);
> > >  	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
> > >  		if (mn->ops->invalidate_range_start) {
> > > -			int _ret = mn->ops->invalidate_range_start(mn, range);
> > > +			int _ret;
> > > +
> > > +			if (!mmu_notifier_range_blockable(range))
> > > +				non_block_start();
> > > +			_ret = mn->ops->invalidate_range_start(mn, range);
> > > +			if (!mmu_notifier_range_blockable(range))
> > > +				non_block_end();
> > 
> > If someone Acks all the sched changes then I can pick this for
> > hmm.git, but I still think the existing pre-emption debugging is fine
> > for this use case.
> 
> Ok, I'll ping Peter Z. for an ack, iirc he was involved.
> 
> > Also, same comment as for the lockdep map, this needs to apply to the
> > non-blocking range_end also.
> 
> Hm, I thought the page table locks we're holding there already prevent any
> sleeping, so would be redundant? But reading through code I think that's
> not guaranteed, so yeah makes sense to add it for invalidate_range_end
> too. I'll respin once I have the ack/nack from scheduler people.

So I started to look into this, and I'm a bit confused. There's no
_nonblock version of this, so does this mean blocking is never allowed,
or always allowed?

From a quick look through implementations I've only seen spinlocks, and
one up_read. So I guess I should wrap this callback in some unconditional
non_block_start/end, but I'm not sure.

Thanks, Daniel


> > Anyhow, since this series has conflicts with hmm.git it would be best
> > to flow the whole thing through that tree. If there are no
> > remarks on the first two patches I'll grab them in a few days.
> 
> Thanks, Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Jason Gunthorpe Aug. 21, 2019, 4:16 p.m. UTC | #6
On Wed, Aug 21, 2019 at 05:41:51PM +0200, Daniel Vetter wrote:

> > Hm, I thought the page table locks we're holding there already prevent any
> > sleeping, so would be redundant? But reading through code I think that's
> > not guaranteed, so yeah makes sense to add it for invalidate_range_end
> > too. I'll respin once I have the ack/nack from scheduler people.
> 
> So I started to look into this, and I'm a bit confused. There's no
> _nonblock version of this, so does this mean blocking is never allowed,
> or always allowed?

RDMA has a mutex:

ib_umem_notifier_invalidate_range_end
  rbt_ib_umem_for_each_in_range
   invalidate_range_start_trampoline
    ib_umem_notifier_end_account
      mutex_lock(&umem_odp->umem_mutex);

I'm working to delete this path though!

nonblocking or not follows the start, the same flag gets placed into
the mmu_notifier_range struct passed to end.

> From a quick look through implementations I've only seen spinlocks, and
> one up_read. So I guess I should wrap this callback in some unconditional
> non_block_start/end, but I'm not sure.

For now, we should keep it the same as start, conditionally blocking.

Hopefully before LPC I can send a RFC series that eliminates most
invalidate_range_end users in favor of common locking..

Jason
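
To make Jason's point concrete: the blockable flag set when invalidate_range_start()
is issued travels in the mmu_notifier_range, so the end callback sees the same
value. A driver callback pair honouring it could look like this (hypothetical
code, not from any tree; struct my_obj and its mutex are made-up names):

	struct my_obj {
		struct mmu_notifier mn;
		struct mutex lock;
	};

	static int my_invalidate_range_start(struct mmu_notifier *mn,
					     const struct mmu_notifier_range *range)
	{
		struct my_obj *obj = container_of(mn, struct my_obj, mn);

		if (!mmu_notifier_range_blockable(range)) {
			/* !blockable: must not sleep, only trylock and bail out */
			if (!mutex_trylock(&obj->lock))
				return -EAGAIN;
		} else {
			mutex_lock(&obj->lock);
		}

		/* ... tear down mappings in [range->start, range->end) ... */
		mutex_unlock(&obj->lock);
		return 0;
	}

	static void my_invalidate_range_end(struct mmu_notifier *mn,
					    const struct mmu_notifier_range *range)
	{
		/* same flag here: if !blockable, the end callback must not sleep either */
		if (!mmu_notifier_range_blockable(range))
			return;

		/* ... optional cleanup that may take sleeping locks ... */
	}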
Daniel Vetter Aug. 22, 2019, 8:42 a.m. UTC | #7
On Thu, Aug 22, 2019 at 10:16 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Aug 21, 2019 at 05:41:51PM +0200, Daniel Vetter wrote:
>
> > > Hm, I thought the page table locks we're holding there already prevent any
> > > sleeping, so would be redundant? But reading through code I think that's
> > > not guaranteed, so yeah makes sense to add it for invalidate_range_end
> > > too. I'll respin once I have the ack/nack from scheduler people.
> >
> > So I started to look into this, and I'm a bit confused. There's no
> > _nonblock version of this, so does this mean blocking is never allowed,
> > or always allowed?
>
> RDMA has a mutex:
>
> ib_umem_notifier_invalidate_range_end
>   rbt_ib_umem_for_each_in_range
>    invalidate_range_start_trampoline
>     ib_umem_notifier_end_account
>       mutex_lock(&umem_odp->umem_mutex);
>
> I'm working to delete this path though!
>
> nonblocking or not follows the start, the same flag gets placed into
> the mmu_notifier_range struct passed to end.

Ok, makes sense.

I guess that also means the might_sleep (I started on that) in
invalidate_range_end also needs to be conditional? Or not bother with
a might_sleep in invalidate_range_end since you're working on removing
the last sleep in there?

> > From a quick look through implementations I've only seen spinlocks, and
> > one up_read. So I guess I should wrap this callback in some unconditional
> > non_block_start/end, but I'm not sure.
>
> For now, we should keep it the same as start, conditionally blocking.
>
> Hopefully before LPC I can send a RFC series that eliminates most
> invalidate_range_end users in favor of common locking..

Thanks, Daniel
Jason Gunthorpe Aug. 22, 2019, 2:24 p.m. UTC | #8
On Thu, Aug 22, 2019 at 10:42:39AM +0200, Daniel Vetter wrote:

> > RDMA has a mutex:
> >
> > ib_umem_notifier_invalidate_range_end
> >   rbt_ib_umem_for_each_in_range
> >    invalidate_range_start_trampoline
> >     ib_umem_notifier_end_account
> >       mutex_lock(&umem_odp->umem_mutex);
> >
> > I'm working to delete this path though!
> >
> > nonblocking or not follows the start, the same flag gets placed into
> > the mmu_notifier_range struct passed to end.
> 
> Ok, makes sense.
> 
> I guess that also means the might_sleep (I started on that) in
> invalidate_range_end also needs to be conditional? Or not bother with
> a might_sleep in invalidate_range_end since you're working on removing
> the last sleep in there?

I might suggest the same pattern as used for locked: the might_sleep()
unconditionally on the start, and a second might_sleep() after the if in
__mmu_notifier_invalidate_range_end()

Observing that by audit all the callers already have the same locking
context for start/end

Jason
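
One way to read that suggestion, sketched against the helpers as they looked at
the time (the exact placement is an interpretation for illustration, not a
literal patch from the thread):

	int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
	{
		int ret = 0;

		/* start is always issued from sleepable context, even when !blockable */
		might_sleep();

		/* ... existing srcu_read_lock() and notifier walk, filling ret ... */
		return ret;
	}

	void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
						 bool only_end)
	{
		struct mmu_notifier *mn;
		int id;

		id = srcu_read_lock(&srcu);
		hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
			/* ... ->invalidate_range handling elided ... */
			if (mn->ops->invalidate_range_end) {
				/* second might_sleep() once an end callback actually runs */
				might_sleep();
				mn->ops->invalidate_range_end(mn, range);
			}
		}
		srcu_read_unlock(&srcu, id);
	}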
Daniel Vetter Aug. 22, 2019, 2:27 p.m. UTC | #9
On Thu, Aug 22, 2019 at 4:24 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 22, 2019 at 10:42:39AM +0200, Daniel Vetter wrote:
>
> > > RDMA has a mutex:
> > >
> > > ib_umem_notifier_invalidate_range_end
> > >   rbt_ib_umem_for_each_in_range
> > >    invalidate_range_start_trampoline
> > >     ib_umem_notifier_end_account
> > >       mutex_lock(&umem_odp->umem_mutex);
> > >
> > > I'm working to delete this path though!
> > >
> > > nonblocking or not follows the start, the same flag gets placed into
> > > the mmu_notifier_range struct passed to end.
> >
> > Ok, makes sense.
> >
> > I guess that also means the might_sleep (I started on that) in
> > invalidate_range_end also needs to be conditional? Or not bother with
> > a might_sleep in invalidate_range_end since you're working on removing
> > the last sleep in there?
>
> I might suggest the same pattern as used for locked: the might_sleep()
> unconditionally on the start, and a second might_sleep() after the if in
> __mmu_notifier_invalidate_range_end()
>
> Observing that by audit all the callers already have the same locking
> context for start/end

My question was more about enforcing that going forward, since you're
working to remove all the sleeps from invalidate_range_end. I don't
want to add debug annotations which are stricter than what the other
side actually expects. But since currently there are still sleeping
locks in invalidate_range_end I think I'll just stick them in both
places. You can then (re)move it when the cleanup lands.
-Daniel

Patch

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@  int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
 			if (_ret) {
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,