diff mbox series

[v4] x86/sgx: Fix the call order of synchronize_srcu() in sgx_release()

Message ID 20210115014638.15037-1-jarkko@kernel.org (mailing list archive)
State New, archived
Headers show
Series [v4] x86/sgx: Fix the call order of synchronize_srcu() in sgx_release() | expand

Commit Message

Jarkko Sakkinen Jan. 15, 2021, 1:46 a.m. UTC
From: Jarkko Sakkinen <jarkko@kernel.org>

The most trivial example of a race condition can be demonstrated with this
example where mm_list contains just one entry:

CPU A                   CPU B
sgx_release()
                        sgx_mmu_notifier_release()
                        list_del_rcu()
sgx_encl_release()
                        synchronize_srcu()
cleanup_srcu_struct()

To fix this, call synchronize_srcu() before checking whether mm_list is
empty in sgx_release().

Cc: stable@vger.kernel.org
Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
v4:
- Rewrite the commit message.
- Just change the call order. *_expedited() is out of scope for this
  bug fix.
v3: Fine-tuned tags, and added missing change log for v2.
v2: Switch to synchronize_srcu_expedited().
 arch/x86/kernel/cpu/sgx/driver.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Borislav Petkov Jan. 15, 2021, 7:18 a.m. UTC | #1
On Fri, Jan 15, 2021 at 03:46:38AM +0200, jarkko@kernel.org wrote:
> From: Jarkko Sakkinen <jarkko@kernel.org>
> 
> The most trivial example of a race condition can be demonstrated with this
> example where mm_list contains just one entry:
> 
> CPU A                   CPU B
> sgx_release()
>                         sgx_mmu_notifier_release()
>                         list_del_rcu()
> sgx_encl_release()
>                         synchronize_srcu()
> cleanup_srcu_struct()
> 
> To fix this, call synchronize_srcu() before checking whether mm_list is
> empty in sgx_release().

To fix what?

Lemme explain one more time: a commit message for a race condition
fix needs to explain in *detail* what the race condition is. And that
explanation needs to be understandable months and years from now.

You have the function call order above, now you have to explain what can
happen. Just how you did here:

https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org

> Cc: stable@vger.kernel.org
> Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> ---
> v4:
> - Rewrite the commit message.
> - Just change the call order. *_expedited() is out of scope for this
>   bug fix.
> v3: Fine-tuned tags, and added missing change log for v2.
> v2: Switch to synchronize_srcu_expedited().
>  arch/x86/kernel/cpu/sgx/driver.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> index f2eac41bb4ff..53056345f5f8 100644
> --- a/arch/x86/kernel/cpu/sgx/driver.c
> +++ b/arch/x86/kernel/cpu/sgx/driver.c
> @@ -65,11 +65,16 @@ static int sgx_release(struct inode *inode, struct file *file)
>  
>  		spin_unlock(&encl->mm_lock);
>  
> +		/*
> +		 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()

"The call is need"?


$ git grep sgx_encl_mmu_notifier_release
$

WTF?

Please be more careful.
Jarkko Sakkinen Jan. 16, 2021, 5:12 a.m. UTC | #2
On Fri, Jan 15, 2021 at 08:18:09AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:46:38AM +0200, jarkko@kernel.org wrote:
> > From: Jarkko Sakkinen <jarkko@kernel.org>
> > 
> > The most trivial example of a race condition can be demonstrated with this
> > example where mm_list contains just one entry:
> > 
> > CPU A                   CPU B
> > sgx_release()
> >                         sgx_mmu_notifier_release()
> >                         list_del_rcu()
> > sgx_encl_release()
> >                         synchronize_srcu()
> > cleanup_srcu_struct()
> > 
> > To fix this, call synchronize_srcu() before checking whether mm_list is
> > empty in sgx_release().
> 
> To fix what?
> 
> Lemme explain one more time: a commit message for a race condition
> fix needs to explain in *detail* what the race condition is. And that
> explanation needs to be understandable months and years from now.
> 
> You have the function call order above, now you have to explain what can
> happen. Just how you did here:
> 
> https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org

OK, I could recall the race that from but that must be partly because I've
been proactively working on it, i.e. getting your point.

So let's say I add this after the sequence:

"The sequence demonstrates a scenario where CPU B starts a new
grace period, which goes unnoticed by CPU A in sgx_release(),
because it did not remove the final entry from the enclave's
mm list."

Would this be sufficient or not?

> > Cc: stable@vger.kernel.org
> > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Suggested-by: Haitao Huang <haitao.huang@linux.intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> > ---
> > v4:
> > - Rewrite the commit message.
> > - Just change the call order. *_expedited() is out of scope for this
> >   bug fix.
> > v3: Fine-tuned tags, and added missing change log for v2.
> > v2: Switch to synchronize_srcu_expedited().
> >  arch/x86/kernel/cpu/sgx/driver.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> > index f2eac41bb4ff..53056345f5f8 100644
> > --- a/arch/x86/kernel/cpu/sgx/driver.c
> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > @@ -65,11 +65,16 @@ static int sgx_release(struct inode *inode, struct file *file)
> >  
> >  		spin_unlock(&encl->mm_lock);
> >  
> > +		/*
> > +		 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()
> 
> "The call is need"?
> 
> 
> $ git grep sgx_encl_mmu_notifier_release
> $
> 
> WTF?
> 
> Please be more careful.

Note taken.

> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko
Borislav Petkov Jan. 18, 2021, 6:57 p.m. UTC | #3
On Sat, Jan 16, 2021 at 07:12:54AM +0200, Jarkko Sakkinen wrote:
> > https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org
> 
> OK, I could recall the race that from but that must be partly because I've
> been proactively working on it, i.e. getting your point.
> 
> So let's say I add this after the sequence:
> 
> "The sequence demonstrates a scenario where CPU B starts a new
> grace period, which goes unnoticed by CPU A in sgx_release(),
> because it did not remove the final entry from the enclave's
> mm list."
> 
> Would this be sufficient or not?

Not sure.

That link above says:

"Now, let's imagine that there is exactly one entry in the encl->mm_list.
and sgx_release() execution gets scheduled right after returning from
synchronize_srcu().

With some bad luck, some process comes and removes that last entry befoe
sgx_release() acquires mm_lock."

So, the last entry gets removed by some other process before
sgx_release() acquires mm_lock. When it does acquire that lock, the test

	if (list_empty(&encl->mm_list))

will be true because "some other process" has removed that last entry.

So why do you need the synchronize_srcu() call when this process sees an
empty mm_list already?

Thx.
Jarkko Sakkinen Jan. 20, 2021, 2:43 p.m. UTC | #4
On Mon, Jan 18, 2021 at 07:57:12PM +0100, Borislav Petkov wrote:
> On Sat, Jan 16, 2021 at 07:12:54AM +0200, Jarkko Sakkinen wrote:
> > > https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org
> > 
> > OK, I could recall the race that from but that must be partly because I've
> > been proactively working on it, i.e. getting your point.
> > 
> > So let's say I add this after the sequence:
> > 
> > "The sequence demonstrates a scenario where CPU B starts a new
> > grace period, which goes unnoticed by CPU A in sgx_release(),
> > because it did not remove the final entry from the enclave's
> > mm list."
> > 
> > Would this be sufficient or not?
> 
> Not sure.
> 
> That link above says:
> 
> "Now, let's imagine that there is exactly one entry in the encl->mm_list.
> and sgx_release() execution gets scheduled right after returning from
> synchronize_srcu().
> 
> With some bad luck, some process comes and removes that last entry befoe
> sgx_release() acquires mm_lock."
> 
> So, the last entry gets removed by some other process before
> sgx_release() acquires mm_lock. When it does acquire that lock, the test
> 
> 	if (list_empty(&encl->mm_list))
> 
> will be true because "some other process" has removed that last entry.
> 
> So why do you need the synchronize_srcu() call when this process sees an
> empty mm_list already?
> 
> Thx.

The other process aka some process using the enclave calls list_del_rcu()
(and synchronize_srcu()), which starts a new grace period. If we don't
do it, then the cleanup_srcu() will race with that grace period.

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

/Jarkko
Dave Hansen Jan. 20, 2021, 5:34 p.m. UTC | #5
On 1/20/21 6:43 AM, Jarkko Sakkinen wrote:
>> So why do you need the synchronize_srcu() call when this process sees an
>> empty mm_list already?
>>
>> Thx.
> The other process aka some process using the enclave calls list_del_rcu()
> (and synchronize_srcu()), which starts a new grace period. If we don't
> do it, then the cleanup_srcu() will race with that grace period.

To me, this is only a partial explanation.

That goal of synchronize_srcu() is to wait for the completion of a
*previous* grace period: one that might have observed the old state of
the list.

Could you explain the *actual* effects of the misplaced
synchronize_srcu()?  If the race _occurs_, what is the side-effect?
Sean Christopherson Jan. 20, 2021, 5:35 p.m. UTC | #6
On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
> From: Jarkko Sakkinen <jarkko@kernel.org>
> 
> The most trivial example of a race condition can be demonstrated with this
> example where mm_list contains just one entry:
> 
> CPU A                   CPU B
> sgx_release()
>                         sgx_mmu_notifier_release()
>                         list_del_rcu()
> sgx_encl_release()
>                         synchronize_srcu()
> cleanup_srcu_struct()
> 
> To fix this, call synchronize_srcu() before checking whether mm_list is
> empty in sgx_release().

Why haven't you included the splat that Haitao provided?  That would go a long
way to helping answer Boris' question about exactly what is broken...

> Cc: stable@vger.kernel.org
> Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> ---
> v4:
> - Rewrite the commit message.
> - Just change the call order. *_expedited() is out of scope for this
>   bug fix.
> v3: Fine-tuned tags, and added missing change log for v2.
> v2: Switch to synchronize_srcu_expedited().
>  arch/x86/kernel/cpu/sgx/driver.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> index f2eac41bb4ff..53056345f5f8 100644
> --- a/arch/x86/kernel/cpu/sgx/driver.c
> +++ b/arch/x86/kernel/cpu/sgx/driver.c
> @@ -65,11 +65,16 @@ static int sgx_release(struct inode *inode, struct file *file)
>  
>  		spin_unlock(&encl->mm_lock);
>  
> +		/*
> +		 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()
> +		 * could have initiated a new grace period.
> +		 */
> +		synchronize_srcu(&encl->srcu);

I don't think this completely fixes the issue as sgx_release() isn't guaranteed
to trigger cleanup_srcu_struct(), e.g. the reclaimer can also have a reference
to the enclave.

> +
>  		/* The enclave is no longer mapped by any mm. */
>  		if (!encl_mm)
>  			break;
>  
> -		synchronize_srcu(&encl->srcu);
>  		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
>  		kfree(encl_mm);
>  	}
> -- 
> 2.29.2
>
Jarkko Sakkinen Jan. 21, 2021, 12:26 a.m. UTC | #7
On Wed, Jan 20, 2021 at 09:34:10AM -0800, Dave Hansen wrote:
> On 1/20/21 6:43 AM, Jarkko Sakkinen wrote:
> >> So why do you need the synchronize_srcu() call when this process sees an
> >> empty mm_list already?
> >>
> >> Thx.
> > The other process aka some process using the enclave calls list_del_rcu()
> > (and synchronize_srcu()), which starts a new grace period. If we don't
> > do it, then the cleanup_srcu() will race with that grace period.
> 
> To me, this is only a partial explanation.
> 
> That goal of synchronize_srcu() is to wait for the completion of a
> *previous* grace period: one that might have observed the old state of
> the list.
> 
> Could you explain the *actual* effects of the misplaced
> synchronize_srcu()?  If the race _occurs_, what is the side-effect?

As I haven't been able to reproduce this regression myself, I need
to take steps back and try to reproduce the it with Graphene.

WARN_ON()'s trigger inside cleanup_srcu_struct(), which causes a memory
leak since free_percpu() gets never called. If I recall correctly, it
was srcu_readers_active() but unfortunately I don't have a log available.

Perhaps Haitao could provide us one.

/Jarkko
Jarkko Sakkinen Jan. 21, 2021, 12:29 a.m. UTC | #8
On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
> > From: Jarkko Sakkinen <jarkko@kernel.org>
> > 
> > The most trivial example of a race condition can be demonstrated with this
> > example where mm_list contains just one entry:
> > 
> > CPU A                   CPU B
> > sgx_release()
> >                         sgx_mmu_notifier_release()
> >                         list_del_rcu()
> > sgx_encl_release()
> >                         synchronize_srcu()
> > cleanup_srcu_struct()
> > 
> > To fix this, call synchronize_srcu() before checking whether mm_list is
> > empty in sgx_release().
> 
> Why haven't you included the splat that Haitao provided?  That would go a long
> way to helping answer Boris' question about exactly what is broken...

I've lost the klog.

/Jarkko
Dave Hansen Jan. 21, 2021, 1:19 a.m. UTC | #9
On 1/20/21 4:29 PM, Jarkko Sakkinen wrote:
> On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
>> On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
>>> From: Jarkko Sakkinen <jarkko@kernel.org>
>>>
>>> The most trivial example of a race condition can be demonstrated with this
>>> example where mm_list contains just one entry:
>>>
>>> CPU A                   CPU B
>>> sgx_release()
>>>                         sgx_mmu_notifier_release()
>>>                         list_del_rcu()
>>> sgx_encl_release()
>>>                         synchronize_srcu()
>>> cleanup_srcu_struct()
>>>
>>> To fix this, call synchronize_srcu() before checking whether mm_list is
>>> empty in sgx_release().
>> Why haven't you included the splat that Haitao provided?  That would go a long
>> way to helping answer Boris' question about exactly what is broken...
> I've lost the klog.

Haitao said he thought it was this:

> void cleanup_srcu_struct(struct srcu_struct *ssp)
> {
...
>         if (WARN_ON(srcu_readers_active(ssp)))
>                 return; /* Just leak it! */

Which would indicate that an 'encl' kref reached 0 while some other
thread was inside a

        idx = srcu_read_lock(&encl->srcu);
	...
	srcu_read_unlock(&encl->srcu, idx);

critical section.  A quick audit didn't turn up any obvious suspects,
though.

If that *is* it, it might be nice to try to catch the culprit at
srcu_read_{un}lock() time.  If there's ever a 0 refcount at those sites,
it would be nice to spew a warning:

        idx = srcu_read_lock(&encl->srcu);
	WARN_ON(!atomic_read(&encl->refcount->refcount);
	...
	WARN_ON(!atomic_read(&encl->refcount->refcount);
	srcu_read_unlock(&encl->srcu, idx);

at each site.
Jarkko Sakkinen Jan. 21, 2021, 12:55 p.m. UTC | #10
On Wed, Jan 20, 2021 at 05:19:44PM -0800, Dave Hansen wrote:
> On 1/20/21 4:29 PM, Jarkko Sakkinen wrote:
> > On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
> >> On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
> >>> From: Jarkko Sakkinen <jarkko@kernel.org>
> >>>
> >>> The most trivial example of a race condition can be demonstrated with this
> >>> example where mm_list contains just one entry:
> >>>
> >>> CPU A                   CPU B
> >>> sgx_release()
> >>>                         sgx_mmu_notifier_release()
> >>>                         list_del_rcu()
> >>> sgx_encl_release()
> >>>                         synchronize_srcu()
> >>> cleanup_srcu_struct()
> >>>
> >>> To fix this, call synchronize_srcu() before checking whether mm_list is
> >>> empty in sgx_release().
> >> Why haven't you included the splat that Haitao provided?  That would go a long
> >> way to helping answer Boris' question about exactly what is broken...
> > I've lost the klog.
> 
> Haitao said he thought it was this:
> 
> > void cleanup_srcu_struct(struct srcu_struct *ssp)
> > {
> ...
> >         if (WARN_ON(srcu_readers_active(ssp)))
> >                 return; /* Just leak it! */
> 
> Which would indicate that an 'encl' kref reached 0 while some other
> thread was inside a
> 
>         idx = srcu_read_lock(&encl->srcu);
> 	...
> 	srcu_read_unlock(&encl->srcu, idx);
> 
> critical section.  A quick audit didn't turn up any obvious suspects,
> though.
> 
> If that *is* it, it might be nice to try to catch the culprit at
> srcu_read_{un}lock() time.  If there's ever a 0 refcount at those sites,
> it would be nice to spew a warning:
> 
>         idx = srcu_read_lock(&encl->srcu);
> 	WARN_ON(!atomic_read(&encl->refcount->refcount);
> 	...
> 	WARN_ON(!atomic_read(&encl->refcount->refcount);
> 	srcu_read_unlock(&encl->srcu, idx);
> 
> at each site.

The root cause is fully known already and audited.

An mm_list entry is kept up until the process exits *or* when
VFS close happens, and sgx_release() executes and removes it.

It's entirely possible that MMU notifier callback registered
by a process happens while sgx_release() is executing, which
causes list_del_rcu() to happen, unnoticed by sgx_release().

If that empties the list, cleanup_srcu_struct() is executed
unsynchronized in the middle a unfinished grace period.

/Jarkko
Dave Hansen Jan. 21, 2021, 6:19 p.m. UTC | #11
On 1/21/21 4:55 AM, Jarkko Sakkinen wrote:
> The root cause is fully known already and audited.
> 
> An mm_list entry is kept up until the process exits *or* when
> VFS close happens, and sgx_release() executes and removes it.
> 
> It's entirely possible that MMU notifier callback registered
> by a process happens while sgx_release() is executing, which
> causes list_del_rcu() to happen, unnoticed by sgx_release().

Just to be clear, are you talking about sgx_mmu_notifier_release()?

According to comments, mmu_notifier->release() can be called "before all
pages are freed."  That means it can be called before VMAs are torn
down, which is where the filp->release() is called.

There is also nothing to *stop* a VMA from being torn down or an fd
closed at the point that mmu_notifier->release() is called.


	fd = open("/dev/sgx") 	// filp count==1
	mmap(fd)		// filp count==2
	close(fd) 		// filp count==1
	munmap() 		// filp count==0,
		sgx_encl_release()
		cleanup_srcu_struct(&encl->srcu);
		kfree(encl);

Meanwhile, another thread is doing:

	exit() ... exit_mmap() ... mmu_notifier_release()

Which is touching 'encl'.

Shouldn't a kref be held on 'encl' to ensure it isn't referenced after
it is freed?

> If that empties the list, cleanup_srcu_struct() is executed
> unsynchronized in the middle a unfinished grace period.

> +	/*
> +	 * Each sgx_mmu_notifier_release() starts a grace period. Therefore, an
> +	 * additional sync is required here.
> +	 */

Except that sgx_mmu_notifier_release() doesn't do any srcu locking.  How
does sgx_mmu_notifier_release() start a new grace period?
Dave Hansen Jan. 22, 2021, 4:56 p.m. UTC | #12
On 1/20/21 9:35 AM, Sean Christopherson wrote:
> Why haven't you included the splat that Haitao provided?  That would go a long
> way to helping answer Boris' question about exactly what is broken...

The bad news is that the original splat seems to be lost.

The good news is that this is hard to reproduce and *might* not occur on
what got merged in mainline.

We're going to circle back around and make sure we have a clean
reproduction before we try to fix this again.
Haitao Huang Jan. 22, 2021, 6:20 p.m. UTC | #13
On Wed, 20 Jan 2021 18:26:56 -0600, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Wed, Jan 20, 2021 at 09:34:10AM -0800, Dave Hansen wrote:
>> On 1/20/21 6:43 AM, Jarkko Sakkinen wrote:
>> >> So why do you need the synchronize_srcu() call when this process  
>> sees an
>> >> empty mm_list already?
>> >>
>> >> Thx.
>> > The other process aka some process using the enclave calls  
>> list_del_rcu()
>> > (and synchronize_srcu()), which starts a new grace period. If we don't
>> > do it, then the cleanup_srcu() will race with that grace period.
>>
>> To me, this is only a partial explanation.
>>
>> That goal of synchronize_srcu() is to wait for the completion of a
>> *previous* grace period: one that might have observed the old state of
>> the list.
>>
>> Could you explain the *actual* effects of the misplaced
>> synchronize_srcu()?  If the race _occurs_, what is the side-effect?
>
> As I haven't been able to reproduce this regression myself, I need
> to take steps back and try to reproduce the it with Graphene.
>
> WARN_ON()'s trigger inside cleanup_srcu_struct(), which causes a memory
> leak since free_percpu() gets never called. If I recall correctly, it
> was srcu_readers_active() but unfortunately I don't have a log available.
>
> Perhaps Haitao could provide us one.
>
> /Jarkko

Sorry I lost those logs too as our email server automatically clean up old  
emails. I have been re-running the tests but have not been able to  
reproduce the same issue.

Haitao
Jarkko Sakkinen Jan. 23, 2021, 8:58 a.m. UTC | #14
On Fri, Jan 22, 2021 at 08:56:44AM -0800, Dave Hansen wrote:
> On 1/20/21 9:35 AM, Sean Christopherson wrote:
> > Why haven't you included the splat that Haitao provided?  That would go a long
> > way to helping answer Boris' question about exactly what is broken...
> 
> The bad news is that the original splat seems to be lost.
> 
> The good news is that this is hard to reproduce and *might* not occur on
> what got merged in mainline.
> 
> We're going to circle back around and make sure we have a clean
> reproduction before we try to fix this again.

Yeah, this a good opportunity to level up the QA.

/Jarkko
Dave Hansen Jan. 25, 2021, 3:49 p.m. UTC | #15
Haitao managed to create another splat over the weekend.  It was, indeed:

> WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100

which is:

>         if (WARN_ON(!srcu_get_delay(ssp)))
>                 return; /* Just leak it! */

That check means that there is an outstanding "expedited" grace period.
 The fact that it's expedited is not important.  This:

	https://lwn.net/Articles/202847/

describes the reasoning behind the warning:

	If the struct srcu_struct is dynamically allocated, then
	cleanup_srcu_struct() must be called before it is freed ... the
	caller must take care to ensure that all SRCU read-side critical
	sections have completed (and that no more will commence) before
	calling cleanup_srcu_struct().

synchronize_srcu() will (obviously) wait for the grace period to
complete.  Calling it will shut up the warning for sure, most of the time.

The required sequence of events is in here:

> https://lore.kernel.org/lkml/1492472726-3841-4-git-send-email-paulmck@linux.vnet.ibm.com/

I suspect that the mmu notifier's synchronize_srcu() is run in parallel
very close to when cleanup_srcu_struct() is called.  This violates the
"prevent any further calls to synchronize_srcu" rule.

So, while I suspect that adding a synchronize_srcu() is *part* of the
correct solution, I'm still not convinced that the
sgx_mmu_notifier_release() code is correct.
Jarkko Sakkinen Jan. 27, 2021, 5:31 p.m. UTC | #16
On Mon, Jan 25, 2021 at 07:49:04AM -0800, Dave Hansen wrote:
> Haitao managed to create another splat over the weekend.  It was, indeed:
> 
> > WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100
> 
> which is:
> 
> >         if (WARN_ON(!srcu_get_delay(ssp)))
> >                 return; /* Just leak it! */
> 
> That check means that there is an outstanding "expedited" grace period.
>  The fact that it's expedited is not important.  This:
> 
> 	https://lwn.net/Articles/202847/
> 
> describes the reasoning behind the warning:
> 
> 	If the struct srcu_struct is dynamically allocated, then
> 	cleanup_srcu_struct() must be called before it is freed ... the
> 	caller must take care to ensure that all SRCU read-side critical
> 	sections have completed (and that no more will commence) before
> 	calling cleanup_srcu_struct().
> 
> synchronize_srcu() will (obviously) wait for the grace period to
> complete.  Calling it will shut up the warning for sure, most of the time.
> 
> The required sequence of events is in here:
> 
> > https://lore.kernel.org/lkml/1492472726-3841-4-git-send-email-paulmck@linux.vnet.ibm.com/

I'll add "Link:" for this to the final fix. It's a good reference.

> I suspect that the mmu notifier's synchronize_srcu() is run in parallel
> very close to when cleanup_srcu_struct() is called.  This violates the
> "prevent any further calls to synchronize_srcu" rule.
> 
> So, while I suspect that adding a synchronize_srcu() is *part* of the
> correct solution, I'm still not convinced that the
> sgx_mmu_notifier_release() code is correct.

What Sean suggested in private discussion, i.e. using kref_get() in the MMU
notifier AFAIK should be enough to do the necessary serialization.

/Jarkko
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index f2eac41bb4ff..53056345f5f8 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -65,11 +65,16 @@  static int sgx_release(struct inode *inode, struct file *file)
 
 		spin_unlock(&encl->mm_lock);
 
+		/*
+		 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()
+		 * could have initiated a new grace period.
+		 */
+		synchronize_srcu(&encl->srcu);
+
 		/* The enclave is no longer mapped by any mm. */
 		if (!encl_mm)
 			break;
 
-		synchronize_srcu(&encl->srcu);
 		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
 		kfree(encl_mm);
 	}