x86/sgx: Synchronize encl->srcu in sgx_encl_release().

Message ID 20201211113230.28909-1-jarkko@kernel.org (mailing list archive)
State New, archived
Series x86/sgx: Synchronize encl->srcu in sgx_encl_release().

Commit Message

Jarkko Sakkinen Dec. 11, 2020, 11:32 a.m. UTC
Each sgx_mmun_notifier_release() starts a grace period, which means that
one extra synchronize_rcu() in sgx_encl_release(). Add it there.

sgx_release() has the loop that drains the list but with bad luck the
entry is already gone from the list before that loop processes it.

Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Sean Christopherson Dec. 14, 2020, 7:01 p.m. UTC | #1
On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> Each sgx_mmun_notifier_release() starts a grace period, which means that

Should be sgx_mmu_notifier_release(), here and in the comment.

> one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> 
> sgx_release() has the loop that drains the list but with bad luck the
> entry is already gone from the list before that loop processes it.

Why not include the actual analysis that "proves" the bug?  The splat that
Haitao reported would also be useful info.

> Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reported-by: Sean Christopherson <seanjc@google.com>

Haitao reported the bug, and for all intents and purposes provided the fix.  I
just did the analysis to verify that there was a legitimate bug and that the
synchronization in sgx_encl_release() was indeed necessary.

> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index ee50a5010277..48539a6ee315 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -438,6 +438,13 @@ void sgx_encl_release(struct kref *ref)
>  	if (encl->backing)
>  		fput(encl->backing);
>  
> +	/*
> +	 * Each sgx_mmun_notifier_release() starts a grace period. Thus one
> +	 * "extra" synchronize_rcu() is required here. This can go undetected by
> +	 * sgx_release() when it drains the mm list.
> +	 */
> +	synchronize_srcu(&encl->srcu);
> +
>  	cleanup_srcu_struct(&encl->srcu);
>  
>  	WARN_ON_ONCE(!list_empty(&encl->mm_list));
> -- 
> 2.27.0
>
Jarkko Sakkinen Dec. 15, 2020, 5:55 a.m. UTC | #2
On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> > Each sgx_mmun_notifier_release() starts a grace period, which means that
> 
> Should be sgx_mmu_notifier_release(), here and in the comment.

Thanks.

> > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> > 
> > sgx_release() has the loop that drains the list but with bad luck the
> > entry is already gone from the list before that loop processes it.
> 
> Why not include the actual analysis that "proves" the bug?  The splat that
> Haitao reported would also be useful info.

True. I can include a snippet of dmesg to the commit message.

> > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Reported-by: Sean Christopherson <seanjc@google.com>
> 
> Haitao reported the bug, and for all intents and purposes provided the fix.  I
> just did the analysis to verify that there was a legitimate bug and that the
> synchronization in sgx_encl_release() was indeed necessary.

Good and valid point. The way I see it, the tags should be:

Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>

Haitao pointed out the bug but from your analysis I could resolve that
this is the fix to implement, and was able to write the long
description for the commit.

Does this make sense to you?

/Jarkko
Jarkko Sakkinen Dec. 15, 2020, 5:59 a.m. UTC | #3
On Tue, Dec 15, 2020 at 07:56:01AM +0200, Jarkko Sakkinen wrote:
> On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> > > Each sgx_mmun_notifier_release() starts a grace period, which means that
> > 
> > Should be sgx_mmu_notifier_release(), here and in the comment.
> 
> Thanks.
> 
> > > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> > > 
> > > sgx_release() has the loop that drains the list but with bad luck the
> > > entry is already gone from the list before that loop processes it.
> > 
> > Why not include the actual analysis that "proves" the bug?  The splat that
> > Haitao reported would also be useful info.
> 
> True. I can include a snippet of dmesg to the commit message.
> 
> > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > > Cc: Borislav Petkov <bp@alien8.de>
> > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > > Reported-by: Sean Christopherson <seanjc@google.com>
> > 
> > Haitao reported the bug, and for all intents and purposes provided the fix.  I
> > just did the analysis to verify that there was a legitimate bug and that the
> > synchronization in sgx_encl_release() was indeed necessary.
> 
> Good and valid point. The way I see it, the tags should be:
> 
> Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> 
> Haitao pointed out the bug but from your analysis I could resolve that
> this is the fix to implement, and was able to write the long
> description for the commit.
> 
> Does this make sense to you?

I'm sending v2 next week (this week on vacation).

/Jarkko
Haitao Huang Dec. 15, 2020, 5:34 p.m. UTC | #4
On Mon, 14 Dec 2020 23:59:55 -0600, Jarkko Sakkinen <jarkko@kernel.org>
wrote:

> On Tue, Dec 15, 2020 at 07:56:01AM +0200, Jarkko Sakkinen wrote:
>> On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
>> > On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
>> > > Each sgx_mmun_notifier_release() starts a grace period, which means  
>> that
>> >
>> > Should be sgx_mmu_notifier_release(), here and in the comment.
>>
>> Thanks.
>>
>> > > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
>> > >
>> > > sgx_release() has the loop that drains the list but with bad luck  
>> the
>> > > entry is already gone from the list before that loop processes it.
>> >
>> > Why not include the actual analysis that "proves" the bug?  The splat  
>> that
>> > Haitao reported would also be useful info.
>>
>> True. I can include a snippet of dmesg to the commit message.
>>
>> > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
>> > > Cc: Borislav Petkov <bp@alien8.de>
>> > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> > > Reported-by: Sean Christopherson <seanjc@google.com>
>> >
>> > Haitao reported the bug, and for all intents and purposes provided  
>> the fix.  I
>> > just did the analysis to verify that there was a legitimate bug and  
>> that the
>> > synchronization in sgx_encl_release() was indeed necessary.
>>
>> Good and valid point. The way I see it, the tags should be:
>>
>> Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>>
>> Haitao pointed out the bug but from your analysis I could resolve that
>> this is the fix to implement, and was able to write the long
>> description for the commit.
>>
>> Does this make sense to you?
>
> I'm sending v2 next week (this week on vacation).
>
> /Jarkko

I don't mind either how tags are assigned. But our testing reveals
significant latency introduced in scenarios of heavy loading/unloading
enclaves. synchronize_srcu_expedited fixed the issue. Please analyze and
confirm if that's more appropriate than synchronize_srcu here.
Jarkko Sakkinen Dec. 15, 2020, 9:35 p.m. UTC | #5
On Tue, Dec 15, 2020 at 11:34:37AM -0600, Haitao Huang wrote:
> On Mon, 14 Dec 2020 23:59:55 -0600, Jarkko Sakkinen <jarkko@kernel.org>
> wrote:
> 
> > On Tue, Dec 15, 2020 at 07:56:01AM +0200, Jarkko Sakkinen wrote:
> > > On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > > > On Fri, Dec 11, 2020, Jarkko Sakkinen wrote:
> > > > > Each sgx_mmun_notifier_release() starts a grace period, which
> > > means that
> > > >
> > > > Should be sgx_mmu_notifier_release(), here and in the comment.
> > > 
> > > Thanks.
> > > 
> > > > > one extra synchronize_rcu() in sgx_encl_release(). Add it there.
> > > > >
> > > > > sgx_release() has the loop that drains the list but with bad
> > > luck the
> > > > > entry is already gone from the list before that loop processes it.
> > > >
> > > > Why not include the actual analysis that "proves" the bug?  The
> > > splat that
> > > > Haitao reported would also be useful info.
> > > 
> > > True. I can include a snippet of dmesg to the commit message.
> > > 
> > > > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > > > > Cc: Borislav Petkov <bp@alien8.de>
> > > > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > Reported-by: Sean Christopherson <seanjc@google.com>
> > > >
> > > > Haitao reported the bug, and for all intents and purposes provided
> > > the fix.  I
> > > > just did the analysis to verify that there was a legitimate bug
> > > and that the
> > > > synchronization in sgx_encl_release() was indeed necessary.
> > > 
> > > Good and valid point. The way I see it, the tags should be:
> > > 
> > > Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > 
> > > Haitao pointed out the bug but from your analysis I could resolve that
> > > this is the fix to implement, and was able to write the long
> > > description for the commit.
> > > 
> > > Does this make sense to you?
> > 
> > I'm sending v2 next week (this week on vacation).
> > 
> > /Jarkko
> 
> I don't mind either how tags are assigned. But our testing reveals
> significant latency introduced in scenarios of heavy loading/unloading
> enclaves. synchronize_srcu_expedited fixed the issue. Please analyze and
> confirm if that's more appropriate than synchronize_srcu here.

I don't see any obvious reason why *_expedited could not be used here,
as most of the time the syncs are taken care of by the sgx_release()
loop, and the final sync is with sgx_mmu_notifier_release(). More
aggressive spinning should not do any harm here.

About the tags: I just try to get them right, and it is sometimes not
straightforward. So I guess, with all things considered, I'll put a
Suggested-by from you. Once I get a refined patch out, try it out with
your workloads and provide me a Tested-by if it works for you.

/Jarkko
Sean Christopherson Dec. 15, 2020, 10:04 p.m. UTC | #6
On Tue, Dec 15, 2020, Jarkko Sakkinen wrote:
> On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > Haitao reported the bug, and for all intents and purposes provided the fix.  I
> > just did the analysis to verify that there was a legitimate bug and that the
> > synchronization in sgx_encl_release() was indeed necessary.
> 
> Good and valid point. The way I see it, the tags should be:
> 
> Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> 
> Haitao pointed out the bug but from your analysis I could resolve that
> this is the fix to implement, and was able to write the long
> description for the commit.
> 
> Does this make sense to you?

Yep, works for me.
Jarkko Sakkinen Dec. 16, 2020, 12:25 p.m. UTC | #7
On Tue, Dec 15, 2020 at 02:04:10PM -0800, Sean Christopherson wrote:
> On Tue, Dec 15, 2020, Jarkko Sakkinen wrote:
> > On Mon, Dec 14, 2020 at 11:01:32AM -0800, Sean Christopherson wrote:
> > > Haitao reported the bug, and for all intents and purposes provided the fix.  I
> > > just did the analysis to verify that there was a legitimate bug and that the
> > > synchronization in sgx_encl_release() was indeed necessary.
> > 
> > Good and valid point. The way I see it, the tags should be:
> > 
> > Reported-by: Haitao Huang <haitao.huang@linux.intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > 
> > Haitao pointed out the bug but from your analysis I could resolve that
> > this is the fix to implement, and was able to write the long
> > description for the commit.
> > 
> > Does this make sense to you?
> 
> Yep, works for me.

I'll just add two Suggested-by tags. The process guide does not forbid
that, and it best describes the situation.

/Jarkko
Patch

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index ee50a5010277..48539a6ee315 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -438,6 +438,13 @@  void sgx_encl_release(struct kref *ref)
 	if (encl->backing)
 		fput(encl->backing);
 
+	/*
+	 * Each sgx_mmun_notifier_release() starts a grace period. Thus one
+	 * "extra" synchronize_rcu() is required here. This can go undetected by
+	 * sgx_release() when it drains the mm list.
+	 */
+	synchronize_srcu(&encl->srcu);
+
 	cleanup_srcu_struct(&encl->srcu);
 
 	WARN_ON_ONCE(!list_empty(&encl->mm_list));
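
Editorial sketch (not part of the posted patch): the ordering the hunk above
enforces can be illustrated with a small user-space model. Everything named
toy_* is hypothetical and deliberately simplified. It is not the kernel's SRCU
implementation and it does not reproduce the race analysis from the thread. It
only assumes, as the commit message states, that a notifier release can leave a
grace period outstanding, and that tearing the structure down while one is
outstanding is an error.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct srcu_struct; NOT the kernel implementation. */
struct toy_srcu {
	int pending_gps;	/* grace periods started but not yet finished */
};

/* Models a notifier release that starts a grace period. */
static void toy_notifier_release(struct toy_srcu *s)
{
	s->pending_gps++;
}

/* Models synchronize_srcu(): on return, nothing is outstanding. */
static void toy_synchronize(struct toy_srcu *s)
{
	s->pending_gps = 0;
}

/* Models cleanup_srcu_struct(): only valid on a quiescent structure. */
static bool toy_cleanup(const struct toy_srcu *s)
{
	return s->pending_gps == 0;
}

/* Models the release path with and without the added synchronization. */
static bool toy_encl_release(struct toy_srcu *s, bool synchronize_first)
{
	if (synchronize_first)
		toy_synchronize(s);
	return toy_cleanup(s);
}

/* Exercises both orderings against a structure with one grace period open. */
static void toy_demo(void)
{
	struct toy_srcu unsynced = { .pending_gps = 0 };
	struct toy_srcu synced = { .pending_gps = 0 };

	toy_notifier_release(&unsynced);
	assert(!toy_encl_release(&unsynced, false));	/* cleanup too early */

	toy_notifier_release(&synced);
	assert(toy_encl_release(&synced, true));	/* patched ordering */
}
```

In this model, switching to the expedited variant discussed in the thread would
only change how long toy_synchronize() takes, not the ordering requirement.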