x86/sgx: Suppress WARN on inability to sanitize EPC if ksgxd is stopped

Message ID 20210616004458.2192889-1-seanjc@google.com (mailing list archive)
State New, archived

Commit Message

Sean Christopherson June 16, 2021, 12:44 a.m. UTC
Don't WARN on having unsanitized EPC pages if ksgxd is stopped early,
e.g. if sgx_init() realizes there will be no downstream consumers of EPC.
If ksgxd is stopped early, EPC pages may be left on the dirty list, but
that's ok because ksgxd is only stopped if SGX initialization failed or
if the kernel is going down.  In either case, the EPC won't be used.

This bug was exposed by the addition of KVM support, but has existed and
was hittable since the original sanitization code was added.  Prior to
adding KVM support, if Launch Control was not fully enabled, e.g. when
running on older hardware, sgx_init() bailed immediately before spawning
ksgxd because X86_FEATURE_SGX was cleared if X86_FEATURE_SGX_LC was
unsupported.

With KVM support, sgx_drv_init() handles the X86_FEATURE_SGX_LC check
manually, so now there's an easy-to-hit case where sgx_init() will spawn
ksgxd and _then_ fail to initialize, which results in sgx_init() stopping
ksgxd before it finishes sanitizing the EPC.

Prior to KVM support, the bug was much harder to hit because it basically
required char device registration to fail.

Reported-by: Du Cheng <ducheng2@gmail.com>
Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

Lightly tested due to lack of hardware.  I hacked the flow to verify that
stopping early will leave work pending, and that rechecking should_stop()
suppresses the resulting WARN.

 arch/x86/kernel/cpu/sgx/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Du Cheng June 16, 2021, 6:03 a.m. UTC | #1

I applied this patch on 5.13-rc6, and WARN_ON() no longer triggers on my
NUC:

```

[    0.669411] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.669412] software IO TLB: mapped [mem 0x0000000017cb9000-0x000000001bcb9000] (64MB)
[    0.672788] platform rtc_cmos: registered platform RTC device (no PNP device found)
[    0.672805] sgx: EPC section 0x30200000-0x35f7ffff
[    0.674239] Initialise system trusted keyrings
[    0.674254] Key type blacklist registered

```

Regards,
Du Cheng
Borislav Petkov June 17, 2021, 4:45 p.m. UTC | #2
On Tue, Jun 15, 2021 at 05:44:58PM -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index ad904747419e..fbad2b9625a5 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -425,7 +425,7 @@ static int ksgxd(void *p)
>  	__sgx_sanitize_pages(&sgx_dirty_page_list);
>  
>  	/* sanity check: */
> -	WARN_ON(!list_empty(&sgx_dirty_page_list));
> +	WARN_ON(!list_empty(&sgx_dirty_page_list) && !kthread_should_stop());
>  
>  	while (!kthread_should_stop()) {
>  		if (try_to_freeze())
> -- 

Hmm, this looks weird. Why aren't we starting ksgxd only after
*everything* has initialized successfully? I.e., after both kvm and
native drivers' init functions have succeeded?

Then you won't have to do this kthread_should_stop() thing after the
fact.

Btw, you have the same thing in the while loop's termination condition
two lines down which, if I have to look at it later, would make me
scratch my head as to what TH is going on here.

Thx.
Dave Hansen June 18, 2021, 3:41 p.m. UTC | #3
On 6/17/21 9:45 AM, Borislav Petkov wrote:
> Hmm, this looks weird. Why aren't we starting ksgxd only after
> *everything* has initialized successfully? I.e., after both kvm and
> native drivers' init functions have succeeded?

ksgxd has two roles.  I think that's why it looks weird.

The obvious role is its use as the kswapd equivalent for SGX.

But, it's also used to speed up SGX initialization.  It "sanitizes" the
EPC asynchronously because it can take quite a while.  That's why it
gets launched off early.  If it gets interrupted, that's when this
warning can trigger.

I think you're suggesting that we just defer starting ksgxd until we
*know* it won't be interrupted, basically moving
sgx_page_reclaimer_init() down below sgx_drv_init() and sgx_vepc_init().

While I can see why it's best to get it going as early as possible, I
don't see much going on in those init functions that would justify
needing to fork off ksgxd earlier.  Am I missing anything?
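
To make the ordering question concrete, here is a minimal userspace sketch in C. It is only a model and not the real sgx_init(): `struct model_sgx`, `model_init_current()` and `model_init_deferred()` are hypothetical names. It assumes that initialization fails only when neither the native driver nor the KVM vEPC side comes up. It contrasts spawning ksgxd before the consumers are initialized with deferring it until at least one consumer has succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; none of this is kernel API. */
struct model_sgx {
	bool drv_ok;              /* would sgx_drv_init() succeed? (assumed input) */
	bool vepc_ok;             /* would sgx_vepc_init() succeed? (assumed input) */
	bool ksgxd_started;       /* was the ksgxd thread ever spawned? */
	bool ksgxd_stopped_early; /* spawned, then stopped by a failed init */
};

/* Ordering under discussion: spawn ksgxd first, then try the consumers. */
static int model_init_current(struct model_sgx *s)
{
	assert(s != NULL);

	s->ksgxd_started = true;               /* models sgx_page_reclaimer_init() */
	if (!s->drv_ok && !s->vepc_ok) {       /* no downstream consumer of EPC */
		s->ksgxd_stopped_early = true; /* models kthread_stop() on the error path */
		return -1;
	}
	return 0;
}

/* Deferred ordering: spawn ksgxd only once a consumer has initialized. */
static int model_init_deferred(struct model_sgx *s)
{
	assert(s != NULL);

	if (!s->drv_ok && !s->vepc_ok)
		return -1;                     /* the thread never existed */
	s->ksgxd_started = true;
	return 0;
}
```

In this model the deferred variant can never leave a half-finished sanitization behind, because the thread is only created once initialization can no longer fail. The cost is that sanitization starts a little later.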
Jarkko Sakkinen June 23, 2021, 1:31 p.m. UTC | #4
On Tue, Jun 15, 2021 at 05:44:58PM -0700, Sean Christopherson wrote:
> Don't WARN on having unsanitized EPC pages if ksgxd is stopped early,
> e.g. if sgx_init() realizes there will be no downstream consumers of EPC.
> If ksgxd is stopped early, EPC pages may be left on the dirty list, but
> that's ok because ksgxd is only stopped if SGX initialization failed or
> if the kernel is going down.  In either case, the EPC won't be used.
> 
> This bug was exposed by the addition of KVM support, but has existed and
> was hittable since the original sanitization code was added.  Prior to
> adding KVM support, if Launch Control was not fully enabled, e.g. when
> running on older hardware, sgx_init() bailed immediately before spawning
> ksgxd because X86_FEATURE_SGX was cleared if X86_FEATURE_SGX_LC was
> unsupported.
> 
> With KVM support, sgx_drv_init() handles the X86_FEATURE_SGX_LC check
> manually, so now there's an easy-to-hit case where sgx_init() will spawn
> ksgxd and _then_ fail to initialize, which results in sgx_init() stopping
> ksgxd before it finishes sanitizing the EPC.
> 
> Prior to KVM support, the bug was much harder to hit because it basically
> required char device registration to fail.
> 
> Reported-by: Du Cheng <ducheng2@gmail.com>
> Fixes: e7e0545299d8 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Thank you.

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

/Jarkko
Jarkko Sakkinen June 23, 2021, 1:32 p.m. UTC | #5
On Wed, Jun 16, 2021 at 02:03:09PM +0800, Du Cheng wrote:
> I applied this patch on 5.13-rc6, and WARN_ON() no longer triggers on my
> NUC:
> 
> ```
> 
> [    0.669411] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [    0.669412] software IO TLB: mapped [mem 0x0000000017cb9000-0x000000001bcb9000] (64MB)
> [    0.672788] platform rtc_cmos: registered platform RTC device (no PNP device found)
> [    0.672805] sgx: EPC section 0x30200000-0x35f7ffff
> [    0.674239] Initialise system trusted keyrings
> [    0.674254] Key type blacklist registered
> 
> ```
> 
> Regards,
> Du Cheng

Can you thus give a tested-by for this?

/Jarkko
Du Cheng June 25, 2021, 12:47 a.m. UTC | #6
On Wed, Jun 23, 2021 at 04:32:19PM +0300, Jarkko Sakkinen wrote:
> Can you thus give a tested-by for this?
> 
> /Jarkko

Certainly.

Tested-by: Du Cheng <ducheng2@gmail.com>

Patch

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ad904747419e..fbad2b9625a5 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -425,7 +425,7 @@  static int ksgxd(void *p)
 	__sgx_sanitize_pages(&sgx_dirty_page_list);
 
 	/* sanity check: */
-	WARN_ON(!list_empty(&sgx_dirty_page_list));
+	WARN_ON(!list_empty(&sgx_dirty_page_list) && !kthread_should_stop());
 
 	while (!kthread_should_stop()) {
 		if (try_to_freeze())
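
As a rough illustration of what the one-line change does, the following userspace C sketch models the sanity check before and after the patch. All names (`struct model_ksgxd`, `model_sanitize()`, `model_warn_old()`, `model_warn_new()`) are hypothetical stand-ins rather than kernel symbols. The sketch assumes that the sanitizer gives up as soon as a stop has been requested, which is what leaves pages on the dirty list in the scenario described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_ksgxd {
	size_t dirty_pages; /* pages still on the modeled dirty list */
	size_t stop_after;  /* a stop request arrives after this many pages */
	bool should_stop;   /* models the result of kthread_should_stop() */
};

/* Models the sanitization pass: bail out once a stop is requested. */
static void model_sanitize(struct model_ksgxd *k)
{
	size_t done = 0;

	assert(k != NULL);

	while (k->dirty_pages > 0) {
		if (done == k->stop_after)
			k->should_stop = true; /* e.g. init failed elsewhere */
		if (k->should_stop)
			return;                /* leaves pages on the dirty list */
		k->dirty_pages--;
		done++;
	}
}

/* Pre-patch condition: warn whenever any page is left on the dirty list. */
static bool model_warn_old(const struct model_ksgxd *k)
{
	return k->dirty_pages != 0;
}

/* Post-patch condition: leftover pages are only suspicious if nobody asked us to stop. */
static bool model_warn_new(const struct model_ksgxd *k)
{
	return k->dirty_pages != 0 && !k->should_stop;
}
```

With a stop request arriving partway through, the old condition fires and the new one stays quiet. If pages are left over without any stop request, both conditions still fire, so the sanity check keeps its value for real sanitization failures.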