Message ID | 6fad9ec14ee94eaeb6d287988db60875da83b7bb.1651171455.git.reinette.chatre@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC,1/4] x86/sgx: Do not free backing memory on ENCLS[ELDU] failure | expand |
On 4/28/22 13:11, Reinette Chatre wrote: > > The backing storage is freed after running ENCLS[ELDU], > whether ENCLS[ELDU] succeeded or not. If ENCLS[ELDU] > thus failed then the data within that page is lost. > > Exit with error without removing the backing storage if > it could not be restored to the enclave. > > Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") > Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> > --- > arch/x86/kernel/cpu/sgx/encl.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index 1a2cbe44b8d9..e5d2661800ac 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -81,6 +81,10 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, > ENCLS_WARN(ret, "ELDU"); > > ret = -EFAULT; > + kunmap_atomic(pcmd_page); > + kunmap_atomic((void *)(unsigned long)pginfo.contents); > + sgx_encl_put_backing(&b, false); > + return ret; > } > > memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd)); Are there any transient, recoverable errors that can come back from ELDU? If so, this makes a lot of sense. If not, then it doesn't make a lot of sense to preserve the swapped-out content because they enclave is going to die anyway.
Hi Dave, On 4/28/2022 2:30 PM, Dave Hansen wrote: > On 4/28/22 13:11, Reinette Chatre wrote: > Are there any transient, recoverable errors that can come back from > ELDU? If so, this makes a lot of sense. If not, then it doesn't make a > lot of sense to preserve the swapped-out content because they enclave is > going to die anyway. Good point. Theoretically ELDU could encounter a page fault while accessing the regions it needs to read from and write to. These faults are passed through and the instruction would return with a #PF that is propagated with the page fault handler returning SIGBUS. Even so, this flow also impacts the SGX2 flows that need to load pages from the backing store. In this case the kernel would pass it as an error (-EFAULT) to the runtime but it would not result in the enclave being killed. If it was a #PF that caused the issue then perhaps theoretically the SGX2 instruction has a chance of succeeding if the runtime attempts it again? Reinette
On 4/28/22 15:20, Reinette Chatre wrote: > Hi Dave, > > On 4/28/2022 2:30 PM, Dave Hansen wrote: >> On 4/28/22 13:11, Reinette Chatre wrote: > >> Are there any transient, recoverable errors that can come back from >> ELDU? If so, this makes a lot of sense. If not, then it doesn't make a >> lot of sense to preserve the swapped-out content because they enclave is >> going to die anyway. > > Good point. > > Theoretically ELDU could encounter a page fault while accessing the > regions it needs to read from and write to. These faults are passed > through and the instruction would return with a #PF that is > propagated with the page fault handler returning SIGBUS. We don't have to worry about those, though, do we? We're operating entirely on kernel mappings that won't cause #PF. > Even so, this flow also impacts the SGX2 flows that need to load pages from > the backing store. In this case the kernel would pass it as an error > (-EFAULT) to the runtime but it would not result in the > enclave being killed. If it was a #PF that caused the issue then > perhaps theoretically the SGX2 instruction has a chance of succeeding > if the runtime attempts it again? How are the SGX2 flows different than what we have now? I also looked a little deeper at this transient failure problem. The ELDU documentation also mentions a possible error code of: SGX_EPC_PAGE_CONFLICT It *looks* like there can be conflicts on the SECS page as well as the EPC page being explicitly accessed. Is that a possible problem here?
Hi Dave, On 4/28/2022 3:53 PM, Dave Hansen wrote: > On 4/28/22 15:20, Reinette Chatre wrote: >> Hi Dave, >> >> On 4/28/2022 2:30 PM, Dave Hansen wrote: >>> On 4/28/22 13:11, Reinette Chatre wrote: >> >>> Are there any transient, recoverable errors that can come back from >>> ELDU? If so, this makes a lot of sense. If not, then it doesn't make a >>> lot of sense to preserve the swapped-out content because they enclave is >>> going to die anyway. >> >> Good point. >> >> Theoretically ELDU could encounter a page fault while accessing the >> regions it needs to read from and write to. These faults are passed >> through and the instruction would return with a #PF that is >> propagated with the page fault handler returning SIGBUS. > > We don't have to worry about those, though, do we? We're operating > entirely on kernel mappings that won't cause #PF. Indeed, yes, I do not see how an ELDU error or fault is recoverable. > >> Even so, this flow also impacts the SGX2 flows that need to load pages from >> the backing store. In this case the kernel would pass it as an error >> (-EFAULT) to the runtime but it would not result in the >> enclave being killed. If it was a #PF that caused the issue then >> perhaps theoretically the SGX2 instruction has a chance of succeeding >> if the runtime attempts it again? > > How are the SGX2 flows different than what we have now? SGX2 uses the same flow as the page fault handler to load the page into the enclave. The only difference is that the SGX2 flow removed the VMA permission checks. See: https://lore.kernel.org/lkml/db3a14f2d2df7678dec23375d48c96b603f8cfb5.1649878359.git.reinette.chatre@intel.com/ As per the trace printed in the WARN the issue being investigated is triggered by the ELDU run as part of the page fault handler, not the SGX2 flows. > > I also looked a little deeper at this transient failure problem. The > ELDU documentation also mentions a possible error code of: > > SGX_EPC_PAGE_CONFLICT > > It *looks* like there can be conflicts on the SECS page as well as the > EPC page being explicitly accessed. Is that a possible problem here? I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code supported by newer ELDUC - the ELDU used in current code would indeed #GP in this case. The SDM text describing ELDUC as "This leaf function behaves like ELDU but with improved conflict handling for oversubscription" really does seem relevant to the test that triggers this issue. I stopped pursuing this because from what I understand if SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") then it should also be encountered without it. The issue is not present with 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") removed. I am thus currently investigating based on the assumption that the #GP is encountered because of MAC verification problem. I may be wrong here also and need more information since the SDM documents two seemingly related errors: #GP(0) -> If the instruction fails to verify MAC. SGX_MAC_COMPARE_FAIL -> If the MAC check fails. Reinette
On Thu, 2022-04-28 at 16:49 -0700, Reinette Chatre wrote: > > > > I also looked a little deeper at this transient failure problem. The > > ELDU documentation also mentions a possible error code of: > > > > SGX_EPC_PAGE_CONFLICT > > > > It *looks* like there can be conflicts on the SECS page as well as the > > EPC page being explicitly accessed. Is that a possible problem here? > > I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code > supported by newer ELDUC - the ELDU used in current code would indeed > #GP in this case. The SDM text describing ELDUC as "This leaf function > behaves like ELDU but with improved conflict handling for oversubscription" > really does seem relevant to the test that triggers this issue. This new error code and the new leaf functions with "C" postfix (ELDUC, etc) are introduced to support VMM oversubscription of EPC. VMM oversubscription of EPC runs independently with guest so theoretically when VMM is performing some operation on EPC in one CPU, guest running in another CPU can touch the EPC simultaneously. The new "C" variants are supposed to be used by VMM when it supports VMM oversubscription of EPC, so that the VMM can a ENCLS instruction error code, rather than a #GP when this case happens. At guest side, when ENCLS conflicting happens, VMM will get a VM-exit so it can be handed by VMM, i.e. by letting the guest to run the same ENCLS again. For now the SGX driver doesn't need to use the "C" variant, nor should it expect the new SGX_EPC_PAGE_CONFLICT error code, because the driver already needs to serialize those ENCLS leaf functions which can not run concurrently. I assume this should apply to SGX2 support too.
On Thu, 2022-04-28 at 14:30 -0700, Dave Hansen wrote: > On 4/28/22 13:11, Reinette Chatre wrote: > > > > The backing storage is freed after running ENCLS[ELDU], > > whether ENCLS[ELDU] succeeded or not. If ENCLS[ELDU] > > thus failed then the data within that page is lost. > > > > Exit with error without removing the backing storage if > > it could not be restored to the enclave. > > > > Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") > > Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> > > --- > > arch/x86/kernel/cpu/sgx/encl.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > > index 1a2cbe44b8d9..e5d2661800ac 100644 > > --- a/arch/x86/kernel/cpu/sgx/encl.c > > +++ b/arch/x86/kernel/cpu/sgx/encl.c > > @@ -81,6 +81,10 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, > > ENCLS_WARN(ret, "ELDU"); > > > > ret = -EFAULT; > > + kunmap_atomic(pcmd_page); > > + kunmap_atomic((void *)(unsigned long)pginfo.contents); > > + sgx_encl_put_backing(&b, false); > > + return ret; > > } > > > > memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd)); > > Are there any transient, recoverable errors that can come back from > ELDU? If so, this makes a lot of sense. If not, then it doesn't make a > lot of sense to preserve the swapped-out content because they enclave is > going to die anyway. Nope, it's pretty much game over then. BR, Jarkko
On Thu, Apr 28, 2022 at 04:49:00PM -0700, Reinette Chatre wrote: > > I also looked a little deeper at this transient failure problem. The > > ELDU documentation also mentions a possible error code of: > > > > SGX_EPC_PAGE_CONFLICT > > > > It *looks* like there can be conflicts on the SECS page as well as the > > EPC page being explicitly accessed. Is that a possible problem here? > > I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code > supported by newer ELDUC - the ELDU used in current code would indeed > #GP in this case. The SDM text describing ELDUC as "This leaf function > behaves like ELDU but with improved conflict handling for oversubscription" > really does seem relevant to the test that triggers this issue. > > I stopped pursuing this because from what I understand if > SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 ("x86/sgx: > Free backing memory after faulting the enclave page") then it should > also be encountered without it. The issue is not present with > 08999b2489b4 ("x86/sgx: Free backing memory after faulting the > enclave page") removed. I am thus currently investigating based on > the assumption that the #GP is encountered because of MAC > verification problem. I may be wrong here also and need more information > since the SDM documents two seemingly related errors: > #GP(0) -> If the instruction fails to verify MAC. > SGX_MAC_COMPARE_FAIL -> If the MAC check fails. This part puzzles me in the pseudo-code. The version is read first: TMP_VER := DS:RDX[63:0]; Then there's MAC calculation, comparison, and finally this check: (* Check version before committing *) IF (DS:RDX ≠ 0) THEN #GP(0); ELSE DS:RDX := TMP_VER; FI; For me it is a mystery what does zero the slot and in what condition it would be non-zero. Perhaps the #GP refers anyway to this check? BR, Jarkko
Hi Jarkko, On 5/7/2022 10:25 AM, Jarkko Sakkinen wrote: > On Thu, Apr 28, 2022 at 04:49:00PM -0700, Reinette Chatre wrote: >>> I also looked a little deeper at this transient failure problem. The >>> ELDU documentation also mentions a possible error code of: >>> >>> SGX_EPC_PAGE_CONFLICT >>> >>> It *looks* like there can be conflicts on the SECS page as well as the >>> EPC page being explicitly accessed. Is that a possible problem here? >> >> I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code >> supported by newer ELDUC - the ELDU used in current code would indeed >> #GP in this case. The SDM text describing ELDUC as "This leaf function >> behaves like ELDU but with improved conflict handling for oversubscription" >> really does seem relevant to the test that triggers this issue. >> >> I stopped pursuing this because from what I understand if >> SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 ("x86/sgx: >> Free backing memory after faulting the enclave page") then it should >> also be encountered without it. The issue is not present with >> 08999b2489b4 ("x86/sgx: Free backing memory after faulting the >> enclave page") removed. I am thus currently investigating based on >> the assumption that the #GP is encountered because of MAC >> verification problem. I may be wrong here also and need more information >> since the SDM documents two seemingly related errors: >> #GP(0) -> If the instruction fails to verify MAC. >> SGX_MAC_COMPARE_FAIL -> If the MAC check fails. > > This part puzzles me in the pseudo-code. > > The version is read first: > > TMP_VER := DS:RDX[63:0]; > > Then there's MAC calculation, comparison, and finally this check: > > (* Check version before committing *) > IF (DS:RDX ≠ 0) > THEN #GP(0); > ELSE > DS:RDX := TMP_VER; > FI; > > For me it is a mystery what does zero the slot and in what condition > it would be non-zero. Perhaps the #GP refers anyway to this check? RDX contains the VA slot information and that appears to be correct in these scenarios. The issue is the PCMD data pointed to by the PAGEINFO.PCMD (link to PAGEINFO found in RBX) is all zeroes. There are two scenarios under which the PCMD data could be zeroes. They are documented in: https://lore.kernel.org/linux-sgx/8157fa40-8d02-8819-e1b6-fd2d8863fb56@intel.com/ https://lore.kernel.org/linux-sgx/da387afc-e666-45d0-1e99-066d8c4aab03@intel.com/ I understand that context may be lost by pointing you to various emails in this thread - I'll wrap up all learnings when I submit the new version of this series today. Reinette
On Mon, 2022-05-09 at 10:17 -0700, Reinette Chatre wrote: > Hi Jarkko, > > On 5/7/2022 10:25 AM, Jarkko Sakkinen wrote: > > On Thu, Apr 28, 2022 at 04:49:00PM -0700, Reinette Chatre wrote: > > > > I also looked a little deeper at this transient failure problem. The > > > > ELDU documentation also mentions a possible error code of: > > > > > > > > SGX_EPC_PAGE_CONFLICT > > > > > > > > It *looks* like there can be conflicts on the SECS page as well as the > > > > EPC page being explicitly accessed. Is that a possible problem here? > > > > > > I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code > > > supported by newer ELDUC - the ELDU used in current code would indeed > > > #GP in this case. The SDM text describing ELDUC as "This leaf function > > > behaves like ELDU but with improved conflict handling for oversubscription" > > > really does seem relevant to the test that triggers this issue. > > > > > > I stopped pursuing this because from what I understand if > > > SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 ("x86/sgx: > > > Free backing memory after faulting the enclave page") then it should > > > also be encountered without it. The issue is not present with > > > 08999b2489b4 ("x86/sgx: Free backing memory after faulting the > > > enclave page") removed. I am thus currently investigating based on > > > the assumption that the #GP is encountered because of MAC > > > verification problem. I may be wrong here also and need more information > > > since the SDM documents two seemingly related errors: > > > #GP(0) -> If the instruction fails to verify MAC. > > > SGX_MAC_COMPARE_FAIL -> If the MAC check fails. > > > > This part puzzles me in the pseudo-code. > > > > The version is read first: > > > > TMP_VER := DS:RDX[63:0]; > > > > Then there's MAC calculation, comparison, and finally this check: > > > > (* Check version before committing *) > > IF (DS:RDX ≠ 0) > > THEN #GP(0); > > ELSE > > DS:RDX := TMP_VER; > > FI; > > > > For me it is a mystery what does zero the slot and in what condition > > it would be non-zero. Perhaps the #GP refers anyway to this check? > > RDX contains the VA slot information and that appears to be correct > in these scenarios. The issue is the PCMD data pointed to by the > PAGEINFO.PCMD (link to PAGEINFO found in RBX) is all zeroes. > > There are two scenarios under which the PCMD data could be zeroes. They > are documented in: > https://lore.kernel.org/linux-sgx/8157fa40-8d02-8819-e1b6-fd2d8863fb56@intel.com/ > https://lore.kernel.org/linux-sgx/da387afc-e666-45d0-1e99-066d8c4aab03@intel.com/ > > I understand that context may be lost by pointing you to various emails > in this thread - I'll wrap up all learnings when I submit the new version > of this series today. > Hi Reinette, Regardless the root cause of this problem, I agree with Jarkko above pseudo-code in the spec is quite confusing. I can try to get it clarified from Intel internally if you want.
On Tue, May 10, 2022 at 12:36:19PM +1200, Kai Huang wrote: > On Mon, 2022-05-09 at 10:17 -0700, Reinette Chatre wrote: > > Hi Jarkko, > > > > On 5/7/2022 10:25 AM, Jarkko Sakkinen wrote: > > > On Thu, Apr 28, 2022 at 04:49:00PM -0700, Reinette Chatre wrote: > > > > > I also looked a little deeper at this transient failure problem. The > > > > > ELDU documentation also mentions a possible error code of: > > > > > > > > > > SGX_EPC_PAGE_CONFLICT > > > > > > > > > > It *looks* like there can be conflicts on the SECS page as well as the > > > > > EPC page being explicitly accessed. Is that a possible problem here? > > > > > > > > I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code > > > > supported by newer ELDUC - the ELDU used in current code would indeed > > > > #GP in this case. The SDM text describing ELDUC as "This leaf function > > > > behaves like ELDU but with improved conflict handling for oversubscription" > > > > really does seem relevant to the test that triggers this issue. > > > > > > > > I stopped pursuing this because from what I understand if > > > > SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 ("x86/sgx: > > > > Free backing memory after faulting the enclave page") then it should > > > > also be encountered without it. The issue is not present with > > > > 08999b2489b4 ("x86/sgx: Free backing memory after faulting the > > > > enclave page") removed. I am thus currently investigating based on > > > > the assumption that the #GP is encountered because of MAC > > > > verification problem. I may be wrong here also and need more information > > > > since the SDM documents two seemingly related errors: > > > > #GP(0) -> If the instruction fails to verify MAC. > > > > SGX_MAC_COMPARE_FAIL -> If the MAC check fails. > > > > > > This part puzzles me in the pseudo-code. > > > > > > The version is read first: > > > > > > TMP_VER := DS:RDX[63:0]; > > > > > > Then there's MAC calculation, comparison, and finally this check: > > > > > > (* Check version before committing *) > > > IF (DS:RDX ≠ 0) > > > THEN #GP(0); > > > ELSE > > > DS:RDX := TMP_VER; > > > FI; > > > > > > For me it is a mystery what does zero the slot and in what condition > > > it would be non-zero. Perhaps the #GP refers anyway to this check? > > > > RDX contains the VA slot information and that appears to be correct > > in these scenarios. The issue is the PCMD data pointed to by the > > PAGEINFO.PCMD (link to PAGEINFO found in RBX) is all zeroes. > > > > There are two scenarios under which the PCMD data could be zeroes. They > > are documented in: > > https://lore.kernel.org/linux-sgx/8157fa40-8d02-8819-e1b6-fd2d8863fb56@intel.com/ > > https://lore.kernel.org/linux-sgx/da387afc-e666-45d0-1e99-066d8c4aab03@intel.com/ > > > > I understand that context may be lost by pointing you to various emails > > in this thread - I'll wrap up all learnings when I submit the new version > > of this series today. > > > > Hi Reinette, > > Regardless the root cause of this problem, I agree with Jarkko above pseudo-code > in the spec is quite confusing. I can try to get it clarified from Intel > internally if you want. It is :-) Yeah, it would be great if it could be made a bit more punctual! BR, Jarkko
Hi Jarkko On Wed, 11 May 2022 05:26:15 -0500, Jarkko Sakkinen <jarkko@kernel.org> wrote: > On Tue, May 10, 2022 at 12:36:19PM +1200, Kai Huang wrote: >> On Mon, 2022-05-09 at 10:17 -0700, Reinette Chatre wrote: >> > Hi Jarkko, >> > >> > On 5/7/2022 10:25 AM, Jarkko Sakkinen wrote: >> > > On Thu, Apr 28, 2022 at 04:49:00PM -0700, Reinette Chatre wrote: >> > > > > I also looked a little deeper at this transient failure >> problem. The >> > > > > ELDU documentation also mentions a possible error code of: >> > > > > >> > > > > SGX_EPC_PAGE_CONFLICT >> > > > > >> > > > > It *looks* like there can be conflicts on the SECS page as well >> as the >> > > > > EPC page being explicitly accessed. Is that a possible problem >> here? >> > > > >> > > > I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error >> code >> > > > supported by newer ELDUC - the ELDU used in current code would >> indeed >> > > > #GP in this case. The SDM text describing ELDUC as "This leaf >> function >> > > > behaves like ELDU but with improved conflict handling for >> oversubscription" >> > > > really does seem relevant to the test that triggers this issue. >> > > > >> > > > I stopped pursuing this because from what I understand if >> > > > SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 >> ("x86/sgx: >> > > > Free backing memory after faulting the enclave page") then it >> should >> > > > also be encountered without it. The issue is not present with >> > > > 08999b2489b4 ("x86/sgx: Free backing memory after faulting the >> > > > enclave page") removed. I am thus currently investigating based on >> > > > the assumption that the #GP is encountered because of MAC >> > > > verification problem. I may be wrong here also and need more >> information >> > > > since the SDM documents two seemingly related errors: >> > > > #GP(0) -> If the instruction fails to verify MAC. >> > > > SGX_MAC_COMPARE_FAIL -> If the MAC check fails. >> > > >> > > This part puzzles me in the pseudo-code. >> > > >> > > The version is read first: >> > > >> > > TMP_VER := DS:RDX[63:0]; >> > > >> > > Then there's MAC calculation, comparison, and finally this check: >> > > >> > > (* Check version before committing *) >> > > IF (DS:RDX ≠ 0) >> > > THEN #GP(0); >> > > ELSE >> > > DS:RDX := TMP_VER; >> > > FI; >> > > >> > > For me it is a mystery what does zero the slot and in what condition >> > > it would be non-zero. Perhaps the #GP refers anyway to this check? >> > We discussed this internally, and agree this part of pseudo code needs be corrected/clarified. Here is what we think was going on when ELDU invoked with PCMD of all zeros: ELDU would check if the PCMD.SECINFO.FLAGS.PT is 0 which indicates that the page being loaded is a PT_SECS, and the PAGEINFO.SECS is not zero, then the instruction will #GP(0). Thus, ELDU is behaving correctly – it is an omission in the SDM pseudocode. The version checking code above also need be clarified because the VA slot would be cleared at this point and TMP_VER should be zero. We will follow up with the architect once he is back from sabbatical and make needed adjustment for SDM. Thanks Haitao >> > RDX contains the VA slot information and that appears to be correct >> > in these scenarios. The issue is the PCMD data pointed to by the >> > PAGEINFO.PCMD (link to PAGEINFO found in RBX) is all zeroes. >> > >> > There are two scenarios under which the PCMD data could be zeroes. >> They >> > are documented in: >> > >> https://lore.kernel.org/linux-sgx/8157fa40-8d02-8819-e1b6-fd2d8863fb56@intel.com/ >> > >> https://lore.kernel.org/linux-sgx/da387afc-e666-45d0-1e99-066d8c4aab03@intel.com/ >> > >> > I understand that context may be lost by pointing you to various >> emails >> > in this thread - I'll wrap up all learnings when I submit the new >> version >> > of this series today. >> > >> >> Hi Reinette, >> >> Regardless the root cause of this problem, I agree with Jarkko above >> pseudo-code >> in the spec is quite confusing. I can try to get it clarified from >> Intel >> internally if you want. > It is :-) Yeah, it would be great if it could be made a bit more > punctual! > > BR, Jarkko
> > > > > > > > > > This part puzzles me in the pseudo-code. > > > > > > > > > > The version is read first: > > > > > > > > > > TMP_VER := DS:RDX[63:0]; > > > > > > > > > > Then there's MAC calculation, comparison, and finally this check: > > > > > > > > > > (* Check version before committing *) > > > > > IF (DS:RDX ≠ 0) > > > > > THEN #GP(0); > > > > > ELSE > > > > > DS:RDX := TMP_VER; > > > > > FI; > > > > > > > > > > For me it is a mystery what does zero the slot and in what condition > > > > > it would be non-zero. Perhaps the #GP refers anyway to this check? > > > > > > > We discussed this internally, and agree this part of pseudo code needs be > corrected/clarified. > > Here is what we think was going on when ELDU invoked with PCMD of all > zeros: ELDU would check if the PCMD.SECINFO.FLAGS.PT is 0 which indicates > that the page being loaded is a PT_SECS, and the PAGEINFO.SECS is not > zero, then the instruction will #GP(0). Thus, ELDU is behaving correctly > – it is an omission in the SDM pseudocode. > > The version checking code above also need be clarified because the VA slot > would be cleared at this point and TMP_VER should be zero. "VA slot would be cleared at this point" isn't accurate. The VA slot itself is still occupied at this point. The original TMP_VER before the decryption is the VA slot value stored by EWB, and after decryption it becomes 0, if the decryption is correct. The correct pseudo-code should be: IF (TMP_VER != 0) THEN #GP(0); ELSE DS:RDX = TMP_VER; FI; The check of TMP_VER against 0 is just an additional safe guard to make sure decryption didn't fail.
On Wed, May 11, 2022 at 01:29:59PM -0500, Haitao Huang wrote: > Hi Jarkko > > On Wed, 11 May 2022 05:26:15 -0500, Jarkko Sakkinen <jarkko@kernel.org> > wrote: > > > On Tue, May 10, 2022 at 12:36:19PM +1200, Kai Huang wrote: > > > On Mon, 2022-05-09 at 10:17 -0700, Reinette Chatre wrote: > > > > Hi Jarkko, > > > > > > > > On 5/7/2022 10:25 AM, Jarkko Sakkinen wrote: > > > > > On Thu, Apr 28, 2022 at 04:49:00PM -0700, Reinette Chatre wrote: > > > > > > > I also looked a little deeper at this transient failure > > > problem. The > > > > > > > ELDU documentation also mentions a possible error code of: > > > > > > > > > > > > > > SGX_EPC_PAGE_CONFLICT > > > > > > > > > > > > > > It *looks* like there can be conflicts on the SECS page as > > > well as the > > > > > > > EPC page being explicitly accessed. Is that a possible > > > problem here? > > > > > > > > > > > > I went down this path myself. SGX_EPC_PAGE_CONFLICT is an > > > error code > > > > > > supported by newer ELDUC - the ELDU used in current code would > > > indeed > > > > > > #GP in this case. The SDM text describing ELDUC as "This leaf > > > function > > > > > > behaves like ELDU but with improved conflict handling for > > > oversubscription" > > > > > > really does seem relevant to the test that triggers this issue. > > > > > > > > > > > > I stopped pursuing this because from what I understand if > > > > > > SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4 > > > ("x86/sgx: > > > > > > Free backing memory after faulting the enclave page") then it > > > should > > > > > > also be encountered without it. The issue is not present with > > > > > > 08999b2489b4 ("x86/sgx: Free backing memory after faulting the > > > > > > enclave page") removed. I am thus currently investigating based on > > > > > > the assumption that the #GP is encountered because of MAC > > > > > > verification problem. I may be wrong here also and need more > > > information > > > > > > since the SDM documents two seemingly related errors: > > > > > > #GP(0) -> If the instruction fails to verify MAC. > > > > > > SGX_MAC_COMPARE_FAIL -> If the MAC check fails. > > > > > > > > > > This part puzzles me in the pseudo-code. > > > > > > > > > > The version is read first: > > > > > > > > > > TMP_VER := DS:RDX[63:0]; > > > > > > > > > > Then there's MAC calculation, comparison, and finally this check: > > > > > > > > > > (* Check version before committing *) > > > > > IF (DS:RDX ≠ 0) > > > > > THEN #GP(0); > > > > > ELSE > > > > > DS:RDX := TMP_VER; > > > > > FI; > > > > > > > > > > For me it is a mystery what does zero the slot and in what condition > > > > > it would be non-zero. Perhaps the #GP refers anyway to this check? > > > > > > > We discussed this internally, and agree this part of pseudo code needs be > corrected/clarified. > > Here is what we think was going on when ELDU invoked with PCMD of all zeros: > ELDU would check if the PCMD.SECINFO.FLAGS.PT is 0 which indicates that the > page being loaded is a PT_SECS, and the PAGEINFO.SECS is not zero, then the > instruction will #GP(0). Thus, ELDU is behaving correctly – it is an > omission in the SDM pseudocode. > > The version checking code above also need be clarified because the VA slot > would be cleared at this point and TMP_VER should be zero. > > We will follow up with the architect once he is back from sabbatical and > make needed adjustment for SDM. OK, great to hear, thank you! BR, Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 1a2cbe44b8d9..e5d2661800ac 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -81,6 +81,10 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, ENCLS_WARN(ret, "ELDU"); ret = -EFAULT; + kunmap_atomic(pcmd_page); + kunmap_atomic((void *)(unsigned long)pginfo.contents); + sgx_encl_put_backing(&b, false); + return ret; } memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd));
Recent commit 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") frees the backing storage after it becomes unneeded due to the data from it being loaded back into the EPC (Enclave Page Cache). The backing storage is freed after running ENCLS[ELDU], whether ENCLS[ELDU] succeeded or not. If ENCLS[ELDU] thus failed then the data within that page is lost. Exit with error without removing the backing storage if it could not be restored to the enclave. Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> --- arch/x86/kernel/cpu/sgx/encl.c | 4 ++++ 1 file changed, 4 insertions(+)