[v9,0/7] Basic recovery for machine checks inside SGX


Message ID

20211011185924.374213-1-tony.luck@intel.com (mailing list archive)

Headers

DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 293EC61054
From: Tony Luck <tony.luck@intel.com>
To: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	naoya.horiguchi@nec.com
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Sean Christopherson <seanjc@google.com>,
	Jarkko Sakkinen <jarkko@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Cathy Zhang <cathy.zhang@intel.com>,
	linux-sgx@vger.kernel.org,
	linux-acpi@vger.kernel.org,
	linux-mm@kvack.org,
	Tony Luck <tony.luck@intel.com>
Subject: [PATCH v9 0/7] Basic recovery for machine checks inside SGX
Date: Mon, 11 Oct 2021 11:59:17 -0700
Message-Id: <20211011185924.374213-1-tony.luck@intel.com>
In-Reply-To: <20211001164724.220532-1-tony.luck@intel.com>
References: <20211001164724.220532-1-tony.luck@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

Basic recovery for machine checks inside SGX

Message

Luck, Tony Oct. 11, 2021, 6:59 p.m. UTC

Posting latest version to a slightly wider audience.

The big picture is that SGX uses some memory pages that are walled off
from access by the OS. This means they:
1) Don't have "struct page" describing them
2) Don't appear in the kernel 1:1 map

But they are still backed by normal DDR memory, so errors can occur.

Parts 1-4 of this series handle the internal SGX bits to keep track of
these pages in an error context. They've had a fair amount of review
on the linux-sgx list (but if any of the 37 subscribers to that list
not named Jarkko or Reinette want to chime in with extra comments and
{Acked,Reviewed,Tested}-by that would be great).

Linux-mm reviewers can (if they like) skip to part 5 where two changes are
made: 1) Hook into memory_failure() in the same spot as device mapping 2)
Skip trying to change 1:1 map (since SGX pages aren't there).

The hooks have generic looking names rather than specifically saying
"sgx" at the suggestion of Dave Hansen. I'm not wedded to the names,
so better suggestions welcome.  I could also change to using some
"ARCH_HAS_PLATFORM_PAGES" config bits if that's the current fashion.

Rafael (and other ACPI list readers) can skip to parts 6 & 7 where there
are hooks into error injection and reporting to simply say "these odd
looking physical addresses are actually ok to use". I added some extra
notes to the einj.rst documentation on how to inject into SGX memory.

Tony Luck (7):
  x86/sgx: Add new sgx_epc_page flag bit to mark in-use pages
  x86/sgx: Add infrastructure to identify SGX EPC pages
  x86/sgx: Initial poison handling for dirty and free pages
  x86/sgx: Add SGX infrastructure to recover from poison
  x86/sgx: Hook arch_memory_failure() into mainline code
  x86/sgx: Add hook to error injection address validation
  x86/sgx: Add check for SGX pages to ghes_do_memory_failure()

 .../firmware-guide/acpi/apei/einj.rst         |  19 ++++
 arch/x86/include/asm/processor.h              |   8 ++
 arch/x86/include/asm/set_memory.h             |   4 +
 arch/x86/kernel/cpu/sgx/main.c                | 104 +++++++++++++++++-
 arch/x86/kernel/cpu/sgx/sgx.h                 |   6 +-
 drivers/acpi/apei/einj.c                      |   3 +-
 drivers/acpi/apei/ghes.c                      |   2 +-
 include/linux/mm.h                            |  14 +++
 mm/memory-failure.c                           |  19 +++-
 9 files changed, 168 insertions(+), 11 deletions(-)


base-commit: 64570fbc14f8d7cb3fe3995f20e26bc25ce4b2cc

Comments

Jarkko Sakkinen Oct. 12, 2021, 4:48 p.m. UTC | #1

On Mon, 2021-10-11 at 11:59 -0700, Tony Luck wrote:
> Posting latest version to a slightly wider audience.
> 
> The big picture is that SGX uses some memory pages that are walled off
> from access by the OS. This means they:
> 1) Don't have "struct page" describing them
> 2) Don't appear in the kernel 1:1 map
> 
> But they are still backed by normal DDR memory, so errors can occur.
> 
> Parts 1-4 of this series handle the internal SGX bits to keep track of
> these pages in an error context. They've had a fair amount of review
> on the linux-sgx list (but if any of the 37 subscribers to that list
> not named Jarkko or Reinette want to chime in with extra comments and
> {Acked,Reviewed,Tested}-by that would be great).
> 
> Linux-mm reviewers can (if they like) skip to part 5 where two changes are
> made: 1) Hook into memory_failure() in the same spot as device mapping 2)
> Skip trying to change 1:1 map (since SGX pages aren't there).
> 
> The hooks have generic looking names rather than specifically saying
> "sgx" at the suggestion of Dave Hansen. I'm not wedded to the names,
> so better suggestions welcome.  I could also change to using some
> "ARCH_HAS_PLATFORM_PAGES" config bits if that's the current fashion.
> 
> Rafael (and other ACPI list readers) can skip to parts 6 & 7 where there
> are hooks into error injection and reporting to simply say "these odd
> looking physical addresses are actually ok to use". I added some extra
> notes to the einj.rst documentation on how to inject into SGX memory.
> 
> Tony Luck (7):
>   x86/sgx: Add new sgx_epc_page flag bit to mark in-use pages
>   x86/sgx: Add infrastructure to identify SGX EPC pages
>   x86/sgx: Initial poison handling for dirty and free pages
>   x86/sgx: Add SGX infrastructure to recover from poison
>   x86/sgx: Hook arch_memory_failure() into mainline code
>   x86/sgx: Add hook to error injection address validation
>   x86/sgx: Add check for SGX pages to ghes_do_memory_failure()
> 
>  .../firmware-guide/acpi/apei/einj.rst         |  19 ++++
>  arch/x86/include/asm/processor.h              |   8 ++
>  arch/x86/include/asm/set_memory.h             |   4 +
>  arch/x86/kernel/cpu/sgx/main.c                | 104 +++++++++++++++++-
>  arch/x86/kernel/cpu/sgx/sgx.h                 |   6 +-
>  drivers/acpi/apei/einj.c                      |   3 +-
>  drivers/acpi/apei/ghes.c                      |   2 +-
>  include/linux/mm.h                            |  14 +++
>  mm/memory-failure.c                           |  19 +++-
>  9 files changed, 168 insertions(+), 11 deletions(-)
> 
> 
> base-commit: 64570fbc14f8d7cb3fe3995f20e26bc25ce4b2cc

I think you instructed me on this before but I've forgotten it:
how do I simulate this and test how it works?

/Jarkko

Luck, Tony Oct. 12, 2021, 5:57 p.m. UTC | #2

> I think you instructed me on this before but I've forgotten it:
> how do I simulate this and test how it works?

Jarkko,

You can test the non-execution paths (e.g. where the memory error is
reported by a patrol scrubber in the memory controller) by:

# echo 0x{some_SGX_EPC_ADDRESS} > /sys/devices/system/memory/hard_offline_page
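
A minimal sketch of that test, assuming the kernel logged its EPC ranges
at boot in the form "sgx: EPC section 0x<start>-0x<end>" (check dmesg on
the target system, the exact wording is an assumption here). The helper
names are made up for illustration and are not part of the series. It has
to run as root and it poisons a real EPC page, so only try it on a
disposable test machine.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from the patch series.

# Print the start address of the first EPC section found in kernel-log text
# read from stdin. Assumes lines containing "EPC section 0x<start>-0x<end>".
first_epc_addr() {
    sed -n 's/.*EPC section \(0x[0-9a-fA-F]*\)-0x[0-9a-fA-F]*.*/\1/p' | head -n 1
}

# Ask the kernel to hard-offline the page containing physical address $1.
offline_page() {
    echo "$1" > /sys/devices/system/memory/hard_offline_page
}

# Usage (as root):
#   addr=$(dmesg | first_epc_addr)
#   [ -n "$addr" ] && offline_page "$addr"
```

Any address inside an EPC section should do; the first page of the first
section is just the easiest one to pick out of the log.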

The execution paths are more difficult. You need a system that can inject
errors into EPC memory. There are some hints in the Documentation changes
in part 0006.
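
As a rough sketch of the injection side, the following arms an
uncorrectable memory error through the EINJ debugfs files without
triggering it, so that code running inside the enclave can consume the
error itself. The file names (error_type, param1, param2, notrigger,
error_inject) are the ones described in einj.rst. The function name, the
EINJ_DIR variable and the choice of error type 0x10 (uncorrectable
non-fatal memory error) are assumptions for illustration, and the
platform firmware has to allow injection into EPC for this to work at all.

```shell
#!/bin/sh
# Hypothetical helper; arm_einj is an illustrative name, not from the series.
# EINJ_DIR defaults to the usual debugfs location and can be overridden.
EINJ_DIR=${EINJ_DIR:-/sys/kernel/debug/apei/einj}

# Arm (but do not trigger) an uncorrectable memory error at physical address $1.
arm_einj() {
    # 0x10 = memory uncorrectable, non-fatal
    echo 0x10 > "$EINJ_DIR/error_type" || return 1
    # Target physical address
    echo "$1" > "$EINJ_DIR/param1" || return 1
    # Address mask: anywhere in this 4K page
    echo 0xfffffffffffff000 > "$EINJ_DIR/param2" || return 1
    # Set up the injection only, skip the trigger step
    echo 1 > "$EINJ_DIR/notrigger" || return 1
    echo 1 > "$EINJ_DIR/error_inject"
}

# Usage (as root, with debugfs mounted):
#   arm_einj 0x{some_SGX_EPC_ADDRESS}
```

With notrigger set, the error is expected to fire only when the enclave
itself touches that address. The in-enclave steps are covered by the
einj.rst notes added in the series.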

Reinette posted some changes to sgx tests that she used to validate.

-Tony