Message ID | 5aa7506d4fedbf625e3fe8ceeb88af3be1ce97ea.1685887183.git.kai.huang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | TDX host kernel support |
On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > index 8ff07256a515..0aa413b712e8 100644 > --- a/arch/x86/virt/vmx/tdx/tdx.c > +++ b/arch/x86/virt/vmx/tdx/tdx.c > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > tdmr_pamt_base += pamt_size[pgsz]; > } > > + /* > + * tdx_memory_shutdown() also reads TDMR's PAMT during > + * kexec() or reboot, which could happen at anytime, even > + * during this particular code. Make sure pamt_4k_base > + * is firstly set otherwise tdx_memory_shutdown() may > + * get an invalid PAMT base when it sees a valid number > + * of PAMT pages. > + */ Hmm? What prevents compiler from messing this up. It can reorder as it wishes, no? Maybe add a proper locking? Anything that prevent preemption would do, right? > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > index 8ff07256a515..0aa413b712e8 100644 > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > tdmr_pamt_base += pamt_size[pgsz]; > > } > > > > + /* > > + * tdx_memory_shutdown() also reads TDMR's PAMT during > > + * kexec() or reboot, which could happen at anytime, even > > + * during this particular code. Make sure pamt_4k_base > > + * is firstly set otherwise tdx_memory_shutdown() may > > + * get an invalid PAMT base when it sees a valid number > > + * of PAMT pages. > > + */ > > Hmm? What prevents compiler from messing this up. It can reorder as it > wishes, no? Hmm.. Right. Sorry I missed. > > Maybe add a proper locking? Anything that prevent preemption would do, > right? > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; > I think a simple memory barrier will do. How does below look? --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, * tdx_memory_shutdown() also reads TDMR's PAMT during * kexec() or reboot, which could happen at anytime, even * during this particular code. Make sure pamt_4k_base - * is firstly set otherwise tdx_memory_shutdown() may - * get an invalid PAMT base when it sees a valid number - * of PAMT pages. + * is firstly set and place a __mb() after it otherwise + * tdx_memory_shutdown() may get an invalid PAMT base + * when it sees a valid number of PAMT pages. */ tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; + __mb();
On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote: > > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > > index 8ff07256a515..0aa413b712e8 100644 > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > > tdmr_pamt_base += pamt_size[pgsz]; > > > } > > > > > > + /* > > > + * tdx_memory_shutdown() also reads TDMR's PAMT during > > > + * kexec() or reboot, which could happen at anytime, even > > > + * during this particular code. Make sure pamt_4k_base > > > + * is firstly set otherwise tdx_memory_shutdown() may > > > + * get an invalid PAMT base when it sees a valid number > > > + * of PAMT pages. > > > + */ > > > > Hmm? What prevents compiler from messing this up. It can reorder as it > > wishes, no? > > Hmm.. Right. Sorry I missed. > > > > > Maybe add a proper locking? Anything that prevent preemption would do, > > right? > > > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; > > > > I think a simple memory barrier will do. How does below look? > > --- a/arch/x86/virt/vmx/tdx/tdx.c > +++ b/arch/x86/virt/vmx/tdx/tdx.c > @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > * tdx_memory_shutdown() also reads TDMR's PAMT during > * kexec() or reboot, which could happen at anytime, even > * during this particular code. Make sure pamt_4k_base > - * is firstly set otherwise tdx_memory_shutdown() may > - * get an invalid PAMT base when it sees a valid number > - * of PAMT pages. > + * is firstly set and place a __mb() after it otherwise > + * tdx_memory_shutdown() may get an invalid PAMT base > + * when it sees a valid number of PAMT pages. > */ > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > + __mb(); If you want to play with barriers, assign pamt_4k_base the last with smp_store_release() and read it first in tdmr_get_pamt() with smp_load_acquire(). If it is non-zero, all pamt_* fields are valid. Or just drop this non-sense and use a spin lock for serialization.
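[For illustration, the release/acquire pairing suggested here could look roughly like the following. This is only a sketch: the field names come from the patch, tdmr_get_pamt() is shown schematically, and the real helpers may differ.]

/* Writer side, tdmr_set_up_pamt(): publish pamt_4k_base last. */
tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
tdmr->pamt_2m_size = pamt_size[TDX_PS_2M];
tdmr->pamt_1g_base = pamt_base[TDX_PS_1G];
tdmr->pamt_1g_size = pamt_size[TDX_PS_1G];
/* Release: all stores above are visible before the base becomes non-zero. */
smp_store_release(&tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);

/* Reader side, e.g. tdmr_get_pamt(): acquire the base first. */
u64 pamt_4k_base = smp_load_acquire(&tdmr->pamt_4k_base);
if (!pamt_4k_base)
        return;         /* PAMT not published yet, nothing to reset */
/* Every other pamt_* field read after this point is valid. */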
On Mon, 2023-06-12 at 10:58 +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > > On Fri, 2023-06-09 at 16:23 +0300, kirill.shutemov@linux.intel.com wrote: > > > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote: > > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > > > index 8ff07256a515..0aa413b712e8 100644 > > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > > > tdmr_pamt_base += pamt_size[pgsz]; > > > > } > > > > > > > > + /* > > > > + * tdx_memory_shutdown() also reads TDMR's PAMT during > > > > + * kexec() or reboot, which could happen at anytime, even > > > > + * during this particular code. Make sure pamt_4k_base > > > > + * is firstly set otherwise tdx_memory_shutdown() may > > > > + * get an invalid PAMT base when it sees a valid number > > > > + * of PAMT pages. > > > > + */ > > > > > > Hmm? What prevents compiler from messing this up. It can reorder as it > > > wishes, no? > > > > Hmm.. Right. Sorry I missed. > > > > > > > > Maybe add a proper locking? Anything that prevent preemption would do, > > > right? > > > > > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; > > > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; > > > > > > > I think a simple memory barrier will do. How does below look? > > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, > > * tdx_memory_shutdown() also reads TDMR's PAMT during > > * kexec() or reboot, which could happen at anytime, even > > * during this particular code. Make sure pamt_4k_base > > - * is firstly set otherwise tdx_memory_shutdown() may > > - * get an invalid PAMT base when it sees a valid number > > - * of PAMT pages. > > + * is firstly set and place a __mb() after it otherwise > > + * tdx_memory_shutdown() may get an invalid PAMT base > > + * when it sees a valid number of PAMT pages. > > */ > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; > > + __mb(); > > If you want to play with barriers, assign pamt_4k_base the last with > smp_store_release() and read it first in tdmr_get_pamt() with > smp_load_acquire(). If it is non-zero, all pamt_* fields are valid. > > Or just drop this non-sense and use a spin lock for serialization. > We don't need to guarantee when pamt_4k_base is valid, all other pamt_* are valid. Instead, we need to guarantee when (at least) _one_ of pamt_*_size is valid, the pamt_4k_base is valid. For example, pamt_4k_base -> valid pamt_4k_size -> invalid (0) pamt_2m_size -> invalid pamt_1g_size -> invalid and pamt_4k_base -> valid pamt_4k_size -> valid pamt_2m_size -> invalid pamt_1g_size -> invalid are both OK. The reason is the PAMTs are only written by the TDX module in init_tdmrs(). So if tdx_memory_shutdown() sees a part of PAMT (the second case above), those PAMT pages are not yet TDX private pages, thus converting part of PAMT is fine. The invalid case is when any pamt_*_size is valid, pamt_4k_base is invalid, e.g.: pamt_4k_base -> invalid pamt_4k_size -> valid pamt_2m_size -> invalid pamt_1g_size -> invalid as this case tdx_memory_shutdown() will convert a incorrect (not partial) PAMT area. 
So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base will be seen by other cpus. Does it make sense?
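[To make the invariant concrete, the shutdown-side reader being assumed looks roughly like this -- a simplified, hypothetical sketch, not the actual code:]

/* Shutdown path, simplified: a non-zero PAMT size implies a valid base. */
if (!tdmr->pamt_4k_size && !tdmr->pamt_2m_size && !tdmr->pamt_1g_size)
        return;         /* no PAMT published for this TDMR yet */
/*
 * At least one pamt_*_size store is visible.  Because pamt_4k_base is
 * stored (and ordered by the barrier) before any of the sizes, the base
 * is valid here.  At worst only part of the PAMT is accounted so far,
 * and converting a partial PAMT back to normal is harmless because the
 * TDX module has not yet written to those pages.
 */
/* ... convert [pamt_4k_base, pamt_4k_base + published size) back to normal ... */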
On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote:
> Does it make sense?
I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb()
would be better), but it is not justified from complexity PoV. This
lockless exercise gave me a pause to understand.
Lockless doesn't buy you anything here, only increases complexity.
Just take a lock.
Kernel is big. I'm sure you'll find a better opportunity to be clever
about serialization :P
From: kirill.shutemov@linux.intel.com > Sent: 12 June 2023 12:49 > > On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote: > > Does it make sense? > > I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb() > would be better), but it is not justified from complexity PoV. Given that x86 performs writes pretty much in code order, do you need anything more than a compiler barrier? > This lockless exercise gave me a pause to understand. > > Lockless doesn't buy you anything here, only increases complexity. > Just take a lock. Indeed... David
On 6/12/23 03:27, Huang, Kai wrote: > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > will be seen by other cpus. > > Does it make sense? Just use a normal old atomic_t or set_bit()/test_bit(). They have built-in memory barriers and are less likely to get botched.
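[A rough sketch of the set_bit()/test_bit() idea follows; the flag variable and bit name are invented here, and explicit barriers are spelled out since non-value-returning bitops do not by themselves guarantee ordering on all architectures:]

static unsigned long tdx_pamt_flags;            /* hypothetical */
#define TDX_PAMT_VALID          0               /* hypothetical bit number */

/* Writer, after all tdmr->pamt_* fields have been filled in: */
smp_mb__before_atomic();                        /* order pamt_* stores before the flag */
set_bit(TDX_PAMT_VALID, &tdx_pamt_flags);

/* Reader (shutdown path): */
if (!test_bit(TDX_PAMT_VALID, &tdx_pamt_flags))
        return;                                 /* PAMT not set up, nothing to do */
smp_rmb();                                      /* pair with the writer's ordering */
tdmrs_reset_pamt_all(&tdx_tdmr_list);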
On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > On 6/12/23 03:27, Huang, Kai wrote: > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > will be seen by other cpus. > > > > Does it make sense? > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > built-in memory barriers are are less likely to get botched. Thanks for the suggestion. Hi Dave, Kirill, I'd like to check with you whether we should introduce a mechanism to track TDX private pages for both this patch and the next. As you can see this patch only deals with PAMT pages due to a couple of reasons mentioned in the changelog. The next MCE patch handles all TDX private pages, but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is slow (probably doesn't matter, though); 2) it brings additional risk of triggering further #MC inside the TDX module, although such risk should be a theoretical thing. If we introduce a helper to mark a page as a TDX private page, then both above patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in this patch (we will need a way to loop over all TDX-usable memory pages, but this needs to be done anyway with TDX guests). I believe eventually we can end up with less code. In terms of how to do it: for PAMT pages, we can set page->private to a TDX magic number because they come out of the page allocator directly. Secure-EPT pages are like PAMT pages too. For TDX guest private pages, Sean is moving to implement KVM's own pseudo filesystem so they will have a unique mapping to identify. https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c And my thinking is in this TDX host series, we can just handle PAMT pages. Both Secure-EPT and TDX guest private pages can be handled later in the KVM TDX series. I think eventually we can have a function like below to tell whether a page is a TDX private page: bool page_is_tdx_private(struct page *page) { if (page->private == TDX_PRIVATE_MAGIC) return true; if (!page_mapping(page)) return false; return page_mapping(page)->a_ops == &kvm_gmem_ops; } How does this sound? Or any other comments? Thanks!
On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote: > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > > On 6/12/23 03:27, Huang, Kai wrote: > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > > will be seen by other cpus. > > > > > > Does it make sense? > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > built-in memory barriers are are less likely to get botched. > > Thanks for the suggestion. > > Hi Dave, Kirill, > > I'd like to check with you that whether we should introduce a mechanism to track > TDX private pages for both this patch and the next. > > As you can see this patch only deals PAMT pages due to couple of reasons that > mnentioned in the changelog. The next MCE patch handles all TDX private pages, > but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is > slow (probably doesn't matter, though); 2) it brings additional risk of > triggering further #MC inside TDX module, although such risk should be a > theoretical thing. > > If we introduce a helper to mark a page as TDX private page, then both above > patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in > this patch (we will need a way to loop all TDX-usable memory pages, but this > needs to be done anyway with TDX guests). I believe eventually we can end up > with less code. > > In terms of how to do, for PAMT pages, we can set page->private to a TDX magic > number because they come out of page allocator directly. Secure-EPT pages are > like PAMT pages too. For TDX guest private pages, Sean is moving to implement > KVM's own pseudo filesystem so they will have a unique mapping to identify. > > https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c > > And my thinking is in this TDX host series, we can just handle PAMT pages. Both > secure-EPT and TDX guest private pages can be handled later in KVM TDX series. > I think eventually we can have a function like below to tell whether a page is > TDX private page: > > bool page_is_tdx_private(struct page *page) > { > if (page->private == TDX_PRIVATE_MAGIC) > return true; > > if (!page_mapping(page)) > return false; > > return page_mapping(page)->a_ops == &kvm_gmem_ops; > } > > How does this sound? Or any other comments? Thanks! If you going to take this path it has to be supported natively by kvm_gmem_: it has to provide API for that. You should not assume that page->private is free to use. It is owned by kvm_gmmem.
On 6/12/23 17:51, Huang, Kai wrote:
> If we introduce a helper to mark a page as TDX private page,
Let me get this right: you have working, functional code for a
highly-unlikely scenario (kernel bugs or even more rare hardware
errors). But, you want to optimize this super-rare case? It's not fast
enough?
Is there any other motivation here that I'm missing?
On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: > On 6/12/23 17:51, Huang, Kai wrote: > > If we introduce a helper to mark a page as TDX private page, > > Let me get this right: you have working, functional code for a > highly-unlikely scenario (kernel bugs or even more rare hardware > errors). But, you want to optimize this super-rare case? It's not fast > enough? > > Is there any other motivation here that I'm missing? > No, it's not about speed. The motivation is to have common code that yields fewer lines of code, though I don't have a clear number of how many LoC can be reduced.
On Tue, 2023-06-13 at 14:05 +0300, kirill.shutemov@linux.intel.com wrote: > On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote: > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > > > On 6/12/23 03:27, Huang, Kai wrote: > > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > > > will be seen by other cpus. > > > > > > > > Does it make sense? > > > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > > built-in memory barriers are are less likely to get botched. > > > > Thanks for the suggestion. > > > > Hi Dave, Kirill, > > > > I'd like to check with you that whether we should introduce a mechanism to track > > TDX private pages for both this patch and the next. > > > > As you can see this patch only deals PAMT pages due to couple of reasons that > > mnentioned in the changelog. The next MCE patch handles all TDX private pages, > > but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is > > slow (probably doesn't matter, though); 2) it brings additional risk of > > triggering further #MC inside TDX module, although such risk should be a > > theoretical thing. > > > > If we introduce a helper to mark a page as TDX private page, then both above > > patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in > > this patch (we will need a way to loop all TDX-usable memory pages, but this > > needs to be done anyway with TDX guests). I believe eventually we can end up > > with less code. > > > > In terms of how to do, for PAMT pages, we can set page->private to a TDX magic > > number because they come out of page allocator directly. Secure-EPT pages are > > like PAMT pages too. For TDX guest private pages, Sean is moving to implement > > KVM's own pseudo filesystem so they will have a unique mapping to identify. > > > > https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c > > > > And my thinking is in this TDX host series, we can just handle PAMT pages. Both > > secure-EPT and TDX guest private pages can be handled later in KVM TDX series. > > I think eventually we can have a function like below to tell whether a page is > > TDX private page: > > > > bool page_is_tdx_private(struct page *page) > > { > > if (page->private == TDX_PRIVATE_MAGIC) > > return true; > > > > if (!page_mapping(page)) > > return false; > > > > return page_mapping(page)->a_ops == &kvm_gmem_ops; > > } > > > > How does this sound? Or any other comments? Thanks! > > If you going to take this path it has to be supported natively by > kvm_gmem_: it has to provide API for that. > Yes. > You should not assume that > page->private is free to use. It is owned by kvm_gmmem. > page->private is only for PAMT and SEPT pages. kmem_gmem has it's own mapping which can be used to identify the pages owned by it. Hmm.. I think we should just leave them out for now, as they theoretically are owned by KVM thus can be handled by KVM, e.g., in it's reboot or shutdown or module unload code path. If we are fine to use SEAMCALL in the #MC handler code path, I think perhaps we can just keep using TDMRs to locate PAMTs.
On 6/13/23 16:18, Huang, Kai wrote: > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: >> On 6/12/23 17:51, Huang, Kai wrote: >>> If we introduce a helper to mark a page as TDX private page, >> Let me get this right: you have working, functional code for a >> highly-unlikely scenario (kernel bugs or even more rare hardware >> errors). But, you want to optimize this super-rare case? It's not fast >> enough? >> >> Is there any other motivation here that I'm missing? >> > No it's not about speed. The motivation is to have a common code to yield less > line of code, though I don't have clear number of how many LoC can be reduced. OK, so ... ballpark. How many lines of code are we going to _save_ for this super-rare case? 10? 100? 1000? The upside is saving X lines of code ... somewhere. The downside is adding Y lines of code ... somewhere else and maybe breaking things in the process. You've evidently done _some_ kind of calculus in your head to make this tradeoff worthwhile. I'd love to hear what your calculus is, even if it's just a gut feel. Could you share your logic here, please?
On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote: > On 6/13/23 16:18, Huang, Kai wrote: > > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: > > > On 6/12/23 17:51, Huang, Kai wrote: > > > > If we introduce a helper to mark a page as TDX private page, > > > Let me get this right: you have working, functional code for a > > > highly-unlikely scenario (kernel bugs or even more rare hardware > > > errors). But, you want to optimize this super-rare case? It's not fast > > > enough? > > > > > > Is there any other motivation here that I'm missing? > > > > > No it's not about speed. The motivation is to have a common code to yield less > > line of code, though I don't have clear number of how many LoC can be reduced. > > OK, so ... ballpark. How many lines of code are we going to _save_ for > this super-rare case? 10? 100? 1000? ~50 LoC I guess, certainly < 100. > > The upside is saving X lines of code ... somewhere. The downside is > adding Y lines of code ... somewhere else and maybe breaking things in > the process. > > You've evidently done _some_ kind of calculus in your head to make this > tradeoff worthwhile. I'd love to hear what your calculus is, even if > it's just a gut feel. > > Could you share your logic here, please? The logic is the whole tdx_is_private_mem() function in the next patch (#MC handling one) can be significantly reduced from 100 -> ~10, and we roughly needs some more code (<50 LoC) to mark PAMT as private.
On Wed, 2023-06-14 at 00:38 +0000, Huang, Kai wrote: > On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote: > > On 6/13/23 16:18, Huang, Kai wrote: > > > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote: > > > > On 6/12/23 17:51, Huang, Kai wrote: > > > > > If we introduce a helper to mark a page as TDX private page, > > > > Let me get this right: you have working, functional code for a > > > > highly-unlikely scenario (kernel bugs or even more rare hardware > > > > errors). But, you want to optimize this super-rare case? It's not fast > > > > enough? > > > > > > > > Is there any other motivation here that I'm missing? > > > > > > > No it's not about speed. The motivation is to have a common code to yield less > > > line of code, though I don't have clear number of how many LoC can be reduced. > > > > OK, so ... ballpark. How many lines of code are we going to _save_ for > > this super-rare case? 10? 100? 1000? > > ~50 LoC I guess, certainly < 100. > > > > > The upside is saving X lines of code ... somewhere. The downside is > > adding Y lines of code ... somewhere else and maybe breaking things in > > the process. > > > > You've evidently done _some_ kind of calculus in your head to make this > > tradeoff worthwhile. I'd love to hear what your calculus is, even if > > it's just a gut feel. > > > > Could you share your logic here, please? > > The logic is the whole tdx_is_private_mem() function in the next patch (#MC > handling one) can be significantly reduced from 100 -> ~10, and we roughly needs > some more code (<50 LoC) to mark PAMT as private. > Apologize, should be "we roughly need some more code (<50 LoC) to mark PAMT and Secure-EPT and TDX guest private pages as TDX private pages". But now we only have PAMT.
On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote: > --- a/arch/x86/kernel/reboot.c > +++ b/arch/x86/kernel/reboot.c > @@ -720,6 +720,7 @@ void native_machine_shutdown(void) > > #ifdef CONFIG_X86_64 > x86_platform.iommu_shutdown(); > + x86_platform.memory_shutdown(); > #endif > } Hi Kirill/Dave, I missed that this solution doesn't reset TDX private memory for emergency restart or when reboot_force is set, because machine_shutdown() isn't called for them. Is it acceptable? Or should we handle them too?
On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote: > On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote: > > --- a/arch/x86/kernel/reboot.c > > +++ b/arch/x86/kernel/reboot.c > > @@ -720,6 +720,7 @@ void native_machine_shutdown(void) > > > > #ifdef CONFIG_X86_64 > > x86_platform.iommu_shutdown(); > > + x86_platform.memory_shutdown(); > > #endif > > } > > Hi Kirill/Dave, > > I missed that this solution doesn't reset TDX private for emergency restart or > when reboot_force is set, because machine_shutdown() isn't called for them. > > Is it acceptable? Or should we handle them too? Force reboot is not used in kexec path, right? And the platform has to handle erratum in BIOS to reset memory status on reboot anyway. I think we should be fine. But it worth mentioning it in the commit message.
On Wed, 2023-06-14 at 13:02 +0300, kirill.shutemov@linux.intel.com wrote: > On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote: > > On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote: > > > --- a/arch/x86/kernel/reboot.c > > > +++ b/arch/x86/kernel/reboot.c > > > @@ -720,6 +720,7 @@ void native_machine_shutdown(void) > > > > > > #ifdef CONFIG_X86_64 > > > x86_platform.iommu_shutdown(); > > > + x86_platform.memory_shutdown(); > > > #endif > > > } > > > > Hi Kirill/Dave, > > > > I missed that this solution doesn't reset TDX private for emergency restart or > > when reboot_force is set, because machine_shutdown() isn't called for them. > > > > Is it acceptable? Or should we handle them too? > > Force reboot is not used in kexec path, right? > Correct. > And the platform has to > handle erratum in BIOS to reset memory status on reboot anyway. So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX private pages, and the BIOS needs to disable "warm reset". IIUC this means the kernel needs to depend on specific BIOS setting to work normally, and IIUC the kernel even cannot be aware of this setting? Should the kernel just reset all TDX private pages when erratum is present during reboot so the kernel doesn't depend on BIOS? > > I think we should be fine. But it worth mentioning it in the commit > message. > Agreed.
On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote: > > And the platform has to > > handle erratum in BIOS to reset memory status on reboot anyway. > > So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX > private pages, and the BIOS needs to disable "warm reset". > > IIUC this means the kernel needs to depend on specific BIOS setting to work > normally, and IIUC the kernel even cannot be aware of this setting? > > Should the kernel just reset all TDX private pages when erratum is present > during reboot so the kernel doesn't depend on BIOS? Kernel cannot really function if we don't trust BIOS to do its job. Kernel depends on BIOS services anyway. We cannot try to handle everything in kernel just in case BIOS drops the ball.
On Wed, 2023-06-14 at 14:08 +0300, kirill.shutemov@linux.intel.com wrote: > On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote: > > > And the platform has to > > > handle erratum in BIOS to reset memory status on reboot anyway. > > > > So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX > > private pages, and the BIOS needs to disable "warm reset". > > > > IIUC this means the kernel needs to depend on specific BIOS setting to work > > normally, and IIUC the kernel even cannot be aware of this setting? > > > > Should the kernel just reset all TDX private pages when erratum is present > > during reboot so the kernel doesn't depend on BIOS? > > Kernel cannot really function if we don't trust BIOS to do its job. Kernel > depends on BIOS services anyway. We cannot try to handle everything in > kernel just in case BIOS drops the ball. > In other words, I assume we just need to take care of kexec(). The current patch tries to handle reboot too, so I'll change to only cover kexec(), assuming the BIOS will always disable warm reset reboot for platforms with this erratum. Thanks.
On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > On 6/12/23 03:27, Huang, Kai wrote: > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > will be seen by other cpus. > > > > Does it make sense? > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > built-in memory barriers are are less likely to get botched. Hi Dave, Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a little bit silly or overkill IMHO. Looking at the code, it seems arch_atomic_set() simply uses __WRITE_ONCE(): static __always_inline void arch_atomic_set(atomic_t *v, int i) { __WRITE_ONCE(v->counter, i); } Is it better to just use __WRITE_ONCE() or WRITE_ONCE() here? - tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; + WRITE_ONCE(tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);
On 6/19/23 04:43, Huang, Kai wrote: > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: >> On 6/12/23 03:27, Huang, Kai wrote: >>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as >>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base >>> will be seen by other cpus. >>> >>> Does it make sense? >> Just use a normal old atomic_t or set_bit()/test_bit(). They have >> built-in memory barriers are are less likely to get botched. > Hi Dave, > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > little bit silly or overkill IMHO. Looking at the code, it seems > arch_atomic_set() simply uses __WRITE_ONCE(): How about _adding_ a variable that protects tdmr->pamt_4k_base? Wouldn't that be more straightforward than mucking around with existing types?
On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote: > On 6/19/23 04:43, Huang, Kai wrote: > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > >> On 6/12/23 03:27, Huang, Kai wrote: > >>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > >>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > >>> will be seen by other cpus. > >>> > >>> Does it make sense? > >> Just use a normal old atomic_t or set_bit()/test_bit(). They have > >> built-in memory barriers are are less likely to get botched. > > Hi Dave, > > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > > little bit silly or overkill IMHO. Looking at the code, it seems > > arch_atomic_set() simply uses __WRITE_ONCE(): > > How about _adding_ a variable that protects tdmr->pamt_4k_base? > Wouldn't that be more straightforward than mucking around with existing > types? What's wrong with simple global spinlock that protects all tdmr->pamt_*? It is much easier to follow than a custom serialization scheme.
On Mon, 2023-06-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote: > On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote: > > On 6/19/23 04:43, Huang, Kai wrote: > > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote: > > > > On 6/12/23 03:27, Huang, Kai wrote: > > > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as > > > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base > > > > > will be seen by other cpus. > > > > > > > > > > Does it make sense? > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > > > built-in memory barriers are are less likely to get botched. > > > Hi Dave, > > > > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > > > little bit silly or overkill IMHO. Looking at the code, it seems > > > arch_atomic_set() simply uses __WRITE_ONCE(): > > > > How about _adding_ a variable that protects tdmr->pamt_4k_base? > > Wouldn't that be more straightforward than mucking around with existing > > types? > > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > It is much easier to follow than a custom serialization scheme. > For this patch I think it's overkill to use spinlock because when the rebooting cpu is reading this all other cpus have been stopped already, so there's no concurrent thing here. However I just recall that the next #MC handler patch can also take advantage of this too because #MC handler can truly run concurrently with module initialization. Currently that one reads tdx_module_status first but again we may have the same memory order issue. So having a spinlock makes sense from #MC handler patch's point of view. I'll change to use spinlock if Dave is fine? Thanks for feedback!
On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote: >>> >>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a >>> little bit silly or overkill IMHO. Looking at the code, it seems >>> arch_atomic_set() simply uses __WRITE_ONCE(): >> How about _adding_ a variable that protects tdmr->pamt_4k_base? >> Wouldn't that be more straightforward than mucking around with existing >> types? > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > It is much easier to follow than a custom serialization scheme. Quick, what prevents a: spin_lock() => #MC => spin_lock() deadlock? Plain old test/sets don't deadlock ever.
On Mon, 2023-06-19 at 16:41 -0700, Dave Hansen wrote: > On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote: > > > > > > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > > > > little bit silly or overkill IMHO. Looking at the code, it seems > > > > arch_atomic_set() simply uses __WRITE_ONCE(): > > > How about _adding_ a variable that protects tdmr->pamt_4k_base? > > > Wouldn't that be more straightforward than mucking around with existing > > > types? > > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > > It is much easier to follow than a custom serialization scheme. > > Quick, what prevents a: > > spin_lock() => #MC => spin_lock() > > deadlock? > > Plain old test/sets don't deadlock ever. Agreed. So I think having any locking in #MC handle is kinda dangerous. Adding "a" variable has another advantage: We can have a more precise result of whether we need to reset PAMT pages, even those PAMTs are already allocated and set to the TDMRs, because the TDX module only starts to write PAMTs using global KeyID until some SEAMCALL. Any comments to below? +static bool tdx_private_mem_begin; + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -1141,6 +1143,8 @@ static int init_tdx_module(void) */ wbinvd_on_all_cpus(); + WRITE_ONCE(tdx_private_mem_begin, true); + /* Config the key of global KeyID on all packages */ ret = config_global_keyid(); if (ret) @@ -1463,6 +1467,14 @@ static void tdx_memory_shutdown(void) */ WARN_ON_ONCE(num_online_cpus() != 1); + /* + * It's not possible to have any TDX private pages if the TDX + * module hasn't started to write any memory using the global + * KeyID. + */ + if (!READ_ONCE(tdx_private_mem_begin)) + return; + tdmrs_reset_pamt_all(&tdx_tdmr_list);
On 6/19/23 17:56, Huang, Kai wrote: > Any comments to below? Nothing that I haven't already said in this thread: > Just use a normal old atomic_t or set_bit()/test_bit(). They have > built-in memory barriers are are less likely to get botched. I kinda made a point of literally suggesting "atomic_t or set_bit()/test_bit()". I even told you why: "built-in memory barriers". Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
On Mon, Jun 19, 2023 at 04:41:13PM -0700, Dave Hansen wrote: > On 6/19/23 07:46, kirill.shutemov@linux.intel.com wrote: > >>> > >>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a > >>> little bit silly or overkill IMHO. Looking at the code, it seems > >>> arch_atomic_set() simply uses __WRITE_ONCE(): > >> How about _adding_ a variable that protects tdmr->pamt_4k_base? > >> Wouldn't that be more straightforward than mucking around with existing > >> types? > > What's wrong with simple global spinlock that protects all tdmr->pamt_*? > > It is much easier to follow than a custom serialization scheme. > > Quick, what prevents a: > > spin_lock() => #MC => spin_lock() > > deadlock? > > Plain old test/sets don't deadlock ever. Depends on what you mean; anything that spin-waits will deadlock, doesn't matter if its a test-and-set or not. The thing with these non-maskable exceptions/interrupts is that they must be wait-free. If serialization is required it needs to be try based and accept failure without waiting.
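[As an illustration of the try-based approach: the #MC-side reader never spins. This is purely a sketch; the lock and the helper are hypothetical.]

/* #MC handler path: never wait, give up if an update is in flight. */
bool priv;

if (!spin_trylock(&tdx_pamt_lock))
        return false;   /* treat the page as "unknown" rather than deadlock */
priv = page_is_tdx_private_locked(page);        /* hypothetical helper */
spin_unlock(&tdx_pamt_lock);
return priv;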
On Mon, Jun 19, 2023 at 06:06:30PM -0700, Dave Hansen wrote: > On 6/19/23 17:56, Huang, Kai wrote: > > Any comments to below? > > Nothing that I haven't already said in this thread: > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > built-in memory barriers are are less likely to get botched. > > I kinda made a point of literally suggesting "atomic_t or > set_bit()/test_bit()". I even told you why: "built-in memory barriers". > > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers. x86 has built-in memory barriers for being TSO :-) Specifically all barriers provided by spinlock (acquire/release) are no-ops on x86. (strictly speaking locks imply stronger order than they have to because TSO atomic ops imply stronger ordering than required) There is one (and only the one) re-ordering possible on TSO and that is the store-buffer, later loads can fail to observe prior stores. If that is a concern, you need explicit barriers. This is #MC, much care and explicit open-coded crap is expected. Also, this is #MC, much broken is also expected :-( As in, the current #MC handler is a know pile of shit. Basically the whole of #MC should be noinstr -- it isn't and that's a significant problem. Also we still very much suffer the NMI <- #MC problem and the #MC latch is known broken garbage. Whatever you do, do it very carefully, double check and be more careful.
On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> + __mb();
__mb() is not a valid interface to use.
On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote: > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > > > + __mb(); > > __mb() is not a valid interface to use. Thanks for feedback! May I ask why, for education purpose? :)
On Tue, Jun 20, 2023 at 10:42:32AM +0000, Huang, Kai wrote: > On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote: > > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote: > > > > > + __mb(); > > > > __mb() is not a valid interface to use. > > Thanks for feedback! > > May I ask why, for education purpose? :) It's the raw MFENCE wrapper, not one of the *many* documented barriers. Also, typically you do *not* want MFENCE; MFENCE bad.
On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote: > On 6/19/23 17:56, Huang, Kai wrote: > > Any comments to below? > > Nothing that I haven't already said in this thread: > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > built-in memory barriers are are less likely to get botched. > > I kinda made a point of literally suggesting "atomic_t or > set_bit()/test_bit()". I even told you why: "built-in memory barriers". > > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers. > Hi Dave, Sorry to bring this up again. I thought more on this topic, and I think using atotmic_t is only necessary if we add it right after setting up tdmr->pamt_* in tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory barrier to make sure memory order (as Kirill commented in the first reply). However, if we add a new variable like below ... +static bool tdx_private_mem_begin; + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -1123,6 +1125,8 @@ static int init_tdx_module(void) */ wbinvd_on_all_cpus(); + tdx_private_mem_begin = true; ... then we don't need any more explicit barrier, because: 1) it's not possible for compiler to optimize the order between setting tdmr->pamt_* and tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a serializing instruction so the wbinvd_on_all_cpus() above has already implied memory barrier. Does this make sense?
On Sun, 2023-06-25 at 15:30 +0000, Huang, Kai wrote: > On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote: > > On 6/19/23 17:56, Huang, Kai wrote: > > > Any comments to below? > > > > Nothing that I haven't already said in this thread: > > > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have > > > built-in memory barriers are are less likely to get botched. > > > > I kinda made a point of literally suggesting "atomic_t or > > set_bit()/test_bit()". I even told you why: "built-in memory barriers". > > > > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers. > > > > Hi Dave, > > Sorry to bring this up again. I thought more on this topic, and I think using > atotmic_t is only necessary if we add it right after setting up tdmr->pamt_* in > tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory > barrier to make sure memory order (as Kirill commented in the first reply). > > However, if we add a new variable like below ... > > +static bool tdx_private_mem_begin; > + > /* > * Wrapper of __seamcall() to convert SEAMCALL leaf function error code > * to kernel error code. @seamcall_ret and @out contain the SEAMCALL > @@ -1123,6 +1125,8 @@ static int init_tdx_module(void) > */ > wbinvd_on_all_cpus(); > > + tdx_private_mem_begin = true; > > > ... then we don't need any more explicit barrier, because: 1) it's not possible > for compiler to optimize the order between setting tdmr->pamt_* and > tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a > serializing instruction so the wbinvd_on_all_cpus() above has already implied > memory barrier. > > Does this make sense? Sorry please ignore this. I missed a corner case that the kexec() can happen when something goes wrong during module initialization and when PAMTs/TDMRs are being freed. We still need explicit memory barrier for this case. I will use atomic_t as suggested. Thanks!
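[For reference, a sketch of the atomic_t direction being settled on. The variable name follows the earlier draft; whether the _release/_acquire variants or plain atomic ops plus explicit barriers end up being used is left open here.]

static atomic_t tdx_private_mem_begin = ATOMIC_INIT(0);

/*
 * init_tdx_module(), before the TDX module can start dirtying memory
 * with the global KeyID (and before PAMTs are freed on an error path):
 */
atomic_set_release(&tdx_private_mem_begin, 1);

/* tdx_memory_shutdown(): */
if (!atomic_read_acquire(&tdx_private_mem_begin))
        return;         /* no TDX private memory can exist yet */
tdmrs_reset_pamt_all(&tdx_tdmr_list);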
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 88085f369ff6..d2c6742b185a 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -299,6 +299,7 @@ struct x86_platform_ops { void (*get_wallclock)(struct timespec64 *ts); int (*set_wallclock)(const struct timespec64 *ts); void (*iommu_shutdown)(void); + void (*memory_shutdown)(void); bool (*is_untracked_pat_range)(u64 start, u64 end); void (*nmi_init)(void); unsigned char (*get_nmi_reason)(void); diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index b3d0e015dae2..6aadfec8df7a 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -720,6 +720,7 @@ void native_machine_shutdown(void) #ifdef CONFIG_X86_64 x86_platform.iommu_shutdown(); + x86_platform.memory_shutdown(); #endif } diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index d82f4fa2f1bf..344250b35a5d 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -31,6 +31,7 @@ void x86_init_noop(void) { } void __init x86_init_uint_noop(unsigned int unused) { } static int __init iommu_init_noop(void) { return 0; } static void iommu_shutdown_noop(void) { } +static void memory_shutdown_noop(void) { } bool __init bool_x86_init_noop(void) { return false; } void x86_op_int_noop(int cpu) { } int set_rtc_noop(const struct timespec64 *now) { return -EINVAL; } @@ -142,6 +143,7 @@ struct x86_platform_ops x86_platform __ro_after_init = { .get_wallclock = mach_get_cmos_time, .set_wallclock = mach_set_cmos_time, .iommu_shutdown = iommu_shutdown_noop, + .memory_shutdown = memory_shutdown_noop, .is_untracked_pat_range = is_ISA_range, .nmi_init = default_nmi_init, .get_nmi_reason = default_get_nmi_reason, diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 8ff07256a515..0aa413b712e8 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr, tdmr_pamt_base += pamt_size[pgsz]; } + /* + * tdx_memory_shutdown() also reads TDMR's PAMT during + * kexec() or reboot, which could happen at anytime, even + * during this particular code. Make sure pamt_4k_base + * is firstly set otherwise tdx_memory_shutdown() may + * get an invalid PAMT base when it sees a valid number + * of PAMT pages. + */ tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; @@ -1318,6 +1326,46 @@ static struct notifier_block tdx_memory_nb = { .notifier_call = tdx_memory_notifier, }; +static void tdx_memory_shutdown(void) +{ + /* + * Convert all TDX private pages back to normal if the platform + * has "partial write machine check" erratum. + * + * For now there's no existing infrastructure to tell whether + * a page is TDX private memory. Using SEAMCALL to query TDX + * module isn't feasible either because: 1) VMX has been turned + * off by reaching here so SEAMCALL cannot be made; 2) Even + * SEAMCALL can be made the result from TDX module may not be + * accurate (e.g., remote CPU can be stopped while the kernel + * is in the middle of reclaiming one TDX private page and doing + * MOVDIR64B). + * + * One solution could be just converting all memory pages, but + * this may bring non-trivial latency on large memory systems + * (especially when the number of TDX private pages is small). + * Looks eventually the kernel should track TDX private pages and + * only convert these. 
+ * + * Also, not all pages are mapped as writable in direct mapping, + * thus it's problematic to do so. It can be done by switching + * to the identical mapping page table built for kexec(), which + * maps all pages as writable, but the complexity looks overkill. + * + * Thus instead of doing something dramatic to convert all pages, + * only convert PAMTs for now as for now TDX private pages can + * only be PAMT. Converting TDX guest private pages and Secure + * EPT pages can be added later when the kernel has a proper way + * to track these pages. + * + * All other cpus are already dead, thus it's safe to read TDMRs + * to find PAMTs w/o holding any kind of locking here. + */ + WARN_ON_ONCE(num_online_cpus() != 1); + + tdmrs_reset_pamt_all(&tdx_tdmr_list); +} + static int __init tdx_init(void) { u32 tdx_keyid_start, nr_tdx_keyids; @@ -1356,6 +1404,15 @@ static int __init tdx_init(void) tdx_guest_keyid_start = ++tdx_keyid_start; tdx_nr_guest_keyids = --nr_tdx_keyids; + /* + * On the platform with erratum all TDX private pages need to + * be converted back to normal before rebooting (warm reset) or + * before kexec() booting to the new kernel, otherwise the (new) + * kernel may get unexpected SRAR machine check exception. + */ + if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) + x86_platform.memory_shutdown = tdx_memory_shutdown; + return 0; no_tdx: return -ENODEV;
The first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened.

== Background ==

Virtually all kernel memory access operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64-byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction of less than a cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA.

== Problem ==

A fast warm reset doesn't reset TDX private memory. Kexec() can also boot into the new kernel directly. Thus if the old kernel has enabled TDX on a platform with this erratum, the new kernel may get an unexpected machine check. Note that without this erratum, kernel reads/writes on TDX private memory should never cause a machine check, thus it's OK for the old kernel to leave TDX private pages as-is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all TDX private pages back to normal to give the new kernel a clean slate after either a fast warm reset or kexec().

There's no existing infrastructure to track TDX private pages, which could be PAMT pages, TDX guest private pages, or SEPT (secure EPT) pages. The latter two are yet to be implemented, so it's not yet certain how to track them. It's not feasible to query the TDX module either, because VMX has already been stopped when KVM receives the reboot notifier.

Another option is to blindly convert all memory pages. But this may bring non-trivial latency to machine reboot and kexec() on large memory systems (especially when the number of TDX private pages is small). A final solution should track TDX private pages and convert only them. Converting all memory pages is also problematic because not all pages are mapped as writable in the direct mapping; doing so would require switching to a page table that maps all pages as writable, either a new one or the identity-mapping table built for kexec(). Using either seems too dramatic, especially considering the kernel should eventually be able to track all TDX private pages, in which case the direct mapping can be used directly.

So for now just convert PAMT pages. Converting TDX guest private pages and SEPT pages can be added when support for TDX guests is added to the kernel.

Introduce a new "x86_platform_ops::memory_shutdown()" callback as a placeholder to convert all TDX private memory, and call it at the end of machine_shutdown() after all remote cpus have been stopped (thus no more TDX activity) and all dirty cachelines of TDX private memory have been flushed (thus no later cacheline writeback). Implement the default callback as a no-op. In TDX early boot-time initialization, replace the callback with TDX's own implementation when the platform has this erratum, so that only platforms with this erratum carry the additional memory conversion burden.
Signed-off-by: Kai Huang <kai.huang@intel.com> --- v10 -> v11: - New patch --- arch/x86/include/asm/x86_init.h | 1 + arch/x86/kernel/reboot.c | 1 + arch/x86/kernel/x86_init.c | 2 ++ arch/x86/virt/vmx/tdx/tdx.c | 57 +++++++++++++++++++++++++++++++++ 4 files changed, 61 insertions(+)