
[RFC,00/10] Process-local memory allocations for hiding KVM secrets

Message ID 20190612170834.14855-1-mhillenb@amazon.de (mailing list archive)

Message

Marius Hillenbrand June 12, 2019, 5:08 p.m. UTC
The Linux kernel has a global address space that is the same for any
kernel code. This address space becomes a liability in a world with
processor information leak vulnerabilities, such as L1TF. With the right
cache load gadget, an attacker-controlled hyperthread pair can leak
arbitrary data via L1TF. Disabling hyperthreading is one recommended
mitigation, but it comes with a large performance hit for a wide range
of workloads.

An alternative mitigation is to not make certain data in the kernel
globally visible, but only when the kernel executes in the context of
the process that this data belongs to.

This patch series proposes to introduce a region for what we call
process-local memory into the kernel's virtual address space. Page
tables and mappings in that region will be exclusive to one address
space, instead of implicitly shared between all kernel address spaces.
Any data placed in that region will be out of reach of cache load
gadgets that execute in different address spaces. To implement
process-local memory, we introduce a new interface kmalloc_proclocal() /
kfree_proclocal() that allocates and maps pages exclusively into the
current kernel address space. As a first use case, we move architectural
state of guest CPUs in KVM out of reach of other kernel address spaces.
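
For illustration, usage of the interface looks roughly like this (the
vcpu_secrets struct is a made-up example, not code from the patches):

        #include <linux/proclocal.h>
        #include <linux/types.h>

        struct vcpu_secrets {
                u64 regs[16];   /* illustrative payload */
        };

        /* Mapped only into the current mm's kernel address space; kernel
         * code running on behalf of other processes never sees it. */
        static struct vcpu_secrets *alloc_vcpu_secrets(void)
        {
                return kmalloc_proclocal(sizeof(struct vcpu_secrets));
        }

        static void free_vcpu_secrets(struct vcpu_secrets *s)
        {
                kfree_proclocal(s);
        }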

The patch set is a prototype for x86-64 that we have developed on top of
kernel 4.20.17 (with cherry-picked commit d253ca0c3865 "x86/mm/cpa: Add
set_direct_map_*() functions"). I am aware that the integration with KVM
will see some changes while rebasing to 5.x. Patches 7 and 8, in
particular, help make patch 9 more readable, but will be dropped in
rebasing. We have tested the code on both Intel and AMD CPUs, launching VMs
in a loop. So far, we have not done in-depth performance evaluation.
Impact on starting VMs was within measurement noise.

---

Julian Stecklina (2):
  kvm, vmx: move CR2 context switch out of assembly path
  kvm, vmx: move register clearing out of assembly path

Marius Hillenbrand (8):
  x86/mm/kaslr: refactor to use enum indices for regions
  x86/speculation, mm: add process local virtual memory region
  x86/mm, mm,kernel: add teardown for process-local memory to mm cleanup
  mm: allocate virtual space for process-local memory
  mm: allocate/release physical pages for process-local memory
  kvm/x86: add support for storing vCPU state in process-local memory
  kvm, vmx: move gprs to process local memory
  kvm, x86: move guest FPU state into process local memory

 Documentation/x86/x86_64/mm.txt         |  11 +-
 arch/x86/Kconfig                        |   1 +
 arch/x86/include/asm/kvm_host.h         |  40 ++-
 arch/x86/include/asm/page_64.h          |   4 +
 arch/x86/include/asm/pgtable_64_types.h |  12 +
 arch/x86/include/asm/proclocal.h        |  11 +
 arch/x86/kernel/head64.c                |   8 +
 arch/x86/kvm/Kconfig                    |  10 +
 arch/x86/kvm/kvm_cache_regs.h           |   4 +-
 arch/x86/kvm/svm.c                      | 104 +++++--
 arch/x86/kvm/vmx.c                      | 213 ++++++++++-----
 arch/x86/kvm/x86.c                      |  31 ++-
 arch/x86/mm/Makefile                    |   1 +
 arch/x86/mm/dump_pagetables.c           |   9 +
 arch/x86/mm/fault.c                     |  19 ++
 arch/x86/mm/kaslr.c                     |  63 ++++-
 arch/x86/mm/proclocal.c                 | 136 +++++++++
 include/linux/mm_types.h                |  13 +
 include/linux/proclocal.h               |  35 +++
 kernel/fork.c                           |   6 +
 mm/Makefile                             |   1 +
 mm/proclocal.c                          | 348 ++++++++++++++++++++++++
 security/Kconfig                        |  18 ++
 23 files changed, 978 insertions(+), 120 deletions(-)
 create mode 100644 arch/x86/include/asm/proclocal.h
 create mode 100644 arch/x86/mm/proclocal.c
 create mode 100644 include/linux/proclocal.h
 create mode 100644 mm/proclocal.c

Comments

Sean Christopherson June 12, 2019, 6:25 p.m. UTC | #1
On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
> The Linux kernel has a global address space that is the same for any
> kernel code. This address space becomes a liability in a world with
> processor information leak vulnerabilities, such as L1TF. With the right
> cache load gadget, an attacker-controlled hyperthread pair can leak
> arbitrary data via L1TF. Disabling hyperthreading is one recommended
> mitigation, but it comes with a large performance hit for a wide range
> of workloads.
> 
> An alternative mitigation is to not make certain data in the kernel
> globally visible, but only when the kernel executes in the context of
> the process that this data belongs to.
>
> This patch series proposes to introduce a region for what we call
> process-local memory into the kernel's virtual address space. Page
> tables and mappings in that region will be exclusive to one address
> space, instead of implicitly shared between all kernel address spaces.
> Any data placed in that region will be out of reach of cache load
> gadgets that execute in different address spaces. To implement
> process-local memory, we introduce a new interface kmalloc_proclocal() /
> kfree_proclocal() that allocates and maps pages exclusively into the
> current kernel address space. As a first use case, we move architectural
> state of guest CPUs in KVM out of reach of other kernel address spaces.

Can you briefly describe what types of attacks this is intended to
mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
make comments based on my potentially bad assumptions.
 
> The patch set is a prototype for x86-64 that we have developed on top of
> kernel 4.20.17 (with cherry-picked commit d253ca0c3865 "x86/mm/cpa: Add
> set_direct_map_*() functions"). I am aware that the integration with KVM
> will see some changes while rebasing to 5.x. Patches 7 and 8, in

Ha, "some" :-)

> particular, help make patch 9 more readable, but will be dropped in
> rebasing. We have tested the code on both Intel and AMD CPUs, launching VMs
> in a loop. So far, we have not done in-depth performance evaluation.
> Impact on starting VMs was within measurement noise.
Dave Hansen June 12, 2019, 7:55 p.m. UTC | #2
On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> This patch series proposes to introduce a region for what we call
> process-local memory into the kernel's virtual address space. 

It might be fun to cc some x86 folks on this series.  They might have
some relevant opinions. ;)

A few high-level questions:

Why go to all this trouble to hide guest state like registers if all the
guest data itself is still mapped?

Where's the context-switching code?  Did I just miss it?

We've discussed having per-cpu page tables where a given PGD is only in
use from one CPU at a time.  I *think* this scheme still works in such a
case, it just adds one more PGD entry that would have to be context-switched.
Andy Lutomirski June 12, 2019, 8:27 p.m. UTC | #3
> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. 
> 
> It might be fun to cc some x86 folks on this series.  They might have
> some relevant opinions. ;)
> 
> A few high-level questions:
> 
> Why go to all this trouble to hide guest state like registers if all the
> guest data itself is still mapped?
> 
> Where's the context-switching code?  Did I just miss it?
> 
> We've discussed having per-cpu page tables where a given PGD is only in
> use from one CPU at a time.  I *think* this scheme still works in such a
> case, it just adds one more PGD entry that would have to be context-switched.

Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
Dave Hansen June 12, 2019, 8:41 p.m. UTC | #4
On 6/12/19 1:27 PM, Andy Lutomirski wrote:
>> We've discussed having per-cpu page tables where a given PGD is
>> only in use from one CPU at a time.  I *think* this scheme still
>> works in such a case, it just adds one more PGD entry that would
>> have to be context-switched.
> Fair warning: Linus is on record as absolutely hating this idea. He
> might change his mind, but it’s an uphill battle.

Just to be clear, are you referring to the per-cpu PGDs, or to this
patch set with a per-mm kernel area?
Andy Lutomirski June 12, 2019, 8:56 p.m. UTC | #5
> On Jun 12, 2019, at 1:41 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/12/19 1:27 PM, Andy Lutomirski wrote:
>>> We've discussed having per-cpu page tables where a given PGD is
>>> only in use from one CPU at a time.  I *think* this scheme still
>>> works in such a case, it just adds one more PGD entry that would
>>> have to be context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He
>> might change his mind, but it’s an uphill battle.
> 
> Just to be clear, are you referring to the per-cpu PGDs, or to this
> patch set with a per-mm kernel area?

per-CPU PGDs
Andy Lutomirski June 13, 2019, 1:30 a.m. UTC | #6
On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>
> > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >> This patch series proposes to introduce a region for what we call
> >> process-local memory into the kernel's virtual address space.
> >
> > It might be fun to cc some x86 folks on this series.  They might have
> > some relevant opinions. ;)
> >
> > A few high-level questions:
> >
> > Why go to all this trouble to hide guest state like registers if all the
> > guest data itself is still mapped?
> >
> > Where's the context-switching code?  Did I just miss it?
> >
> > We've discussed having per-cpu page tables where a given PGD is only in
> > use from one CPU at a time.  I *think* this scheme still works in such a
> > case, it just adds one more PGD entry that would have to be context-switched.
>
> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.

I looked at the patch, and it (sensibly) has nothing to do with
per-cpu PGDs.  So it's in great shape!

Seriously, though, here are some very high-level review comments:

Please don't call it "process local", since "process" is meaningless.
Call it "mm local" or something like that.

We already have a per-mm kernel mapping: the LDT.  So please nix all
the code that adds a new VA region, etc, except to the extent that
some of it consists of valid cleanups in and of itself.  Instead,
please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
it use a more general "mm local" address range, and then reuse the
same infrastructure for other fancy things.  The code that makes it
KASLR-able should be in its very own patch that applies *after* the
code that makes it all work so that, when the KASLR part causes a
crash, we can bisect it.

+        /*
+         * Faults in process-local memory may be caused by process-local
+         * addresses leaking into other contexts.
+         * tbd: warn and handle gracefully.
+         */
+        if (unlikely(fault_in_process_local(address))) {
+                pr_err("page fault in PROCLOCAL at %lx", address);
+                force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
+        }
+

Huh?  Either it's an OOPS or you shouldn't print any special
debugging.  As it is, you're just blatantly leaking the address of the
mm-local range to malicious user programs.

Also, you should IMO consider using this mechanism for kmap_atomic().
Hi, Nadav!
Nadav Amit June 13, 2019, 1:50 a.m. UTC | #7
> On Jun 12, 2019, at 6:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>> 
>>> It might be fun to cc some x86 folks on this series.  They might have
>>> some relevant opinions. ;)
>>> 
>>> A few high-level questions:
>>> 
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>> 
>>> Where's the context-switching code?  Did I just miss it?
>>> 
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time.  I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to be context-switched.
>> 
>> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> 
> I looked at the patch, and it (sensibly) has nothing to do with
> per-cpu PGDs.  So it's in great shape!
> 
> Seriously, though, here are some very high-level review comments:
> 
> Please don't call it "process local", since "process" is meaningless.
> Call it "mm local" or something like that.
> 
> We already have a per-mm kernel mapping: the LDT.  So please nix all
> the code that adds a new VA region, etc, except to the extent that
> some of it consists of valid cleanups in and of itself.  Instead,
> please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> it use a more general "mm local" address range, and then reuse the
> same infrastructure for other fancy things.  The code that makes it
> KASLR-able should be in its very own patch that applies *after* the
> code that makes it all work so that, when the KASLR part causes a
> crash, we can bisect it.
> 
> +        /*
> +         * Faults in process-local memory may be caused by process-local
> +         * addresses leaking into other contexts.
> +         * tbd: warn and handle gracefully.
> +         */
> +        if (unlikely(fault_in_process_local(address))) {
> +                pr_err("page fault in PROCLOCAL at %lx", address);
> +                force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> +        }
> +
> 
> Huh?  Either it's an OOPS or you shouldn't print any special
> debugging.  As it is, you're just blatantly leaking the address of the
> mm-local range to malicious user programs.
> 
> Also, you should IMO consider using this mechanism for kmap_atomic().
> Hi, Nadav!

Well, some context for the “hi” would have been helpful. (Do I have a bug
and I still don’t understand it?)

Perhaps you are referring to some use-case for a similar mechanism that I mentioned
before. I did implement something similar (but not the way that you wanted)
to improve the performance of seccomp and system-calls when retpolines are
used. I set up a per-mm code area that held code that used direct calls to invoke
seccomp filters and frequently used system-calls.

My mechanism, I think, is not suitable for this use-case. I needed my
code-page to be at the same 2GB range as the kernel text/modules, which does
complicate things. Due to the same reason, it is also limited in the size of
the data/code that it can hold.
Alexander Graf June 13, 2019, 7:20 a.m. UTC | #8
On 12.06.19 20:25, Sean Christopherson wrote:
> On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
>> The Linux kernel has a global address space that is the same for any
>> kernel code. This address space becomes a liability in a world with
>> processor information leak vulnerabilities, such as L1TF. With the right
>> cache load gadget, an attacker-controlled hyperthread pair can leak
>> arbitrary data via L1TF. Disabling hyperthreading is one recommended
>> mitigation, but it comes with a large performance hit for a wide range
>> of workloads.
>>
>> An alternative mitigation is to not make certain data in the kernel
>> globally visible, but only when the kernel executes in the context of
>> the process that this data belongs to.
>>
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. Page
>> tables and mappings in that region will be exclusive to one address
>> space, instead of implicitly shared between all kernel address spaces.
>> Any data placed in that region will be out of reach of cache load
>> gadgets that execute in different address spaces. To implement
>> process-local memory, we introduce a new interface kmalloc_proclocal() /
>> kfree_proclocal() that allocates and maps pages exclusively into the
>> current kernel address space. As a first use case, we move architectural
>> state of guest CPUs in KVM out of reach of other kernel address spaces.
> Can you briefly describe what types of attacks this is intended to
> mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
> make comments based on my potentially bad assumptions.


(quickly jumping in for Marius, he's offline today)

The main purpose of this is to protect from leakage of data from one 
guest into another guest using speculation gadgets on the host.

The same mechanism can be used to prevent leakage of secrets from one 
host process into another host process though, as host processes 
potentially have access to gadgets via the syscall interface.


Alex
Alexander Graf June 13, 2019, 7:27 a.m. UTC | #9
On 12.06.19 21:55, Dave Hansen wrote:
> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space.
> It might be fun to cc some x86 folks on this series.  They might have
> some relevant opinions. ;)
>
> A few high-level questions:
>
> Why go to all this trouble to hide guest state like registers if all the
> guest data itself is still mapped?


(jumping in for Marius, he's offline today)

Glad you asked :). I hope this cover letter explains well how to achieve 
guest data not being mapped:

https://lkml.org/lkml/2019/1/31/933


> Where's the context-switching code?  Did I just miss it?


I'm not sure I understand the question. With this mechanism, the global 
linear map pages are just not present anymore, so there is no context 
switching needed. For the process local memory, the page table is 
already mm local, so we don't need to do anything special during context 
switch, no?


Alex
Alexander Graf June 13, 2019, 7:52 a.m. UTC | #10
On 13.06.19 03:30, Andy Lutomirski wrote:
> On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>> It might be fun to cc some x86 folks on this series.  They might have
>>> some relevant opinions. ;)
>>>
>>> A few high-level questions:
>>>
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>>
>>> Where's the context-switching code?  Did I just miss it?
>>>
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time.  I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to be context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> I looked at the patch, and it (sensibly) has nothing to do with
> per-cpu PGDs.  So it's in great shape!


Thanks a lot for the very timely review!


>
> Seriously, though, here are some very high-level review comments:
>
> Please don't call it "process local", since "process" is meaningless.
> Call it "mm local" or something like that.


Naming is hard, yes :). Is "mmlocal" obvious enough to most readers? I'm 
not fully convinced, but I don't find it better or worse than proclocal. 
So whatever flies with the majority works for me :).


> We already have a per-mm kernel mapping: the LDT.  So please nix all
> the code that adds a new VA region, etc, except to the extent that
> some of it consists of valid cleanups in and of itself.  Instead,
> please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> it use a more general "mm local" address range, and then reuse the
> same infrastructure for other fancy things.  The code that makes it


I don't fully understand how those two are related. Are you referring to 
the KPTI enabling code in there? That just maps the LDT at the same 
address in both kernel and user mappings, no?

So you're suggesting we use the new mm local address as LDT address 
instead and have that mapped in both kernel and user space? This patch 
set today maps "mm local" data only in kernel space, not in user space, 
as it's meant for kernel data structures.

So I'm not really seeing the path to adapt any of the LDT logic to this. 
Could you please elaborate?


> KASLR-able should be in its very own patch that applies *after* the
> code that makes it all work so that, when the KASLR part causes a
> crash, we can bisect it.


That sounds very reasonable, yes.


>
> +        /*
> +         * Faults in process-local memory may be caused by process-local
> +         * addresses leaking into other contexts.
> +         * tbd: warn and handle gracefully.
> +         */
> +        if (unlikely(fault_in_process_local(address))) {
> +                pr_err("page fault in PROCLOCAL at %lx", address);
> +                force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> +        }
> +
>
> Huh?  Either it's an OOPS or you shouldn't print any special
> debugging.  As it is, you're just blatantly leaking the address of the
> mm-local range to malicious user programs.


Yes, this is a left over bit from an idea that we discussed and rejected 
yesterday. The idea was to have a DEBUG config option that allows 
proclocal memory to leak into other processes, but print debug output so 
that it's easier to catch bugs. After discussion, I think we managed to 
convince everyone that an OOPS is the better tool to find bugs :).

Any trace of this will disappear in the next version.


>
> Also, you should IMO consider using this mechanism for kmap_atomic().


It might make sense to use it for kmap_atomic() for debug purposes, as 
it ensures that other users can no longer access the same mapping 
through the linear map. However, it does come at quite a big cost, as we 
need to shoot down the TLB of all other threads in the system. So I'm 
not sure it's of general value?
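
For reference, pulling a page out of the linear map with the
set_direct_map_*() helpers from the cover letter looks roughly like this
(a sketch; 'page' and 'addr' stand for the page and its direct-map address):

        unsigned long addr = (unsigned long)page_address(page);

        /* Unmap from the direct map, then flush stale translations on
         * every CPU -- this shootdown is the expensive part. */
        set_direct_map_invalid_noflush(page);
        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);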


Alex


> Hi, Nadav!
Liran Alon June 13, 2019, 10:54 a.m. UTC | #11
> On 12 Jun 2019, at 21:25, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote:
>> The Linux kernel has a global address space that is the same for any
>> kernel code. This address space becomes a liability in a world with
>> processor information leak vulnerabilities, such as L1TF. With the right
>> cache load gadget, an attacker-controlled hyperthread pair can leak
>> arbitrary data via L1TF. Disabling hyperthreading is one recommended
>> mitigation, but it comes with a large performance hit for a wide range
>> of workloads.
>> 
>> An alternative mitigation is to not make certain data in the kernel
>> globally visible, but only when the kernel executes in the context of
>> the process that this data belongs to.
>> 
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. Page
>> tables and mappings in that region will be exclusive to one address
>> space, instead of implicitly shared between all kernel address spaces.
>> Any data placed in that region will be out of reach of cache load
>> gadgets that execute in different address spaces. To implement
>> process-local memory, we introduce a new interface kmalloc_proclocal() /
>> kfree_proclocal() that allocates and maps pages exclusively into the
>> current kernel address space. As a first use case, we move architectural
>> state of guest CPUs in KVM out of reach of other kernel address spaces.
> 
> Can you briefly describe what types of attacks this is intended to
> mitigate?  E.g. guest-guest, userspace-guest, etc...  I don't want to
> make comments based on my potentially bad assumptions.

I think I can assist in the explanation.

Consider the following scenario:
1) Hyperthread A in a CPU core runs in guest mode and triggers a VMExit which is handled by the host kernel.
While hyperthread A runs the VMExit handler, it populates the CPU core cache / internal resources (e.g. MDS buffers)
with some sensitive data it has speculatively/architecturally accessed.
2) While hyperthread A runs in the host kernel, hyperthread B on the same CPU core runs in guest mode and uses
some CPU speculative execution vulnerability to leak the sensitive host data that hyperthread A populated
in the CPU core cache / internal resources.

Current CPU microcode mitigations (L1D/MDS flush) only handle the case of a single hyperthread and don’t
provide a mechanism to mitigate this hyperthreading attack scenario.

Assuming there is some guest-triggerable speculative load gadget in some VMExit path,
it can be used to force any data that is mapped into the kernel address space to be loaded into a CPU resource that is subject to leak.
Therefore, there have been multiple attempts to reduce the sensitive information mapped into the kernel address space
that is accessible by this VMExit path.

One attempt was XPFO, which attempts to remove from the kernel direct map any page that is currently used only by userspace.
Unfortunately, XPFO exhibits multiple performance issues that *currently* make it impractical as far as I know.

Another attempt is this patch-series, which attempts to remove the state of other guests' vCPUs from a given vCPU thread's host kernel address space.
That is very specific, but I personally have additional ideas on how this patch series can be further used.
For example, vhost-net needs to kmap the entire guest memory into kernel-space to write ingress packet data into guest memory.
Thus, a vCPU thread's kernel address space ends up mapping the entire memory of other guests, which can be leaked using the technique described above.
Therefore, it should be useful to also move this kmap() to happen in the process-local kernel virtual address region.

One could argue, however, that there is still a much bigger issue because the kernel direct map maps all physical pages that the kernel
manages (i.e. that have a struct page) into the kernel virtual address space, and all of those pages can theoretically be leaked.
However, this could be handled by complementary techniques such as booting the host kernel with “mem=X” and mapping guest memory
by directly mmap()ing the relevant portion of /dev/mem.
That is probably what AWS does, given these upstream KVM patches they have contributed:
bd53cb35a3e9 X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
e45adf665a53 KVM: Introduce a new guest mapping API
0c55671f84ff kvm, x86: Properly check whether a pfn is an MMIO or not

Also note that when using such a “mem=X” technique, you can also avoid performance penalties introduced by CPU microcode mitigations.
E.g. you can avoid doing an L1D flush on VMEntry if the VMExit handler runs only in the kernel and doesn’t context-switch, as you can
assume the kernel address space doesn’t map any host sensitive data.
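
A sketch of the userspace side of that “mem=X” approach (the offset and
size are hypothetical and must lie above the host’s mem=X limit):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        void *map_guest_memory(off_t guest_phys_offset, size_t guest_size)
        {
                /* Physical memory above the mem=X limit has no struct page
                 * and no direct-map alias in the kernel, which is the point. */
                int fd = open("/dev/mem", O_RDWR | O_SYNC);
                void *mem;

                if (fd < 0)
                        return MAP_FAILED;
                mem = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, guest_phys_offset);
                close(fd);      /* the mapping outlives the fd */
                return mem;
        }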

It’s also worth mentioning that another alternative to this “mem=X” technique that I have attempted
was to create an isolated address space that is only used when running KVM VMExit handlers.
For more information, refer to:
https://lkml.org/lkml/2019/5/13/515
(See some of my comments on that thread)

This is my 2cents on this at least.

-Liran
Dave Hansen June 13, 2019, 2:19 p.m. UTC | #12
On 6/13/19 12:27 AM, Alexander Graf wrote:
>> Where's the context-switching code?  Did I just miss it?
> 
> I'm not sure I understand the question. With this mechanism, the global
> linear map pages are just not present anymore, so there is no context
> switching needed. For the process local memory, the page table is
> already mm local, so we don't need to do anything special during context
> switch, no?

Thanks for explaining, I was just confused.

Andy reminded me when comparing it to the LDT area: since this area is
per-mm/pgd and we context switch that *obviously* there's no extra work
to do at context switch time, as long as the area is marked non-Global.
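
For illustration, that constraint boils down to something like this when
setting up the mm-local PTEs (a sketch, not code from the series):

        /* Clear _PAGE_GLOBAL so the TLB entries are flushed on CR3 switch
         * and the mapping stays private to this mm. */
        pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);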
Andy Lutomirski June 13, 2019, 4:13 p.m. UTC | #13
On Thu, Jun 13, 2019 at 12:53 AM Alexander Graf <graf@amazon.com> wrote:
>
>
> On 13.06.19 03:30, Andy Lutomirski wrote:
> > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>
> >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >>>
> >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >>>> This patch series proposes to introduce a region for what we call
> >>>> process-local memory into the kernel's virtual address space.
> >>> It might be fun to cc some x86 folks on this series.  They might have
> >>> some relevant opinions. ;)
> >>>
> >>> A few high-level questions:
> >>>
> >>> Why go to all this trouble to hide guest state like registers if all the
> >>> guest data itself is still mapped?
> >>>
> >>> Where's the context-switching code?  Did I just miss it?
> >>>
> >>> We've discussed having per-cpu page tables where a given PGD is only in
> >>> use from one CPU at a time.  I *think* this scheme still works in such a
> > > case, it just adds one more PGD entry that would have to be context-switched.
> >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> > I looked at the patch, and it (sensibly) has nothing to do with
> > per-cpu PGDs.  So it's in great shape!
>
>
> Thanks a lot for the very timely review!
>
>
> >
> > Seriously, though, here are some very high-level review comments:
> >
> > Please don't call it "process local", since "process" is meaningless.
> > Call it "mm local" or something like that.
>
>
> Naming is hard, yes :). Is "mmlocal" obvious enough to most readers? I'm
> not fully convinced, but I don't find it better or worse than proclocal.
> So whatever flies with the majority works for me :).

My objection to "proc" is that we have many concepts of "process" in
the kernel: task, mm, signal handling context, etc.  These memory
ranges are specifically local to the mm.  Admittedly, it would be very
surprising to have memory that is local to a signal handling context,
but still.

>
>
> > We already have a per-mm kernel mapping: the LDT.  So please nix all
> > the code that adds a new VA region, etc, except to the extent that
> > some of it consists of valid cleanups in and of itself.  Instead,
> > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> > it use a more general "mm local" address range, and then reuse the
> > same infrastructure for other fancy things.  The code that makes it
>
>
> I don't fully understand how those two are related. Are you referring to
> the KPTI enabling code in there? That just maps the LDT at the same
> address in both kernel and user mappings, no?

The relevance here is that, when KPTI is on, the exact same address
refers to a different LDT in different mms, so it's genuinely an
mm-local mapping.  It works just like yours: a whole top-level paging
entry is reserved for it.  What I'm suggesting is that, when you're
all done, the LDT should be more or less just one more mm-local
mapping, with two caveats.  First, the LDT needs special KPTI
handling, but that's fine.  Second, the LDT address is visible to user
code on non-UMIP systems, so you'll have to decide if that's okay.  My
suggestion is to have the LDT be the very first address in the
mm-local range and then to randomize everything else in the mm-local
range.

>
> So you're suggesting we use the new mm local address as LDT address
> instead and have that mapped in both kernel and user space? This patch
> set today maps "mm local" data only in kernel space, not in user space,
> as it's meant for kernel data structures.

Yes, exactly.

>
> So I'm not really seeing the path to adapt any of the LDT logic to this.
> Could you please elaborate?
>
>
> > KASLR-able should be in its very own patch that applies *after* the
> > code that makes it all work so that, when the KASLR part causes a
> > crash, we can bisect it.
>
>
> That sounds very reasonable, yes.
>
>
> >
> > +        /*
> > +         * Faults in process-local memory may be caused by process-local
> > +         * addresses leaking into other contexts.
> > +         * tbd: warn and handle gracefully.
> > +         */
> > +        if (unlikely(fault_in_process_local(address))) {
> > +                pr_err("page fault in PROCLOCAL at %lx", address);
> > +                force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> > +        }
> > +
> >
> > Huh?  Either it's an OOPS or you shouldn't print any special
> > debugging.  As it is, you're just blatantly leaking the address of the
> > mm-local range to malicious user programs.
>
>
> Yes, this is a left over bit from an idea that we discussed and rejected
> yesterday. The idea was to have a DEBUG config option that allows
> proclocal memory to leak into other processes, but print debug output so
> that it's easier to catch bugs. After discussion, I think we managed to
> convince everyone that an OOPS is the better tool to find bugs :).
>
> Any trace of this will disappear in the next version.
>
>
> >
> > Also, you should IMO consider using this mechanism for kmap_atomic().
>
>
> It might make sense to use it for kmap_atomic() for debug purposes, as
> it ensures that other users can no longer access the same mapping
> through the linear map. However, it does come at quite a big cost, as we
> need to shoot down the TLB of all other threads in the system. So I'm
> not sure it's of general value?

What I meant was that kmap_atomic() could use mm-local memory so that
it doesn't need to do a global shootdown.  But I guess it's not
actually used for real on 64-bit, so this is mostly moot.  Are you
planning to support mm-local on 32-bit?

--Andy
>
>
> Alex
>
>
> > Hi, Nadav!
Andy Lutomirski June 13, 2019, 4:16 p.m. UTC | #14
On Wed, Jun 12, 2019 at 6:50 PM Nadav Amit <namit@vmware.com> wrote:
>
> > On Jun 12, 2019, at 6:30 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >>>
> >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >>>> This patch series proposes to introduce a region for what we call
> >>>> process-local memory into the kernel's virtual address space.
> >>>
> >>> It might be fun to cc some x86 folks on this series.  They might have
> >>> some relevant opinions. ;)
> >>>
> >>> A few high-level questions:
> >>>
> >>> Why go to all this trouble to hide guest state like registers if all the
> >>> guest data itself is still mapped?
> >>>
> >>> Where's the context-switching code?  Did I just miss it?
> >>>
> >>> We've discussed having per-cpu page tables where a given PGD is only in
> >>> use from one CPU at a time.  I *think* this scheme still works in such a
> >>> case, it just adds one more PGD entry that would have to be context-switched.
> >>
> >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
> >
> > I looked at the patch, and it (sensibly) has nothing to do with
> > per-cpu PGDs.  So it's in great shape!
> >
> > Seriously, though, here are some very high-level review comments:
> >
> > Please don't call it "process local", since "process" is meaningless.
> > Call it "mm local" or something like that.
> >
> > We already have a per-mm kernel mapping: the LDT.  So please nix all
> > the code that adds a new VA region, etc, except to the extent that
> > some of it consists of valid cleanups in and of itself.  Instead,
> > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make
> > it use a more general "mm local" address range, and then reuse the
> > same infrastructure for other fancy things.  The code that makes it
> > KASLR-able should be in its very own patch that applies *after* the
> > code that makes it all work so that, when the KASLR part causes a
> > crash, we can bisect it.
> >
> > +        /*
> > +         * Faults in process-local memory may be caused by process-local
> > +         * addresses leaking into other contexts.
> > +         * tbd: warn and handle gracefully.
> > +         */
> > +        if (unlikely(fault_in_process_local(address))) {
> > +                pr_err("page fault in PROCLOCAL at %lx", address);
> > +                force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current);
> > +        }
> > +
> >
> > Huh?  Either it's an OOPS or you shouldn't print any special
> > debugging.  As it is, you're just blatantly leaking the address of the
> > mm-local range to malicious user programs.
> >
> > Also, you should IMO consider using this mechanism for kmap_atomic().
> > Hi, Nadav!
>
> Well, some context for the “hi” would have been helpful. (Do I have a bug
> and I still don’t understand it?)

Fair enough :)

>
> Perhaps you are referring to some use-case for a similar mechanism that I mentioned
> before. I did implement something similar (but not the way that you wanted)
> to improve the performance of seccomp and system-calls when retpolines are
> used. I set up a per-mm code area that held code that used direct calls to invoke
> seccomp filters and frequently used system-calls.
>
> My mechanism, I think, is not suitable for this use-case. I needed my
> code-page to be at the same 2GB range as the kernel text/modules, which does
> complicate things. Due to the same reason, it is also limited in the size of
> the data/code that it can hold.
>

I actually meant the opposite.  If we had a general-purpose per-mm
kernel address range, could it be used to optimize kmap_atomic() by
limiting the scope of any shootdowns?  As a rough sketch, we'd have
some kmap_atomic slots for each cpu *in the mm-local region*.  I'm not
entirely sure this is a win.
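
Roughly, I imagine the slot calculation would be something like this (all
names hypothetical; MMLOCAL_KMAP_BASE would live in the mm-local region):

        /* A mapping installed here is visible only in this mm, so tearing
         * it down only needs to shoot down CPUs currently running this mm. */
        static inline void *mmlocal_kmap_slot(int type)
        {
                int idx = type + KM_TYPE_NR * smp_processor_id();

                return (void *)(MMLOCAL_KMAP_BASE + idx * PAGE_SIZE);
        }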

--Andy
Dave Hansen June 13, 2019, 4:20 p.m. UTC | #15
On 6/13/19 9:13 AM, Andy Lutomirski wrote:
>> It might make sense to use it for kmap_atomic() for debug purposes, as
>> it ensures that other users can no longer access the same mapping
>> through the linear map. However, it does come at quite a big cost, as we
>> need to shoot down the TLB of all other threads in the system. So I'm
>> not sure it's of general value?
> What I meant was that kmap_atomic() could use mm-local memory so that
> it doesn't need to do a global shootdown.  But I guess it's not
> actually used for real on 64-bit, so this is mostly moot.  Are you
> planning to support mm-local on 32-bit?

Do we *do* global shootdowns on kmap_atomic()s on 32-bit?  I thought we
used entirely per-cpu addresses, so a stale entry from another CPU can
get loaded in the TLB speculatively but it won't ever actually get used.
 I think it goes:

kunmap_atomic() ->
__kunmap_atomic() ->
kpte_clear_flush() ->
__flush_tlb_one_kernel() ->
__flush_tlb_one_user() ->
__native_flush_tlb_one_user() ->
invlpg

The per-cpu address calculation is visible in kmap_atomic_prot():

        idx = type + KM_TYPE_NR*smp_processor_id();
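
...and then, if I'm reading kmap_atomic_prot() right, the very next line
turns that index into a per-cpu fixmap address:

        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);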
Nadav Amit June 13, 2019, 5:29 p.m. UTC | #16
> On Jun 13, 2019, at 9:20 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/13/19 9:13 AM, Andy Lutomirski wrote:
>>> It might make sense to use it for kmap_atomic() for debug purposes, as
>>> it ensures that other users can no longer access the same mapping
>>> through the linear map. However, it does come at quite a big cost, as we
>>> need to shoot down the TLB of all other threads in the system. So I'm
>>> not sure it's of general value?
>> What I meant was that kmap_atomic() could use mm-local memory so that
>> it doesn't need to do a global shootdown.  But I guess it's not
>> actually used for real on 64-bit, so this is mostly moot.  Are you
>> planning to support mm-local on 32-bit?
> 
> Do we *do* global shootdowns on kmap_atomic()s on 32-bit?  I thought we
> used entirely per-cpu addresses, so a stale entry from another CPU can
> get loaded in the TLB speculatively but it won't ever actually get used.
> I think it goes:
> 
> kunmap_atomic() ->
> __kunmap_atomic() ->
> kpte_clear_flush() ->
> __flush_tlb_one_kernel() ->
> __flush_tlb_one_user() ->
> __native_flush_tlb_one_user() ->
> invlpg
> 
> The per-cpu address calculation is visible in kmap_atomic_prot():
> 
>        idx = type + KM_TYPE_NR*smp_processor_id();

From a security point-of-view, having such an entry is still not too good,
since the mapping protection might override the default protection. This
might lead to potential W+X cases, for example, that might stay for a long
time if they are speculatively cached in the TLB and not invalidated upon
kunmap_atomic().

Having said that, I am not too excited to deal with this issue. Do people
still care about x86/32-bit? In addition, if kunmap_atomic() is used when
IRQs are disabled, sending a TLB shootdown during kunmap_atomic() can cause
a deadlock.
Dave Hansen June 13, 2019, 5:49 p.m. UTC | #17
On 6/13/19 10:29 AM, Nadav Amit wrote:
> Having said that, I am not too excited to deal with this issue. Do
> people still care about x86/32-bit?
No, not really.  It's just fun to try to give history lessons about the
good old days. ;)
Sean Christopherson June 13, 2019, 8:05 p.m. UTC | #18
On Thu, Jun 13, 2019 at 10:49:42AM -0700, Dave Hansen wrote:
> On 6/13/19 10:29 AM, Nadav Amit wrote:
> > Having said that, I am not too excited to deal with this issue. Do
> > people still care about x86/32-bit?
> No, not really.

Especially not for KVM, given the number of times 32-bit KVM has been
broken recently without anyone noticing for several kernel releases.
Thomas Gleixner June 14, 2019, 2:21 p.m. UTC | #19
On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >> This patch series proposes to introduce a region for what we call
> >> process-local memory into the kernel's virtual address space. 
> > 
> > It might be fun to cc some x86 folks on this series.  They might have
> > some relevant opinions. ;)
> > 
> > A few high-level questions:
> > 
> > Why go to all this trouble to hide guest state like registers if all the
> > guest data itself is still mapped?
> > 
> > Where's the context-switching code?  Did I just miss it?
> > 
> > We've discussed having per-cpu page tables where a given PGD is only in
> > use from one CPU at a time.  I *think* this scheme still works in such a
> > case, it just adds one more PGD entry that would have to be context-switched.
>
> Fair warning: Linus is on record as absolutely hating this idea. He might
> change his mind, but it’s an uphill battle.

Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
the entry code as we could just put the percpu space into the local PGD.

Thanks,

	tglx
Andy Lutomirski June 16, 2019, 10:18 p.m. UTC | #20
On Fri, Jun 14, 2019 at 7:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> > >> This patch series proposes to introduce a region for what we call
> > >> process-local memory into the kernel's virtual address space.
> > >
> > > It might be fun to cc some x86 folks on this series.  They might have
> > > some relevant opinions. ;)
> > >
> > > A few high-level questions:
> > >
> > > Why go to all this trouble to hide guest state like registers if all the
> > > guest data itself is still mapped?
> > >
> > > Where's the context-switching code?  Did I just miss it?
> > >
> > > We've discussed having per-cpu page tables where a given PGD is only in
> > > use from one CPU at a time.  I *think* this scheme still works in such a
> > > case, it just adds one more PGD entry that would have to be context-switched.
> >
> > Fair warning: Linus is on record as absolutely hating this idea. He might
> > change his mind, but it’s an uphill battle.
>
> Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> the entry code as we could just put the percpu space into the local PGD.
>

I have personally suggested this to Linus on a couple of occasions,
and he seemed quite skeptical.
Thomas Gleixner June 16, 2019, 10:28 p.m. UTC | #21
On Sun, 16 Jun 2019, Andy Lutomirski wrote:
> On Fri, Jun 14, 2019 at 7:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > >
> > > Fair warning: Linus is on record as absolutely hating this idea. He might
> > > change his mind, but it’s an uphill battle.
> >
> > Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> > the entry code as we could just put the percpu space into the local PGD.
> >
> 
> I have personally suggested this to Linus on a couple of occasions,
> and he seemed quite skeptical.

The only way to find out is the good old: numbers talk ....

So someone has to bite the bullet, implement it and figure out whether it's
bollocks or not. :)

Thanks,

	tglx
Alexander Graf June 17, 2019, 7:38 a.m. UTC | #22
On 14.06.19 16:21, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Andy Lutomirski wrote:
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>> It might be fun to cc some x86 folks on this series.  They might have
>>> some relevant opinions. ;)
>>>
>>> A few high-level questions:
>>>
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>>
>>> Where's the context-switching code?  Did I just miss it?
>>>
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time.  I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to be context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He might
>> change his mind, but it’s an uphill battle.
> Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> the entry code as we could just put the percpu space into the local PGD.


Would that mean that with Meltdown affected CPUs we open speculation 
attacks against the mmlocal memory from KVM user space?


Alex
Dave Hansen June 17, 2019, 3:50 p.m. UTC | #23
On 6/17/19 12:38 AM, Alexander Graf wrote:
>> Yes I know, but as a benefit we could get rid of all the GSBASE
>> horrors in
>> the entry code as we could just put the percpu space into the local PGD.
> 
> Would that mean that with Meltdown affected CPUs we open speculation
> attacks against the mmlocal memory from KVM user space?

Not necessarily.  There would likely be a _set_ of local PGDs.  We could
still have a pair of PTI PGDs just like we do now, they'd just be a local
PGD pair.
Andy Lutomirski June 17, 2019, 3:54 p.m. UTC | #24
On Mon, Jun 17, 2019 at 8:50 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 12:38 AM, Alexander Graf wrote:
> >> Yes I know, but as a benefit we could get rid of all the GSBASE
> >> horrors in
> >> the entry code as we could just put the percpu space into the local PGD.
> >
> > Would that mean that with Meltdown affected CPUs we open speculation
> > attacks against the mmlocal memory from KVM user space?
>
> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
> still have a pair of PTI PGDs just like we do now, they'd just be a local
> PGD pair.
>

Unfortunately, this would mean that we need to sync twice as many
top-level entries when we context switch.
Dave Hansen June 17, 2019, 4:03 p.m. UTC | #25
On 6/17/19 8:54 AM, Andy Lutomirski wrote:
>>> Would that mean that with Meltdown affected CPUs we open speculation
>>> attacks against the mmlocal memory from KVM user space?
>> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
>> still have a pair of PTI PGDs just like we do now, they'd just be a local
>> PGD pair.
>>
> Unfortunately, this would mean that we need to sync twice as many
> top-level entries when we context switch.

Yeah, PTI sucks. :)

For anyone following along at home, I'm going to go off into crazy
per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)

But, I was thinking we could get away with not doing this on _every_
context switch at least.  For instance, couldn't 'struct tlb_context'
have PGD pointer (or two with PTI) in addition to the TLB info?  That
way we only do the copying when we change the context.  Or does that tie
the implementation up too much with PCIDs?
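
Something like this, keeping the two fields 'struct tlb_context' already
has (the pgd members are the hypothetical addition):

        struct tlb_context {
                u64 ctx_id;
                u64 tlb_gen;
                pgd_t *pgd;             /* kernel PGD for this context */
                pgd_t *user_pgd;        /* second copy for PTI */
        };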
Andy Lutomirski June 17, 2019, 4:14 p.m. UTC | #26
On Mon, Jun 17, 2019 at 9:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 8:54 AM, Andy Lutomirski wrote:
> >>> Would that mean that with Meltdown affected CPUs we open speculation
> >>> attacks against the mmlocal memory from KVM user space?
> >> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
> >> still have a pair of PTI PGDs just like we do now, they'd just be a local
> >> PGD pair.
> >>
> > Unfortunately, this would mean that we need to sync twice as many
> > top-level entries when we context switch.
>
> Yeah, PTI sucks. :)
>
> For anyone following along at home, I'm going to go off into crazy
> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>
> But, I was thinking we could get away with not doing this on _every_
> context switch at least.  For instance, couldn't 'struct tlb_context'
> have PGD pointer (or two with PTI) in addition to the TLB info?  That
> way we only do the copying when we change the context.  Or does that tie
> the implementation up too much with PCIDs?

Hmm, that seems entirely reasonable.  I think the nasty bit would be
figuring out all the interactions with PV TLB flushing.  PV TLB
flushes already don't play so well with PCID tracking, and this will
make it worse.  We probably need to rewrite all that code regardless.
Nadav Amit June 17, 2019, 4:53 p.m. UTC | #27
> On Jun 17, 2019, at 9:14 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Jun 17, 2019 at 9:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 6/17/19 8:54 AM, Andy Lutomirski wrote:
>>>>> Would that mean that with Meltdown affected CPUs we open speculation
>>>>> attacks against the mmlocal memory from KVM user space?
>>>> Not necessarily.  There would likely be a _set_ of local PGDs.  We could
>>>> still have a pair of PTI PGDs just like we do now, they'd just be a local
>>>> PGD pair.
>>> Unfortunately, this would mean that we need to sync twice as many
>>> top-level entries when we context switch.
>> 
>> Yeah, PTI sucks. :)
>> 
>> For anyone following along at home, I'm going to go off into crazy
>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>> 
>> But, I was thinking we could get away with not doing this on _every_
>> context switch at least.  For instance, couldn't 'struct tlb_context'
>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
>> way we only do the copying when we change the context.  Or does that tie
>> the implementation up too much with PCIDs?
> 
> Hmm, that seems entirely reasonable.  I think the nasty bit would be
> figuring out all the interactions with PV TLB flushing.  PV TLB
> flushes already don't play so well with PCID tracking, and this will
> make it worse.  We probably need to rewrite all that code regardless.

How is PCID (as you implemented) related to TLB flushing of kernel (not
user) PTEs? These kernel PTEs would be global, so they would be invalidated
from all the address-spaces using INVLPG, I presume. No?

Having said that, the fact that every hypervisor implements PV-TLB
completely differently might be unwarranted.
Dave Hansen June 17, 2019, 6:07 p.m. UTC | #28
On 6/17/19 9:53 AM, Nadav Amit wrote:
>>> For anyone following along at home, I'm going to go off into crazy
>>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>>>
>>> But, I was thinking we could get away with not doing this on _every_
>>> context switch at least.  For instance, couldn't 'struct tlb_context'
>>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
>>> way we only do the copying when we change the context.  Or does that tie
>>> the implementation up too much with PCIDs?
>> Hmm, that seems entirely reasonable.  I think the nasty bit would be
>> figuring out all the interactions with PV TLB flushing.  PV TLB
>> flushes already don't play so well with PCID tracking, and this will
>> make it worse.  We probably need to rewrite all that code regardless.
> How is PCID (as you implemented) related to TLB flushing of kernel (not
> user) PTEs? These kernel PTEs would be global, so they would be invalidated
> from all the address-spaces using INVLPG, I presume. No?

The idea is that you have a per-cpu address space.  Certain kernel
virtual addresses would map to different physical addresses based on where
you are running.  Each of the physical addresses would be "owned" by a
single CPU and, by convention, would never be mapped by a PGD except on
the CPU that "owned" it.

In that case, you never really invalidate those addresses.
Konrad Rzeszutek Wilk June 17, 2019, 6:45 p.m. UTC | #29
On Mon, Jun 17, 2019 at 11:07:45AM -0700, Dave Hansen wrote:
> On 6/17/19 9:53 AM, Nadav Amit wrote:
> >>> For anyone following along at home, I'm going to go off into crazy
> >>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
> >>>
> >>> But, I was thinking we could get away with not doing this on _every_
> >>> context switch at least.  For instance, couldn't 'struct tlb_context'
> >>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
> >>> way we only do the copying when we change the context.  Or does that tie
> >>> the implementation up too much with PCIDs?
> >> Hmm, that seems entirely reasonable.  I think the nasty bit would be
> >> figuring out all the interactions with PV TLB flushing.  PV TLB
> >> flushes already don't play so well with PCID tracking, and this will
> >> make it worse.  We probably need to rewrite all that code regardless.
> > How is PCID (as you implemented) related to TLB flushing of kernel (not
> > user) PTEs? These kernel PTEs would be global, so they would be invalidated
> > from all the address-spaces using INVLPG, I presume. No?
> 
> The idea is that you have a per-cpu address space.  Certain kernel
> virtual addresses would map to different physical addresses based on where
> you are running.  Each of the physical addresses would be "owned" by a
> single CPU and, by convention, would never be mapped by a PGD except on
> the CPU that "owned" it.
> 
> In that case, you never really invalidate those addresses.

But you would need to invalidate if the process moved to another CPU, correct?
Dave Hansen June 17, 2019, 6:49 p.m. UTC | #30
On 6/17/19 11:45 AM, Konrad Rzeszutek Wilk wrote:
>> The idea is that you have a per-cpu address space.  Certain kernel
>> virtual addresses would map to different physical addresses based on where
>> you are running.  Each of the physical addresses would be "owned" by a
>> single CPU and, by convention, would never be mapped by a PGD except on
>> the CPU that "owned" it.
>>
>> In that case, you never really invalidate those addresses.
> But you would need to invalidate if the process moved to another CPU, correct?

If you have a per-cpu PGD, the rule is that you never use the PGD on
more than one CPU.  In that model, processes "take over" an existing PGD
to run on the CPU rather than having the CPU use an existing PGD that
came with the process.

But we've really hijacked the original thread at this point, which is
probably causing a ton of confusion.
Nadav Amit June 17, 2019, 6:50 p.m. UTC | #31
> On Jun 17, 2019, at 11:07 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 6/17/19 9:53 AM, Nadav Amit wrote:
>>>> For anyone following along at home, I'm going to go off into crazy
>>>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
>>>> 
>>>> But, I was thinking we could get away with not doing this on _every_
>>>> context switch at least.  For instance, couldn't 'struct tlb_context'
>>>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
>>>> way we only do the copying when we change the context.  Or does that tie
>>>> the implementation up too much with PCIDs?
>>> Hmm, that seems entirely reasonable.  I think the nasty bit would be
>>> figuring out all the interactions with PV TLB flushing.  PV TLB
>>> flushes already don't play so well with PCID tracking, and this will
>>> make it worse.  We probably need to rewrite all that code regardless.
>> How is PCID (as you implemented) related to TLB flushing of kernel (not
>> user) PTEs? These kernel PTEs would be global, so they would be invalidated
>> from all the address-spaces using INVLPG, I presume. No?
> 
> The idea is that you have a per-cpu address space.  Certain kernel
> virtual addresses would map to different physical addresses based on where
> you are running.  Each of the physical addresses would be "owned" by a
> single CPU and, by convention, would never be mapped by a PGD except on
> the CPU that "owned" it.
> 
> In that case, you never really invalidate those addresses.

I understand, but as I see it, this is not related directly to PCIDs.
Andy Lutomirski June 17, 2019, 6:53 p.m. UTC | #32
On Mon, Jun 17, 2019 at 11:44 AM Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>
> On Mon, Jun 17, 2019 at 11:07:45AM -0700, Dave Hansen wrote:
> > On 6/17/19 9:53 AM, Nadav Amit wrote:
> > >>> For anyone following along at home, I'm going to go off into crazy
> > >>> per-cpu-pgds speculation mode now...  Feel free to stop reading now. :)
> > >>>
> > >>> But, I was thinking we could get away with not doing this on _every_
> > >>> context switch at least.  For instance, couldn't 'struct tlb_context'
> > >>> have PGD pointer (or two with PTI) in addition to the TLB info?  That
> > >>> way we only do the copying when we change the context.  Or does that tie
> > >>> the implementation up too much with PCIDs?
> > >> Hmm, that seems entirely reasonable.  I think the nasty bit would be
> > >> figuring out all the interactions with PV TLB flushing.  PV TLB
> > >> flushes already don't play so well with PCID tracking, and this will
> > >> make it worse.  We probably need to rewrite all that code regardless.
> > > How is PCID (as you implemented) related to TLB flushing of kernel (not
> > > user) PTEs? These kernel PTEs would be global, so they would be invalidated
> > > from all the address-spaces using INVLPG, I presume. No?
> >
> > The idea is that you have a per-cpu address space.  Certain kernel
> > virtual addresses would map to different physical addresses based on where
> > you are running.  Each of the physical addresses would be "owned" by a
> > single CPU and, by convention, would never be mapped by a PGD except on
> > the CPU that "owned" it.
> >
> > In that case, you never really invalidate those addresses.
>
> But you would need to invalidate if the process moved to another CPU, correct?
>

There's nothing to invalidate.  It's a different CPU with a different TLB.

The big problem is that you have a choice.  Either you can have one
PGD per (mm, cpu) or you just have one or a few PGDs per CPU and you
change them every time you change processes.  Dave's idea to have one
or two per (cpu, asid) is right, though.  It means we have a decent
chance of context switching without rewriting the whole thing, and it
also means we don't need to write to the one that's currently loaded
when we switch CR3.  The latter could plausibly be important enough
that we'd want to pretend we're using PCID even if we're not.
Dave Hansen June 17, 2019, 6:55 p.m. UTC | #33
On 6/17/19 11:50 AM, Nadav Amit wrote:
>> The idea is that you have a per-cpu address space.  Certain kernel
>> virtual addresses would map to different physical addresses based on where
>> you are running.  Each of the physical addresses would be "owned" by a
>> single CPU and, by convention, would never be mapped by a PGD except on
>> the CPU that "owned" it.
>>
>> In that case, you never really invalidate those addresses.
> I understand, but as I see it, this is not related directly to PCIDs.

Yeah, the only link I was thinking of is that we can manage per-CPU PGDs
in the same way that we manage PCIDs.  Basically we can reuse a chunk of
the software concept.