Message ID: 20190612170834.14855-1-mhillenb@amazon.de (mailing list archive)
Series: Process-local memory allocations for hiding KVM secrets
On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote: > The Linux kernel has a global address space that is the same for any > kernel code. This address space becomes a liability in a world with > processor information leak vulnerabilities, such as L1TF. With the right > cache load gadget, an attacker-controlled hyperthread pair can leak > arbitrary data via L1TF. Disabling hyperthreading is one recommended > mitigation, but it comes with a large performance hit for a wide range > of workloads. > > An alternative mitigation is to not make certain data in the kernel > globally visible, but only when the kernel executes in the context of > the process where this data belongs to. > > This patch series proposes to introduce a region for what we call > process-local memory into the kernel's virtual address space. Page > tables and mappings in that region will be exclusive to one address > space, instead of implicitly shared between all kernel address spaces. > Any data placed in that region will be out of reach of cache load > gadgets that execute in different address spaces. To implement > process-local memory, we introduce a new interface kmalloc_proclocal() / > kfree_proclocal() that allocates and maps pages exclusively into the > current kernel address space. As a first use case, we move architectural > state of guest CPUs in KVM out of reach of other kernel address spaces. Can you briefly describe what types of attacks this is intended to mitigate? E.g. guest-guest, userspace-guest, etc... I don't want to make comments based on my potentially bad assumptions. > The patch set is a prototype for x86-64 that we have developed on top of > kernel 4.20.17 (with cherry-picked commit d253ca0c3865 "x86/mm/cpa: Add > set_direct_map_*() functions"). I am aware that the integration with KVM > will see some changes while rebasing to 5.x. Ha, "some" :-) > Patches 7 and 8, in particular, help make patch 9 more readable, but will be dropped in > rebasing. > We have tested the code on both Intel and AMDs, launching VMs > in a loop. So far, we have not done in-depth performance evaluation. > Impact on starting VMs was within measurement noise.
On 6/12/19 10:08 AM, Marius Hillenbrand wrote: > This patch series proposes to introduce a region for what we call > process-local memory into the kernel's virtual address space. It might be fun to cc some x86 folks on this series. They might have some relevant opinions. ;) A few high-level questions: Why go to all this trouble to hide guest state like registers if all the guest data itself is still mapped? Where's the context-switching code? Did I just miss it? We've discussed having per-cpu page tables where a given PGD is only in use from one CPU at a time. I *think* this scheme still works in such a case, it just adds one more PGD entry that would have to context-switched.
> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote: >> This patch series proposes to introduce a region for what we call >> process-local memory into the kernel's virtual address space. > > It might be fun to cc some x86 folks on this series. They might have > some relevant opinions. ;) > > A few high-level questions: > > Why go to all this trouble to hide guest state like registers if all the > guest data itself is still mapped? > > Where's the context-switching code? Did I just miss it? > > We've discussed having per-cpu page tables where a given PGD is only in > use from one CPU at a time. I *think* this scheme still works in such a > case, it just adds one more PGD entry that would have to context-switched. Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle.
On 6/12/19 1:27 PM, Andy Lutomirski wrote: >> We've discussed having per-cpu page tables where a given PGD is >> only in use from one CPU at a time. I *think* this scheme still >> works in such a case, it just adds one more PGD entry that would >> have to context-switched. > Fair warning: Linus is on record as absolutely hating this idea. He > might change his mind, but it’s an uphill battle. Just to be clear, are you referring to the per-cpu PGDs, or to this patch set with a per-mm kernel area?
> On Jun 12, 2019, at 1:41 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 6/12/19 1:27 PM, Andy Lutomirski wrote: >>> We've discussed having per-cpu page tables where a given PGD is >>> only in use from one CPU at a time. I *think* this scheme still >>> works in such a case, it just adds one more PGD entry that would >>> have to context-switched. >> Fair warning: Linus is on record as absolutely hating this idea. He >> might change his mind, but it’s an uphill battle. > > Just to be clear, are you referring to the per-cpu PGDs, or to this > patch set with a per-mm kernel area? per-CPU PGDs
On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > > >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote: > >> This patch series proposes to introduce a region for what we call > >> process-local memory into the kernel's virtual address space. > > > > It might be fun to cc some x86 folks on this series. They might have > > some relevant opinions. ;) > > > > A few high-level questions: > > > > Why go to all this trouble to hide guest state like registers if all the > > guest data itself is still mapped? > > > > Where's the context-switching code? Did I just miss it? > > > > We've discussed having per-cpu page tables where a given PGD is only in > > use from one CPU at a time. I *think* this scheme still works in such a > > case, it just adds one more PGD entry that would have to context-switched. > > Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle. I looked at the patch, and it (sensibly) has nothing to do with per-cpu PGDs. So it's in great shape! Seriously, though, here are some very high-level review comments: Please don't call it "process local", since "process" is meaningless. Call it "mm local" or something like that. We already have a per-mm kernel mapping: the LDT. So please nix all the code that adds a new VA region, etc, except to the extent that some of it consists of valid cleanups in and of itself. Instead, please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make it use a more general "mm local" address range, and then reuse the same infrastructure for other fancy things. The code that makes it KASLR-able should be in its very own patch that applies *after* the code that makes it all work so that, when the KASLR part causes a crash, we can bisect it. + /* + * Faults in process-local memory may be caused by process-local + * addresses leaking into other contexts. 
+ * tbd: warn and handle gracefully. + */ + if (unlikely(fault_in_process_local(address))) { + pr_err("page fault in PROCLOCAL at %lx", address); + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current); + } + Huh? Either it's an OOPS or you shouldn't print any special debugging. As it is, you're just blatantly leaking the address of the mm-local range to malicious user programs. Also, you should IMO consider using this mechanism for kmap_atomic(). Hi, Nadav!
> On Jun 12, 2019, at 6:30 PM, Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote: >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote: >>> >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote: >>>> This patch series proposes to introduce a region for what we call >>>> process-local memory into the kernel's virtual address space. >>> >>> It might be fun to cc some x86 folks on this series. They might have >>> some relevant opinions. ;) >>> >>> A few high-level questions: >>> >>> Why go to all this trouble to hide guest state like registers if all the >>> guest data itself is still mapped? >>> >>> Where's the context-switching code? Did I just miss it? >>> >>> We've discussed having per-cpu page tables where a given PGD is only in >>> use from one CPU at a time. I *think* this scheme still works in such a >>> case, it just adds one more PGD entry that would have to context-switched. >> >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle. > > I looked at the patch, and it (sensibly) has nothing to do with > per-cpu PGDs. So it's in great shape! > > Seriously, though, here are some very high-level review comments: > > Please don't call it "process local", since "process" is meaningless. > Call it "mm local" or something like that. > > We already have a per-mm kernel mapping: the LDT. So please nix all > the code that adds a new VA region, etc, except to the extent that > some of it consists of valid cleanups in and of itself. Instead, > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make > it use a more general "mm local" address range, and then reuse the > same infrastructure for other fancy things. The code that makes it > KASLR-able should be in its very own patch that applies *after* the > code that makes it all work so that, when the KASLR part causes a > crash, we can bisect it. 
> > + /* > + * Faults in process-local memory may be caused by process-local > + * addresses leaking into other contexts. > + * tbd: warn and handle gracefully. > + */ > + if (unlikely(fault_in_process_local(address))) { > + pr_err("page fault in PROCLOCAL at %lx", address); > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current); > + } > + > > Huh? Either it's an OOPS or you shouldn't print any special > debugging. As it is, you're just blatantly leaking the address of the > mm-local range to malicious user programs. > > Also, you should IMO consider using this mechanism for kmap_atomic(). > Hi, Nadav! Well, some context for the “hi” would have been helpful. (Do I have a bug and I still don’t understand it?) Perhaps you regard some use-case for a similar mechanism that I mentioned before. I did implement something similar (but not the way that you wanted) to improve the performance of seccomp and system-calls when retpolines are used. I set per-mm code area that held code that used direct calls to invoke seccomp filters and frequently used system-calls. My mechanism, I think, is more not suitable for this use-case. I needed my code-page to be at the same 2GB range as the kernel text/modules, which does complicate things. Due to the same reason, it is also limited in the size of the data/code that it can hold.
On 12.06.19 20:25, Sean Christopherson wrote: > On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote: >> The Linux kernel has a global address space that is the same for any >> kernel code. This address space becomes a liability in a world with >> processor information leak vulnerabilities, such as L1TF. With the right >> cache load gadget, an attacker-controlled hyperthread pair can leak >> arbitrary data via L1TF. Disabling hyperthreading is one recommended >> mitigation, but it comes with a large performance hit for a wide range >> of workloads. >> >> An alternative mitigation is to not make certain data in the kernel >> globally visible, but only when the kernel executes in the context of >> the process where this data belongs to. >> >> This patch series proposes to introduce a region for what we call >> process-local memory into the kernel's virtual address space. Page >> tables and mappings in that region will be exclusive to one address >> space, instead of implicitly shared between all kernel address spaces. >> Any data placed in that region will be out of reach of cache load >> gadgets that execute in different address spaces. To implement >> process-local memory, we introduce a new interface kmalloc_proclocal() / >> kfree_proclocal() that allocates and maps pages exclusively into the >> current kernel address space. As a first use case, we move architectural >> state of guest CPUs in KVM out of reach of other kernel address spaces. > Can you briefly describe what types of attacks this is intended to > mitigate? E.g. guest-guest, userspace-guest, etc... I don't want to > make comments based on my potentially bad assumptions. (quickly jumping in for Marius, he's offline today) The main purpose of this is to protect from leakage of data from one guest into another guest using speculation gadgets on the host. 
The same mechanism can be used to prevent leakage of secrets from one host process into another host process though, as host processes potentially have access to gadgets via the syscall interface. Alex
On 12.06.19 21:55, Dave Hansen wrote: > On 6/12/19 10:08 AM, Marius Hillenbrand wrote: >> This patch series proposes to introduce a region for what we call >> process-local memory into the kernel's virtual address space. > It might be fun to cc some x86 folks on this series. They might have > some relevant opinions. ;) > > A few high-level questions: > > Why go to all this trouble to hide guest state like registers if all the > guest data itself is still mapped? (jumping in for Marius, he's offline today) Glad you asked :). I hope this cover letter explains well how to achieve guest data not being mapped: https://lkml.org/lkml/2019/1/31/933 > Where's the context-switching code? Did I just miss it? I'm not sure I understand the question. With this mechanism, the global linear map pages are just not present anymore, so there is no context switching needed. For the process local memory, the page table is already mm local, so we don't need to do anything special during context switch, no? Alex
On 13.06.19 03:30, Andy Lutomirski wrote: > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote: >> >> >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote: >>> >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote: >>>> This patch series proposes to introduce a region for what we call >>>> process-local memory into the kernel's virtual address space. >>> It might be fun to cc some x86 folks on this series. They might have >>> some relevant opinions. ;) >>> >>> A few high-level questions: >>> >>> Why go to all this trouble to hide guest state like registers if all the >>> guest data itself is still mapped? >>> >>> Where's the context-switching code? Did I just miss it? >>> >>> We've discussed having per-cpu page tables where a given PGD is only in >>> use from one CPU at a time. I *think* this scheme still works in such a >>> case, it just adds one more PGD entry that would have to context-switched. >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle. > I looked at the patch, and it (sensibly) has nothing to do with > per-cpu PGDs. So it's in great shape! Thanks a lot for the very timely review! > > Seriously, though, here are some very high-level review comments: > > Please don't call it "process local", since "process" is meaningless. > Call it "mm local" or something like that. Naming is hard, yes :). Is "mmlocal" obvious enough to most readers? I'm not fully convinced, but I don't find it better or worse than proclocal. So whatever flies with the majority works for me :). > We already have a per-mm kernel mapping: the LDT. So please nix all > the code that adds a new VA region, etc, except to the extent that > some of it consists of valid cleanups in and of itself. 
Instead, > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make > it use a more general "mm local" address range, and then reuse the > same infrastructure for other fancy things. The code that makes it I don't fully understand how those two are related. Are you referring to the KPTI enabling code in there? That just maps the LDT at the same address in both kernel and user mappings, no? So you're suggesting we use the new mm local address as LDT address instead and have that mapped in both kernel and user space? This patch set today maps "mm local" data only in kernel space, not in user space, as it's meant for kernel data structures. So I'm not really seeing the path to adapt any of the LDT logic to this. Could you please elaborate? > KASLR-able should be in its very own patch that applies *after* the > code that makes it all work so that, when the KASLR part causes a > crash, we can bisect it. That sounds very reasonable, yes. > > + /* > + * Faults in process-local memory may be caused by process-local > + * addresses leaking into other contexts. > + * tbd: warn and handle gracefully. > + */ > + if (unlikely(fault_in_process_local(address))) { > + pr_err("page fault in PROCLOCAL at %lx", address); > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current); > + } > + > > Huh? Either it's an OOPS or you shouldn't print any special > debugging. As it is, you're just blatantly leaking the address of the > mm-local range to malicious user programs. Yes, this is a left over bit from an idea that we discussed and rejected yesterday. The idea was to have a DEBUG config option that allows proclocal memory to leak into other processes, but print debug output so that it's easier to catch bugs. After discussion, I think we managed to convince everyone that an OOPS is the better tool to find bugs :). Any trace of this will disappear in the next version. > > Also, you should IMO consider using this mechanism for kmap_atomic(). 
It might make sense to use it for kmap_atomic() for debug purposes, as it ensures that other users can no longer access the same mapping through the linear map. However, it does come at quite a big cost, as we need to shoot down the TLB of all other threads in the system. So I'm not sure it's of general value? Alex > Hi, Nadav!
> On 12 Jun 2019, at 21:25, Sean Christopherson <sean.j.christopherson@intel.com> wrote: > > On Wed, Jun 12, 2019 at 07:08:24PM +0200, Marius Hillenbrand wrote: >> The Linux kernel has a global address space that is the same for any >> kernel code. This address space becomes a liability in a world with >> processor information leak vulnerabilities, such as L1TF. With the right >> cache load gadget, an attacker-controlled hyperthread pair can leak >> arbitrary data via L1TF. Disabling hyperthreading is one recommended >> mitigation, but it comes with a large performance hit for a wide range >> of workloads. >> >> An alternative mitigation is to not make certain data in the kernel >> globally visible, but only when the kernel executes in the context of >> the process where this data belongs to. >> >> This patch series proposes to introduce a region for what we call >> process-local memory into the kernel's virtual address space. Page >> tables and mappings in that region will be exclusive to one address >> space, instead of implicitly shared between all kernel address spaces. >> Any data placed in that region will be out of reach of cache load >> gadgets that execute in different address spaces. To implement >> process-local memory, we introduce a new interface kmalloc_proclocal() / >> kfree_proclocal() that allocates and maps pages exclusively into the >> current kernel address space. As a first use case, we move architectural >> state of guest CPUs in KVM out of reach of other kernel address spaces. > > Can you briefly describe what types of attacks this is intended to > mitigate? E.g. guest-guest, userspace-guest, etc... I don't want to > make comments based on my potentially bad assumptions. I think I can assist in the explanation. Consider the following scenario: 1) Hyperthread A in CPU core runs in guest and triggers a VMExit which is handled by host kernel. While hyperthread A runs VMExit handler, it populates CPU core cache / internal-resources (e.g. 
MDS buffers) with some sensitive data it has speculatively/architecturally accessed. 2) While hyperthread A runs in the host kernel, hyperthread B on the same CPU core runs in guest mode and uses some CPU speculative-execution vulnerability to leak the sensitive host data that hyperthread A populated into the CPU core cache / internal resources. Current CPU microcode mitigations (L1D/MDS flush) only handle the case of a single hyperthread and don’t provide a mechanism to mitigate this hyperthreading attack scenario. Assuming there is some guest-triggerable speculative load gadget in some VMExit path, it can be used to force any data that is mapped into the kernel address space to be loaded into a CPU resource that is subject to leak. Therefore, there have been multiple attempts to reduce the amount of sensitive information mapped into the kernel address space that is accessible from this VMExit path. One attempt was XPFO, which removes from the kernel direct map any page that is currently used only by userspace. Unfortunately, XPFO exhibits multiple performance issues that *currently* make it impractical, as far as I know. Another attempt is this patch series, which removes the state of other guests' vCPUs from a vCPU thread's host kernel address space. That is very specific, but I personally have additional ideas on how this patch series could be further used. For example, vhost-net needs to kmap() entire guest memory into kernel space to write ingress packet data into guest memory. Thus, a vCPU thread's kernel address space currently maps the entire memory of other guests, which can be leaked using the technique described above. Therefore, it should be useful to also move this kmap() into the process-local kernel virtual address region. One could argue, however, that there is still a much bigger issue because of the kernel direct map, which maps all physical pages that the kernel manages (i.e., that have a struct page) into the kernel virtual address space. All of those pages can theoretically be leaked. 
However, this could be handled by complementary techniques such as booting the host kernel with “mem=X” and mapping guest memory by directly mmap'ing the relevant portion of /dev/mem. That is probably what AWS does, given these upstream KVM patches they have contributed: bd53cb35a3e9 X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs e45adf665a53 KVM: Introduce a new guest mapping API 0c55671f84ff kvm, x86: Properly check whether a pfn is an MMIO or not. Also note that when using such a “mem=X” technique, you can also avoid the performance penalties introduced by CPU microcode mitigations. E.g., you can avoid doing an L1D flush on VMEntry if the VMExit handler runs only in the kernel and didn’t context-switch, as you can assume the kernel address space doesn’t map any sensitive host data. It’s also worth mentioning that another alternative to this “mem=X” technique that I have attempted was to create an isolated address space that is only used when running KVM VMExit handlers. For more information, refer to: https://lkml.org/lkml/2019/5/13/515 (see some of my comments on that thread). This is my 2 cents on this, at least. -Liran
On 6/13/19 12:27 AM, Alexander Graf wrote: >> Where's the context-switching code? Did I just miss it? > > I'm not sure I understand the question. With this mechanism, the global > linear map pages are just not present anymore, so there is no context > switching needed. For the process local memory, the page table is > already mm local, so we don't need to do anything special during context > switch, no? Thanks for explaining, I was just confused. Andy reminded me when comparing it to the LDT area: since this area is per-mm/pgd and we context switch that *obviously* there's no extra work to do at context switch time, as long as the area is marked non-Global.
On Thu, Jun 13, 2019 at 12:53 AM Alexander Graf <graf@amazon.com> wrote: > > > On 13.06.19 03:30, Andy Lutomirski wrote: > > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote: > >> > >> > >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >>> > >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote: > >>>> This patch series proposes to introduce a region for what we call > >>>> process-local memory into the kernel's virtual address space. > >>> It might be fun to cc some x86 folks on this series. They might have > >>> some relevant opinions. ;) > >>> > >>> A few high-level questions: > >>> > >>> Why go to all this trouble to hide guest state like registers if all the > >>> guest data itself is still mapped? > >>> > >>> Where's the context-switching code? Did I just miss it? > >>> > >>> We've discussed having per-cpu page tables where a given PGD is only in > >>> use from one CPU at a time. I *think* this scheme still works in such a > >>> case, it just adds one more PGD entry that would have to context-switched. > >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle. > > I looked at the patch, and it (sensibly) has nothing to do with > > per-cpu PGDs. So it's in great shape! > > > Thanks a lot for the very timely review! > > > > > > Seriously, though, here are some very high-level review comments: > > > > Please don't call it "process local", since "process" is meaningless. > > Call it "mm local" or something like that. > > > Naming is hard, yes :). Is "mmlocal" obvious enough to most readers? I'm > not fully convinced, but I don't find it better or worse than proclocal. > So whatever flies with the majority works for me :). My objection to "proc" is that we have many concepts of "process" in the kernel: task, mm, signal handling context, etc. These memory ranges are specifically local to the mm. 
Admittedly, it would be very surprising to have memory that is local to a signal handling context, but still. > > > > We already have a per-mm kernel mapping: the LDT. So please nix all > > the code that adds a new VA region, etc, except to the extent that > > some of it consists of valid cleanups in and of itself. Instead, > > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make > > it use a more general "mm local" address range, and then reuse the > > same infrastructure for other fancy things. The code that makes it > > > I don't fully understand how those two are related. Are you referring to > the KPTI enabling code in there? That just maps the LDT at the same > address in both kernel and user mappings, no? The relevance here is that, when KPTI is on, the exact same address refers to a different LDT in different mms, so it's genuinely an mm-local mapping. It works just like yours: a whole top-level paging entry is reserved for it. What I'm suggesting is that, when you're all done, the LDT should be more or less just one more mm-local mapping, with two caveats. First, the LDT needs special KPTI handling, but that's fine. Second, the LDT address is visible to user code on non-UMIP systems, so you'll have to decide if that's okay. My suggestion is to have the LDT be the very first address in the mm-local range and then to randomize everything else in the mm-local range. > > So you're suggesting we use the new mm local address as LDT address > instead and have that mapped in both kernel and user space? This patch > set today maps "mm local" data only in kernel space, not in user space, > as it's meant for kernel data structures. Yes, exactly. > > So I'm not really seeing the path to adapt any of the LDT logic to this. > Could you please elaborate? > > > > KASLR-able should be in its very own patch that applies *after* the > > code that makes it all work so that, when the KASLR part causes a > > crash, we can bisect it. 
> > > That sounds very reasonable, yes. > > > > > > + /* > > + * Faults in process-local memory may be caused by process-local > > + * addresses leaking into other contexts. > > + * tbd: warn and handle gracefully. > > + */ > > + if (unlikely(fault_in_process_local(address))) { > > + pr_err("page fault in PROCLOCAL at %lx", address); > > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current); > > + } > > + > > > > Huh? Either it's an OOPS or you shouldn't print any special > > debugging. As it is, you're just blatantly leaking the address of the > > mm-local range to malicious user programs. > > > Yes, this is a left over bit from an idea that we discussed and rejected > yesterday. The idea was to have a DEBUG config option that allows > proclocal memory to leak into other processes, but print debug output so > that it's easier to catch bugs. After discussion, I think we managed to > convince everyone that an OOPS is the better tool to find bugs :). > > Any trace of this will disappear in the next version. > > > > > > Also, you should IMO consider using this mechanism for kmap_atomic(). > > > It might make sense to use it for kmap_atomic() for debug purposes, as > it ensures that other users can no longer access the same mapping > through the linear map. However, it does come at quite a big cost, as we > need to shoot down the TLB of all other threads in the system. So I'm > not sure it's of general value? What I meant was that kmap_atomic() could use mm-local memory so that it doesn't need to do a global shootdown. But I guess it's not actually used for real on 64-bit, so this is mostly moot. Are you planning to support mm-local on 32-bit? --Andy > > > Alex > > > > Hi, Nadav!
On Wed, Jun 12, 2019 at 6:50 PM Nadav Amit <namit@vmware.com> wrote: > > > On Jun 12, 2019, at 6:30 PM, Andy Lutomirski <luto@kernel.org> wrote: > > > > On Wed, Jun 12, 2019 at 1:27 PM Andy Lutomirski <luto@amacapital.net> wrote: > >>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote: > >>> > >>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote: > >>>> This patch series proposes to introduce a region for what we call > >>>> process-local memory into the kernel's virtual address space. > >>> > >>> It might be fun to cc some x86 folks on this series. They might have > >>> some relevant opinions. ;) > >>> > >>> A few high-level questions: > >>> > >>> Why go to all this trouble to hide guest state like registers if all the > >>> guest data itself is still mapped? > >>> > >>> Where's the context-switching code? Did I just miss it? > >>> > >>> We've discussed having per-cpu page tables where a given PGD is only in > >>> use from one CPU at a time. I *think* this scheme still works in such a > >>> case, it just adds one more PGD entry that would have to context-switched. > >> > >> Fair warning: Linus is on record as absolutely hating this idea. He might change his mind, but it’s an uphill battle. > > > > I looked at the patch, and it (sensibly) has nothing to do with > > per-cpu PGDs. So it's in great shape! > > > > Seriously, though, here are some very high-level review comments: > > > > Please don't call it "process local", since "process" is meaningless. > > Call it "mm local" or something like that. > > > > We already have a per-mm kernel mapping: the LDT. So please nix all > > the code that adds a new VA region, etc, except to the extent that > > some of it consists of valid cleanups in and of itself. Instead, > > please refactor the LDT code (arch/x86/kernel/ldt.c, mainly) to make > > it use a more general "mm local" address range, and then reuse the > > same infrastructure for other fancy things. 
The code that makes it > > KASLR-able should be in its very own patch that applies *after* the > > code that makes it all work so that, when the KASLR part causes a > > crash, we can bisect it. > > > > + /* > > + * Faults in process-local memory may be caused by process-local > > + * addresses leaking into other contexts. > > + * tbd: warn and handle gracefully. > > + */ > > + if (unlikely(fault_in_process_local(address))) { > > + pr_err("page fault in PROCLOCAL at %lx", address); > > + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address, current); > > + } > > + > > > > Huh? Either it's an OOPS or you shouldn't print any special > > debugging. As it is, you're just blatantly leaking the address of the > > mm-local range to malicious user programs. > > > > Also, you should IMO consider using this mechanism for kmap_atomic(). > > Hi, Nadav! > > Well, some context for the “hi” would have been helpful. (Do I have a bug > and I still don’t understand it?) Fair enough :) > > Perhaps you regard some use-case for a similar mechanism that I mentioned > before. I did implement something similar (but not the way that you wanted) > to improve the performance of seccomp and system-calls when retpolines are > used. I set per-mm code area that held code that used direct calls to invoke > seccomp filters and frequently used system-calls. > > My mechanism, I think, is more not suitable for this use-case. I needed my > code-page to be at the same 2GB range as the kernel text/modules, which does > complicate things. Due to the same reason, it is also limited in the size of > the data/code that it can hold. > I actually meant the opposite. If we had a general-purpose per-mm kernel address range, could it be used to optimize kmap_atomic() by limiting the scope of any shootdowns? As a rough sketch, we'd have some kmap_atomic slots for each cpu *in the mm-local region*. I'm not entirely sure this is a win. --Andy
On 6/13/19 9:13 AM, Andy Lutomirski wrote:
>> It might make sense to use it for kmap_atomic() for debug purposes, as
>> it ensures that other users can no longer access the same mapping
>> through the linear map. However, it does come at quite a big cost, as
>> we need to shoot down the TLB of all other threads in the system. So
>> I'm not sure it's of general value?
>
> What I meant was that kmap_atomic() could use mm-local memory so that
> it doesn't need to do a global shootdown. But I guess it's not
> actually used for real on 64-bit, so this is mostly moot. Are you
> planning to support mm-local on 32-bit?

Do we *do* global shootdowns on kmap_atomic()s on 32-bit? I thought we
used entirely per-cpu addresses, so a stale entry from another CPU can
get loaded in the TLB speculatively but it won't ever actually get used.
I think it goes:

	kunmap_atomic() ->
	__kunmap_atomic() ->
	kpte_clear_flush() ->
	__flush_tlb_one_kernel() ->
	__flush_tlb_one_user() ->
	__native_flush_tlb_one_user() ->
	invlpg

The per-cpu address calculation is visible in kmap_atomic_prot():

	idx = type + KM_TYPE_NR*smp_processor_id();
> On Jun 13, 2019, at 9:20 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/13/19 9:13 AM, Andy Lutomirski wrote:
>>> It might make sense to use it for kmap_atomic() for debug purposes, as
>>> it ensures that other users can no longer access the same mapping
>>> through the linear map. However, it does come at quite a big cost, as
>>> we need to shoot down the TLB of all other threads in the system. So
>>> I'm not sure it's of general value?
>>
>> What I meant was that kmap_atomic() could use mm-local memory so that
>> it doesn't need to do a global shootdown. But I guess it's not
>> actually used for real on 64-bit, so this is mostly moot. Are you
>> planning to support mm-local on 32-bit?
>
> Do we *do* global shootdowns on kmap_atomic()s on 32-bit? I thought we
> used entirely per-cpu addresses, so a stale entry from another CPU can
> get loaded in the TLB speculatively but it won't ever actually get used.
> I think it goes:
>
> 	kunmap_atomic() ->
> 	__kunmap_atomic() ->
> 	kpte_clear_flush() ->
> 	__flush_tlb_one_kernel() ->
> 	__flush_tlb_one_user() ->
> 	__native_flush_tlb_one_user() ->
> 	invlpg
>
> The per-cpu address calculation is visible in kmap_atomic_prot():
>
> 	idx = type + KM_TYPE_NR*smp_processor_id();

From a security point-of-view, having such an entry is still not too
good, since the mapping protection might override the default
protection. This might lead to potential W+X cases, for example, that
might stay for a long time if they are speculatively cached in the TLB
and not invalidated upon kunmap_atomic().

Having said that, I am not too excited to deal with this issue. Do
people still care about x86/32-bit?

In addition, if kunmap_atomic() is used when IRQs are disabled, sending
a TLB shootdown during kunmap_atomic() can cause a deadlock.
On 6/13/19 10:29 AM, Nadav Amit wrote:
> Having said that, I am not too excited to deal with this issue. Do
> people still care about x86/32-bit?

No, not really. It's just fun to try to give history lessons about the
good old days. ;)
On Thu, Jun 13, 2019 at 10:49:42AM -0700, Dave Hansen wrote:
> On 6/13/19 10:29 AM, Nadav Amit wrote:
> > Having said that, I am not too excited to deal with this issue. Do
> > people still care about x86/32-bit?
>
> No, not really.

Especially not for KVM, given the number of times 32-bit KVM has been
broken recently without anyone noticing for several kernel releases.
On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> >> This patch series proposes to introduce a region for what we call
> >> process-local memory into the kernel's virtual address space.
> >
> > It might be fun to cc some x86 folks on this series. They might have
> > some relevant opinions. ;)
> >
> > A few high-level questions:
> >
> > Why go to all this trouble to hide guest state like registers if all the
> > guest data itself is still mapped?
> >
> > Where's the context-switching code? Did I just miss it?
> >
> > We've discussed having per-cpu page tables where a given PGD is only in
> > use from one CPU at a time. I *think* this scheme still works in such a
> > case, it just adds one more PGD entry that would have to be
> > context-switched.
>
> Fair warning: Linus is on record as absolutely hating this idea. He might
> change his mind, but it’s an uphill battle.

Yes I know, but as a benefit we could get rid of all the GSBASE horrors
in the entry code, as we could just put the percpu space into the local
PGD.

Thanks,

	tglx
On Fri, Jun 14, 2019 at 7:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > > On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > >> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> > >> This patch series proposes to introduce a region for what we call
> > >> process-local memory into the kernel's virtual address space.
> > >
> > > It might be fun to cc some x86 folks on this series. They might have
> > > some relevant opinions. ;)
> > >
> > > A few high-level questions:
> > >
> > > Why go to all this trouble to hide guest state like registers if all the
> > > guest data itself is still mapped?
> > >
> > > Where's the context-switching code? Did I just miss it?
> > >
> > > We've discussed having per-cpu page tables where a given PGD is only in
> > > use from one CPU at a time. I *think* this scheme still works in such a
> > > case, it just adds one more PGD entry that would have to be
> > > context-switched.
> >
> > Fair warning: Linus is on record as absolutely hating this idea. He might
> > change his mind, but it’s an uphill battle.
>
> Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> the entry code as we could just put the percpu space into the local PGD.

I have personally suggested this to Linus on a couple of occasions, and
he seemed quite skeptical.
On Sun, 16 Jun 2019, Andy Lutomirski wrote:
> On Fri, Jun 14, 2019 at 7:21 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 12 Jun 2019, Andy Lutomirski wrote:
> > >
> > > Fair warning: Linus is on record as absolutely hating this idea. He might
> > > change his mind, but it’s an uphill battle.
> >
> > Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> > the entry code as we could just put the percpu space into the local PGD.
>
> I have personally suggested this to Linus on a couple of occasions,
> and he seemed quite skeptical.

The only way to find out is the good old: numbers talk ....

So someone has to bite the bullet, implement it and figure out whether
it's bollocks or not. :)

Thanks,

	tglx
On 14.06.19 16:21, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Andy Lutomirski wrote:
>>> On Jun 12, 2019, at 12:55 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>>>
>>>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>>>> This patch series proposes to introduce a region for what we call
>>>> process-local memory into the kernel's virtual address space.
>>>
>>> It might be fun to cc some x86 folks on this series. They might have
>>> some relevant opinions. ;)
>>>
>>> A few high-level questions:
>>>
>>> Why go to all this trouble to hide guest state like registers if all the
>>> guest data itself is still mapped?
>>>
>>> Where's the context-switching code? Did I just miss it?
>>>
>>> We've discussed having per-cpu page tables where a given PGD is only in
>>> use from one CPU at a time. I *think* this scheme still works in such a
>>> case, it just adds one more PGD entry that would have to be
>>> context-switched.
>>
>> Fair warning: Linus is on record as absolutely hating this idea. He might
>> change his mind, but it’s an uphill battle.
>
> Yes I know, but as a benefit we could get rid of all the GSBASE horrors in
> the entry code as we could just put the percpu space into the local PGD.

Would that mean that with Meltdown-affected CPUs we open up speculation
attacks against the mm-local memory from KVM user space?


Alex
On 6/17/19 12:38 AM, Alexander Graf wrote:
>> Yes I know, but as a benefit we could get rid of all the GSBASE
>> horrors in the entry code as we could just put the percpu space into
>> the local PGD.
>
> Would that mean that with Meltdown affected CPUs we open speculation
> attacks against the mmlocal memory from KVM user space?

Not necessarily. There would likely be a _set_ of local PGDs. We could
still have a pair of PTI PGDs, just like we do now; they'd just be a
local PGD pair.
On Mon, Jun 17, 2019 at 8:50 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 12:38 AM, Alexander Graf wrote:
> >> Yes I know, but as a benefit we could get rid of all the GSBASE
> >> horrors in the entry code as we could just put the percpu space into
> >> the local PGD.
> >
> > Would that mean that with Meltdown affected CPUs we open speculation
> > attacks against the mmlocal memory from KVM user space?
>
> Not necessarily. There would likely be a _set_ of local PGDs. We could
> still have a pair of PTI PGDs, just like we do now; they'd just be a
> local PGD pair.

Unfortunately, this would mean that we need to sync twice as many
top-level entries when we context switch.
On 6/17/19 8:54 AM, Andy Lutomirski wrote:
>>> Would that mean that with Meltdown affected CPUs we open speculation
>>> attacks against the mmlocal memory from KVM user space?
>>
>> Not necessarily. There would likely be a _set_ of local PGDs. We could
>> still have a pair of PTI PGDs, just like we do now; they'd just be a
>> local PGD pair.
>
> Unfortunately, this would mean that we need to sync twice as many
> top-level entries when we context switch.

Yeah, PTI sucks. :)

For anyone following along at home, I'm going to go off into crazy
per-cpu-pgds speculation mode now... Feel free to stop reading now. :)

But, I was thinking we could get away with not doing this on _every_
context switch, at least. For instance, couldn't 'struct tlb_context'
have a PGD pointer (or two, with PTI) in addition to the TLB info? That
way we only do the copying when we change the context. Or does that tie
the implementation up too much with PCIDs?
On Mon, Jun 17, 2019 at 9:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 8:54 AM, Andy Lutomirski wrote:
> >>> Would that mean that with Meltdown affected CPUs we open speculation
> >>> attacks against the mmlocal memory from KVM user space?
> >>
> >> Not necessarily. There would likely be a _set_ of local PGDs. We could
> >> still have a pair of PTI PGDs, just like we do now; they'd just be a
> >> local PGD pair.
> >
> > Unfortunately, this would mean that we need to sync twice as many
> > top-level entries when we context switch.
>
> Yeah, PTI sucks. :)
>
> For anyone following along at home, I'm going to go off into crazy
> per-cpu-pgds speculation mode now... Feel free to stop reading now. :)
>
> But, I was thinking we could get away with not doing this on _every_
> context switch, at least. For instance, couldn't 'struct tlb_context'
> have a PGD pointer (or two, with PTI) in addition to the TLB info? That
> way we only do the copying when we change the context. Or does that tie
> the implementation up too much with PCIDs?

Hmm, that seems entirely reasonable. I think the nasty bit would be
figuring out all the interactions with PV TLB flushing. PV TLB flushes
already don't play so well with PCID tracking, and this will make it
worse. We probably need to rewrite all that code regardless.
> On Jun 17, 2019, at 9:14 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Mon, Jun 17, 2019 at 9:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 6/17/19 8:54 AM, Andy Lutomirski wrote:
>>>>> Would that mean that with Meltdown affected CPUs we open speculation
>>>>> attacks against the mmlocal memory from KVM user space?
>>>>
>>>> Not necessarily. There would likely be a _set_ of local PGDs. We could
>>>> still have a pair of PTI PGDs, just like we do now; they'd just be a
>>>> local PGD pair.
>>>
>>> Unfortunately, this would mean that we need to sync twice as many
>>> top-level entries when we context switch.
>>
>> Yeah, PTI sucks. :)
>>
>> For anyone following along at home, I'm going to go off into crazy
>> per-cpu-pgds speculation mode now... Feel free to stop reading now. :)
>>
>> But, I was thinking we could get away with not doing this on _every_
>> context switch, at least. For instance, couldn't 'struct tlb_context'
>> have a PGD pointer (or two, with PTI) in addition to the TLB info? That
>> way we only do the copying when we change the context. Or does that tie
>> the implementation up too much with PCIDs?
>
> Hmm, that seems entirely reasonable. I think the nasty bit would be
> figuring out all the interactions with PV TLB flushing. PV TLB flushes
> already don't play so well with PCID tracking, and this will make it
> worse. We probably need to rewrite all that code regardless.

How is PCID (as you implemented it) related to TLB flushing of kernel
(not user) PTEs? These kernel PTEs would be global, so they would be
invalidated from all the address spaces using INVLPG, I presume. No?

Having said that, the fact that every hypervisor implements PV TLB
flushing completely differently might be unwarranted.
On 6/17/19 9:53 AM, Nadav Amit wrote:
>>> For anyone following along at home, I'm going to go off into crazy
>>> per-cpu-pgds speculation mode now... Feel free to stop reading now. :)
>>>
>>> But, I was thinking we could get away with not doing this on _every_
>>> context switch, at least. For instance, couldn't 'struct tlb_context'
>>> have a PGD pointer (or two, with PTI) in addition to the TLB info? That
>>> way we only do the copying when we change the context. Or does that tie
>>> the implementation up too much with PCIDs?
>>
>> Hmm, that seems entirely reasonable. I think the nasty bit would be
>> figuring out all the interactions with PV TLB flushing. PV TLB flushes
>> already don't play so well with PCID tracking, and this will make it
>> worse. We probably need to rewrite all that code regardless.
>
> How is PCID (as you implemented it) related to TLB flushing of kernel
> (not user) PTEs? These kernel PTEs would be global, so they would be
> invalidated from all the address spaces using INVLPG, I presume. No?

The idea is that you have a per-cpu address space. Certain kernel
virtual addresses would map to different physical addresses based on
where you are running. Each of those physical addresses would be
"owned" by a single CPU and would, by convention, never be mapped by a
PGD in use on any CPU other than the one that "owns" it.

In that case, you never really invalidate those addresses.
On Mon, Jun 17, 2019 at 11:07:45AM -0700, Dave Hansen wrote:
> On 6/17/19 9:53 AM, Nadav Amit wrote:
> >>> For anyone following along at home, I'm going to go off into crazy
> >>> per-cpu-pgds speculation mode now... Feel free to stop reading now. :)
> >>>
> >>> But, I was thinking we could get away with not doing this on _every_
> >>> context switch, at least. For instance, couldn't 'struct tlb_context'
> >>> have a PGD pointer (or two, with PTI) in addition to the TLB info? That
> >>> way we only do the copying when we change the context. Or does that tie
> >>> the implementation up too much with PCIDs?
> >>
> >> Hmm, that seems entirely reasonable. I think the nasty bit would be
> >> figuring out all the interactions with PV TLB flushing. PV TLB flushes
> >> already don't play so well with PCID tracking, and this will make it
> >> worse. We probably need to rewrite all that code regardless.
> >
> > How is PCID (as you implemented it) related to TLB flushing of kernel
> > (not user) PTEs? These kernel PTEs would be global, so they would be
> > invalidated from all the address spaces using INVLPG, I presume. No?
>
> The idea is that you have a per-cpu address space. Certain kernel
> virtual addresses would map to different physical addresses based on
> where you are running. Each of those physical addresses would be
> "owned" by a single CPU and would, by convention, never be mapped by a
> PGD in use on any CPU other than the one that "owns" it.
>
> In that case, you never really invalidate those addresses.

But you would need to invalidate if the process moved to another CPU,
correct?
On 6/17/19 11:45 AM, Konrad Rzeszutek Wilk wrote:
>> The idea is that you have a per-cpu address space. Certain kernel
>> virtual addresses would map to different physical addresses based on
>> where you are running. Each of those physical addresses would be
>> "owned" by a single CPU and would, by convention, never be mapped by a
>> PGD in use on any CPU other than the one that "owns" it.
>>
>> In that case, you never really invalidate those addresses.
>
> But you would need to invalidate if the process moved to another CPU,
> correct?

If you have a per-cpu PGD, the rule is that you never use the PGD on
more than one CPU. In that model, processes "take over" an existing PGD
to run on the CPU, rather than having the CPU use an existing PGD that
came with the process.

But we've really hijacked the original thread at this point, which is
probably causing a ton of confusion.
> On Jun 17, 2019, at 11:07 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/17/19 9:53 AM, Nadav Amit wrote:
>>>> For anyone following along at home, I'm going to go off into crazy
>>>> per-cpu-pgds speculation mode now... Feel free to stop reading now. :)
>>>>
>>>> But, I was thinking we could get away with not doing this on _every_
>>>> context switch, at least. For instance, couldn't 'struct tlb_context'
>>>> have a PGD pointer (or two, with PTI) in addition to the TLB info? That
>>>> way we only do the copying when we change the context. Or does that tie
>>>> the implementation up too much with PCIDs?
>>>
>>> Hmm, that seems entirely reasonable. I think the nasty bit would be
>>> figuring out all the interactions with PV TLB flushing. PV TLB flushes
>>> already don't play so well with PCID tracking, and this will make it
>>> worse. We probably need to rewrite all that code regardless.
>>
>> How is PCID (as you implemented it) related to TLB flushing of kernel
>> (not user) PTEs? These kernel PTEs would be global, so they would be
>> invalidated from all the address spaces using INVLPG, I presume. No?
>
> The idea is that you have a per-cpu address space. Certain kernel
> virtual addresses would map to different physical addresses based on
> where you are running. Each of those physical addresses would be
> "owned" by a single CPU and would, by convention, never be mapped by a
> PGD in use on any CPU other than the one that "owns" it.
>
> In that case, you never really invalidate those addresses.

I understand, but as I see it, this is not related directly to PCIDs.
On Mon, Jun 17, 2019 at 11:44 AM Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>
> On Mon, Jun 17, 2019 at 11:07:45AM -0700, Dave Hansen wrote:
> > On 6/17/19 9:53 AM, Nadav Amit wrote:
> > >>> For anyone following along at home, I'm going to go off into crazy
> > >>> per-cpu-pgds speculation mode now... Feel free to stop reading now. :)
> > >>>
> > >>> But, I was thinking we could get away with not doing this on _every_
> > >>> context switch, at least. For instance, couldn't 'struct tlb_context'
> > >>> have a PGD pointer (or two, with PTI) in addition to the TLB info? That
> > >>> way we only do the copying when we change the context. Or does that tie
> > >>> the implementation up too much with PCIDs?
> > >>
> > >> Hmm, that seems entirely reasonable. I think the nasty bit would be
> > >> figuring out all the interactions with PV TLB flushing. PV TLB flushes
> > >> already don't play so well with PCID tracking, and this will make it
> > >> worse. We probably need to rewrite all that code regardless.
> > >
> > > How is PCID (as you implemented it) related to TLB flushing of kernel
> > > (not user) PTEs? These kernel PTEs would be global, so they would be
> > > invalidated from all the address spaces using INVLPG, I presume. No?
> >
> > The idea is that you have a per-cpu address space. Certain kernel
> > virtual addresses would map to different physical addresses based on
> > where you are running. Each of those physical addresses would be
> > "owned" by a single CPU and would, by convention, never be mapped by a
> > PGD in use on any CPU other than the one that "owns" it.
> >
> > In that case, you never really invalidate those addresses.
>
> But you would need to invalidate if the process moved to another CPU,
> correct?

There's nothing to invalidate. It's a different CPU with a different
TLB. The big problem is that you have a choice: either you have one PGD
per (mm, cpu), or you have just one or a few PGDs per CPU and change
them every time you change processes.

Dave's idea to have one or two per (cpu, asid) is right, though. It
means we have a decent chance of context switching without rewriting the
whole thing, and it also means we don't need to write to the one that's
currently loaded when we switch CR3. The latter could plausibly be
important enough that we'd want to pretend we're using PCID even if
we're not.
On 6/17/19 11:50 AM, Nadav Amit wrote:
>> The idea is that you have a per-cpu address space. Certain kernel
>> virtual addresses would map to different physical addresses based on
>> where you are running. Each of those physical addresses would be
>> "owned" by a single CPU and would, by convention, never be mapped by a
>> PGD in use on any CPU other than the one that "owns" it.
>>
>> In that case, you never really invalidate those addresses.
>
> I understand, but as I see it, this is not related directly to PCIDs.

Yeah, the only link I was thinking of is that we can manage per-CPU PGDs
in the same way that we manage PCIDs. Basically, we can reuse a chunk of
the software concept.