Message ID | 20221129191237.31447-1-mizhang@google.com (mailing list archive) |
---|---|
Series | Deprecate BUG() in pte_list_remove() in shadow mmu |
Gentle ping on this one?

On Tue, Nov 29, 2022 at 11:12 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing a
> physical machine. There are several reasons and motivations to do so:
>
> An MMU bug is difficult to discover due to various race conditions and
> corner cases, and is thus extremely hard to debug. The situation gets much
> worse when it triggers the shutdown of a host. A host machine crash may
> eliminate everything, including the potential clues for debugging.
>
> From a cloud computing service perspective, BUG() or BUG_ON() is probably
> no longer appropriate, as host reliability is the top priority. Crashing
> the physical machine is almost never a good option, as it kills innocent
> VMs and causes a service outage of a larger scope. Even worse, if an
> attacker can reliably trigger this code by diverting the control flow or
> corrupting memory, then this becomes a VM-of-death attack. This is a huge
> attack vector for cloud providers, as the death of one single host machine
> is not the end of the story. Without manual intervention, a failed cloud
> job may be dispatched to other hosts and continue crashing hosts until all
> of them are dead.
>
> For the above reasons, we propose replacing BUG() in pte_list_remove()
> with KVM_BUG() to crash just the VM itself.
>
> v3 -> v4:
> - update code to integrate messages into KVM_BUG() [seanjc].
> - update commit message [seanjc].
>
> v2 -> v3:
> - plumb @kvm all the way to pte_list_remove() [seanjc, pbonzini]
> - https://lore.kernel.org/lkml/20221128002043.1555543-1-mizhang@google.com/
>
> v1 -> v2:
> - compile test the code.
> - fill KVM_BUG() with kvm_get_running_vcpu()->kvm
> - https://lore.kernel.org/all/20221124003505.424617-1-mizhang@google.com/
>
> rfc v1:
> - https://lore.kernel.org/all/20221123231206.274392-1-mizhang@google.com/
>
>
> Mingwei Zhang (2):
>   KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove()
>   KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu
>
>  arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++----------------
>  1 file changed, 17 insertions(+), 16 deletions(-)
>
> --
> 2.38.1.584.g0f3c55d4c2-goog
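For readers unfamiliar with the KVM_BUG() family, the conversion being proposed looks roughly like the sketch below. This is an illustrative sketch only, not the literal patch in this series: it assumes @kvm has been plumbed into pte_list_remove() (patch 1) and shows the intent of patch 2, where the "impossible" rmap states stop panicking the host and instead mark only the offending VM as bugged.

/*
 * Sketch only -- not the actual patch.  The existing error paths do
 * pr_err(...) followed by BUG(), taking down the whole host; the idea
 * is to convert them to KVM_BUG(), which WARNs once and tears down
 * just the offending VM via kvm_vm_bugged().
 */
static void pte_list_remove(struct kvm *kvm, u64 *spte,
			    struct kvm_rmap_head *rmap_head)
{
	if (!rmap_head->val) {
		/* Was: pr_err(...); BUG();  -- a host panic */
		KVM_BUG(true, kvm, "no rmap for %p", spte);
		return;
	}

	/*
	 * ... otherwise walk the pte_list_desc chain and clear the
	 * matching entry; the 1->BUG and many->many error cases get the
	 * same BUG() -> KVM_BUG() treatment ...
	 */
}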
On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing a
> physical machine. There are several reasons and motivations to do so:
>
> An MMU bug is difficult to discover due to various race conditions and
> corner cases, and is thus extremely hard to debug. The situation gets much
> worse when it triggers the shutdown of a host. A host machine crash may
> eliminate everything, including the potential clues for debugging.
>
> From a cloud computing service perspective, BUG() or BUG_ON() is probably
> no longer appropriate, as host reliability is the top priority. Crashing
> the physical machine is almost never a good option, as it kills innocent
> VMs and causes a service outage of a larger scope. Even worse, if an
> attacker can reliably trigger this code by diverting the control flow or
> corrupting memory, then this becomes a VM-of-death attack. This is a huge
> attack vector for cloud providers, as the death of one single host machine
> is not the end of the story. Without manual intervention, a failed cloud
> job may be dispatched to other hosts and continue crashing hosts until all
> of them are dead.

My only concern with using KVM_BUG() is whether the machine can keep
running correctly after this warning has been hit. In other words, are
we sure the damage is contained to just this VM?

If, for example, the KVM_BUG() was triggered by a use-after-free, then
there might be corrupted memory floating around in the machine.

What are some instances where we've seen these BUG_ON()s get triggered?
For those instances, would it actually be safe to just kill the current
VM and keep the rest of the machine running?

>
> For the above reasons, we propose replacing BUG() in pte_list_remove()
> with KVM_BUG() to crash just the VM itself.

How did you test this series?
On Fri, Dec 09, 2022, David Matlack wrote:
> On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing a
> > physical machine. There are several reasons and motivations to do so:
> >
> > An MMU bug is difficult to discover due to various race conditions and
> > corner cases, and is thus extremely hard to debug. The situation gets much
> > worse when it triggers the shutdown of a host. A host machine crash may
> > eliminate everything, including the potential clues for debugging.
> >
> > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > no longer appropriate, as host reliability is the top priority. Crashing
> > the physical machine is almost never a good option, as it kills innocent
> > VMs and causes a service outage of a larger scope. Even worse, if an
> > attacker can reliably trigger this code by diverting the control flow or
> > corrupting memory, then this becomes a VM-of-death attack. This is a huge
> > attack vector for cloud providers, as the death of one single host machine
> > is not the end of the story. Without manual intervention, a failed cloud
> > job may be dispatched to other hosts and continue crashing hosts until all
> > of them are dead.
>
> My only concern with using KVM_BUG() is whether the machine can keep
> running correctly after this warning has been hit. In other words, are
> we sure the damage is contained to just this VM?
>
> If, for example, the KVM_BUG() was triggered by a use-after-free, then
> there might be corrupted memory floating around in the machine.

David,

Your concern is quite reasonable. But given that both rmap and spte are
pointers/data structures managed by individual VMs, i.e., none of them
are global pointers, a use-after-free is unlikely to happen across VMs.
Even if there is one, shutting down the corrupted VMs is feasible here,
since pte_list_remove() basically does the checking.

> What are some instances where we've seen these BUG_ON()s get triggered?
> For those instances, would it actually be safe to just kill the current
> VM and keep the rest of the machine running?
>
> >
> > For the above reasons, we propose replacing BUG() in pte_list_remove()
> > with KVM_BUG() to crash just the VM itself.
>
> How did you test this series?

I used a simple test case to test the series:

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..d4b993b26b96 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -701,7 +701,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, *it.sptep, it.level);
-		base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+		base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1) - 1;
 		if (it.level == fault->goal_level)
 			break;

On the testing machine, I launched an L1 VM and an L2 VM within it. The L2
triggers the above bug in the shadow MMU, and I got the following errors in
the L0 kernel dmesg, as shown below. L1 and L2 hang with high CPU usage for a
while, and after a couple of seconds the L1 VM dies properly. The machine is
still alive and subsequent VM operations are all good (launch/kill).
[ 1678.043378] ------------[ cut here ]------------
[ 1678.043381] gfn mismatch under direct page 1041bf (expected 10437e, got 1043be)
[ 1678.043386] WARNING: CPU: 4 PID: 23430 at arch/x86/kvm/mmu/mmu.c:737 kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043395] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 1678.043404] CPU: 4 PID: 23430 Comm: VCPU-7 Tainted: G S O 6.1.0-smp-DEV #5
[ 1678.043406] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 1678.043407] RIP: 0010:kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043411] Code: 0f 44 e0 4c 8b 6b 28 48 89 df 44 89 f6 e8 b7 fb ff ff 48 c7 c7 1b 5a 2f 82 4c 89 e6 4c 89 ea 48 89 c1 4d 89 f8 e8 9f 39 0c 00 <0f> 0b eb ac 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 1678.043413] RSP: 0018:ffff88811ba87918 EFLAGS: 00010246
[ 1678.043415] RAX: 1bdd851636664d00 RBX: ffff888118602e60 RCX: 0000000000000027
[ 1678.043416] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff8897e0320488
[ 1678.043417] RBP: ffff88811ba87940 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 1678.043418] R10: 00000000ffff7fff R11: 0000000000000000 R12: ffffffff822e89da
[ 1678.043419] R13: 00000000001041bf R14: 00000000000001bf R15: 00000000001043be
[ 1678.043421] FS: 00007fee198ec700(0000) GS:ffff8897e0300000(0000) knlGS:0000000000000000
[ 1678.043422] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1678.043424] CR2: 0000000000000000 CR3: 0000001857c34005 CR4: 00000000003726e0
[ 1678.043425] Call Trace:
[ 1678.043426]  <TASK>
[ 1678.043428]  __rmap_add+0x8a/0x270
[ 1678.043432]  mmu_set_spte+0x250/0x340
[ 1678.043435]  ept_fetch+0x8ad/0xc00
[ 1678.043437]  ept_page_fault+0x265/0x2f0
[ 1678.043440]  kvm_mmu_page_fault+0xfa/0x2d0
[ 1678.043443]  handle_ept_violation+0x135/0x2e0 [kvm_intel]
[ 1678.043455]  ? handle_desc+0x20/0x20 [kvm_intel]
[ 1678.043462]  __vmx_handle_exit+0x1c3/0x480 [kvm_intel]
[ 1678.043468]  vmx_handle_exit+0x12/0x40 [kvm_intel]
[ 1678.043474]  vcpu_enter_guest+0xbb3/0xf80
[ 1678.043477]  ? complete_fast_pio_in+0xcc/0x160
[ 1678.043480]  kvm_arch_vcpu_ioctl_run+0x3b0/0x770
[ 1678.043481]  kvm_vcpu_ioctl+0x52d/0x610
[ 1678.043486]  ? kvm_on_user_return+0x46/0xd0
[ 1678.043489]  __se_sys_ioctl+0x77/0xc0
[ 1678.043492]  __x64_sys_ioctl+0x1d/0x20
[ 1678.043493]  do_syscall_64+0x3d/0x80
[ 1678.043497]  ? sysvec_apic_timer_interrupt+0x49/0x90
[ 1678.043499]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1678.043501] RIP: 0033:0x7fee3ebf0347
[ 1678.043503] Code: 5d c3 cc 48 8b 05 f9 2f 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2f 07 00 f7 d8 64 89 01 48
[ 1678.043505] RSP: 002b:00007fee198e8998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1678.043507] RAX: ffffffffffffffda RBX: 0000555308e7e4d0 RCX: 00007fee3ebf0347
[ 1678.043507] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 00000000000000b0
[ 1678.043508] RBP: 00007fee198e89c0 R08: 000055530943d920 R09: 00000000000003fa
[ 1678.043509] R10: 0000555307349b00 R11: 0000000000000246 R12: 00000000000000b0
[ 1678.043510] R13: 00005574c1a1de88 R14: 00007fee198e8a27 R15: 0000000000000000
[ 1678.043511]  </TASK>
[ 1678.043512] ---[ end trace 0000000000000000 ]---
[ 5313.657064] ------------[ cut here ]------------
[ 5313.657067] no rmap for 0000000071a2f138 (many->many)
[ 5313.657071] WARNING: CPU: 43 PID: 23398 at arch/x86/kvm/mmu/mmu.c:983 pte_list_remove+0x17a/0x190
[ 5313.657080] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 5313.657088] CPU: 43 PID: 23398 Comm: kvm-nx-lpage-re Tainted: G S W O 6.1.0-smp-DEV #5
[ 5313.657090] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 5313.657092] RIP: 0010:pte_list_remove+0x17a/0x190
[ 5313.657095] Code: cf e4 01 01 48 c7 c7 4d 3c 32 82 e8 70 5e 0c 00 0f 0b e9 0a ff ff ff c6 05 d4 cf e4 01 01 48 c7 c7 9e de 33 82 e8 56 5e 0c 00 <0f> 0b 84 db 75 c8 e9 ec fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
[ 5313.657097] RSP: 0018:ffff88986d5d3c30 EFLAGS: 00010246
[ 5313.657099] RAX: 1ebf71ba511d3100 RBX: 0000000000000000 RCX: 0000000000000027
[ 5313.657101] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff88afdf3e0488
[ 5313.657102] RBP: ffff88986d5d3c40 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 5313.657104] R10: 00000000ffff7fff R11: 40000000ffff8a28 R12: 0000000000000000
[ 5313.657105] R13: ffff888118602000 R14: ffffc90020e1e000 R15: ffff88815df33030
[ 5313.657106] FS: 0000000000000000(0000) GS:ffff88afdf3c0000(0000) knlGS:0000000000000000
[ 5313.657107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5313.657109] CR2: 000017c92b50f1b8 CR3: 000000006f40a001 CR4: 00000000003726e0
[ 5313.657110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5313.657111] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5313.657112] Call Trace:
[ 5313.657113]  <TASK>
[ 5313.657114]  drop_spte+0x175/0x180
[ 5313.657117]  mmu_page_zap_pte+0xfd/0x130
[ 5313.657119]  __kvm_mmu_prepare_zap_page+0x290/0x6e0
[ 5313.657122]  ? newidle_balance+0x228/0x3b0
[ 5313.657126]  kvm_nx_huge_page_recovery_worker+0x266/0x360
[ 5313.657129]  kvm_vm_worker_thread+0x93/0x150
[ 5313.657134]  ? kvm_mmu_post_init_vm+0x40/0x40
[ 5313.657136]  ? kvm_vm_create_worker_thread+0x120/0x120
[ 5313.657139]  kthread+0x10d/0x120
[ 5313.657141]  ? kthread_blkcg+0x30/0x30
[ 5313.657142]  ret_from_fork+0x1f/0x30
[ 5313.657156]  </TASK>
[ 5313.657156] ---[ end trace 0000000000000000 ]---
On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> On Fri, Dec 09, 2022, David Matlack wrote:
> > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing a
> > > physical machine. There are several reasons and motivations to do so:
> > >
> > > An MMU bug is difficult to discover due to various race conditions and
> > > corner cases, and is thus extremely hard to debug. The situation gets much
> > > worse when it triggers the shutdown of a host. A host machine crash may
> > > eliminate everything, including the potential clues for debugging.
> > >
> > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > no longer appropriate, as host reliability is the top priority. Crashing
> > > the physical machine is almost never a good option, as it kills innocent
> > > VMs and causes a service outage of a larger scope. Even worse, if an
> > > attacker can reliably trigger this code by diverting the control flow or
> > > corrupting memory, then this becomes a VM-of-death attack. This is a huge
> > > attack vector for cloud providers, as the death of one single host machine
> > > is not the end of the story. Without manual intervention, a failed cloud
> > > job may be dispatched to other hosts and continue crashing hosts until all
> > > of them are dead.
> >
> > My only concern with using KVM_BUG() is whether the machine can keep
> > running correctly after this warning has been hit. In other words, are
> > we sure the damage is contained to just this VM?

Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
splats where the reserved bits appear to be set by stack overflow, that has a much,
much higher probability of being a symptom of data corruption.

That said, that's more of a reason to change get_mmio_spte() than it is to ignore
potential data corruption in this case. KVM arguably should kill the VM if
get_mmio_spte() fails too.

What about explicitly treating both get_mmio_spte() and this as potential data
corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
in pte_list_remove()?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2055e04b8f89..1cb69c6d186b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 			pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
 			       sptes[level], level,
 			       get_rsvd_bits(rsvd_check, sptes[level], level));
+		KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
 	}

 	return reserved;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f16c4689322b..5c4a06f66f46 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
 	unlikely(__ret);					\
 })

+#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)			\
+({								\
+	int __ret = (cond);					\
+								\
+	if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))		\
+		BUG_ON(__ret);					\
+	else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))	\
+		kvm_vm_bugged(kvm);				\
+	unlikely(__ret);					\
+})
+
 static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_PROVE_RCU

> > If, for example, the KVM_BUG() was triggered by a use-after-free, then
> > there might be corrupted memory floating around in the machine.
>
> David,
>
> Your concern is quite reasonable. But given that both rmap and spte are
> pointers/data structures managed by individual VMs, i.e., none of them
> are global pointers, a use-after-free is unlikely to happen across VMs.

Being per-VM allocations doesn't change the behavior/impact of a use-after-free.
E.g. if there is no rmap found for a SPTE, then there's a non-zero chance KVM
has previously zapped the SPTE and freed the memory the SPTE pointed at, and
thus KVM might be reading/writing memory that is now owned by something else
in the kernel.

> Even if there is one, shutting down the corrupted VMs is feasible here,
> since pte_list_remove() basically does the checking.

But the damage may already be done. And the KVM_REQ_VM_DEAD request won't be
recognized until the next vcpu_enter_guest(), e.g. it won't prevent vCPUs (or
even this vCPU) from processing more
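For context, the VM-kill path being referred to here works roughly as sketched below. This is a paraphrase of the existing KVM_BUG()/kvm_vm_bugged() machinery in include/linux/kvm_host.h and the request handling in arch/x86/kvm/x86.c around v6.1, not a verbatim copy; exact details may differ by kernel version.

/*
 * Rough paraphrase, not literal kernel source.  The key point: KVM_BUG()
 * only WARNs and *requests* that the VM be torn down; each vCPU acts on
 * KVM_REQ_VM_DEAD the next time it goes through vcpu_enter_guest(), so
 * in-flight damage is not stopped instantaneously.
 */
#define KVM_BUG(cond, kvm, fmt...)					\
({									\
	bool __ret = !!(cond);						\
									\
	if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt))			\
		kvm_vm_bugged(kvm);	/* mark VM dead, kick vCPUs */	\
	unlikely(__ret);						\
})

static inline void kvm_vm_bugged(struct kvm *kvm)
{
	kvm->vm_bugged = true;
	kvm_vm_dead(kvm);	/* sets KVM_REQ_VM_DEAD on every vCPU */
}

/* In vcpu_enter_guest()'s request processing, roughly: */
	if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
		return -EIO;	/* exit to userspace; the VM cannot run again */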
On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > On Fri, Dec 09, 2022, David Matlack wrote:
> > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing a
> > > > physical machine. There are several reasons and motivations to do so:
> > > >
> > > > An MMU bug is difficult to discover due to various race conditions and
> > > > corner cases, and is thus extremely hard to debug. The situation gets much
> > > > worse when it triggers the shutdown of a host. A host machine crash may
> > > > eliminate everything, including the potential clues for debugging.
> > > >
> > > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > > no longer appropriate, as host reliability is the top priority. Crashing
> > > > the physical machine is almost never a good option, as it kills innocent
> > > > VMs and causes a service outage of a larger scope. Even worse, if an
> > > > attacker can reliably trigger this code by diverting the control flow or
> > > > corrupting memory, then this becomes a VM-of-death attack. This is a huge
> > > > attack vector for cloud providers, as the death of one single host machine
> > > > is not the end of the story. Without manual intervention, a failed cloud
> > > > job may be dispatched to other hosts and continue crashing hosts until all
> > > > of them are dead.
> > >
> > > My only concern with using KVM_BUG() is whether the machine can keep
> > > running correctly after this warning has been hit. In other words, are
> > > we sure the damage is contained to just this VM?
>
> Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> splats where the reserved bits appear to be set by stack overflow, that has a much,
> much higher probability of being a symptom of data corruption.
>
> That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> potential data corruption in this case. KVM arguably should kill the VM if
> get_mmio_spte() fails too.
>
> What about explicitly treating both get_mmio_spte() and this as potential data
> corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
> in pte_list_remove()?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2055e04b8f89..1cb69c6d186b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>  			pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
>  			       sptes[level], level,
>  			       get_rsvd_bits(rsvd_check, sptes[level], level));
> +		KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
>  	}
>
>  	return reserved;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f16c4689322b..5c4a06f66f46 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
>  	unlikely(__ret);					\
>  })
>
> +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)			\
> +({								\
> +	int __ret = (cond);					\
> +								\
> +	if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))		\
> +		BUG_ON(__ret);					\
> +	else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))	\
> +		kvm_vm_bugged(kvm);				\
> +	unlikely(__ret);					\
> +})
> +
>  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
>  {
>  #ifdef CONFIG_PROVE_RCU

That sounds reasonable to me.
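As a rough illustration of "use the DATA_CORRUPTION variant in pte_list_remove()", the usage could look like the sketch below. This is hypothetical, not the actual patch; it assumes the KVM_BUG_ON_DATA_CORRUPTION() macro proposed above lands as-is and that @kvm has already been plumbed into pte_list_remove() by patch 1 of the series.

	/* Sketch only: report the impossible rmap state, then either
	 * BUG() (CONFIG_BUG_ON_DATA_CORRUPTION=y) or mark the VM as
	 * bugged and bail out, depending on the kernel config. */
	if (!rmap_head->val) {
		pr_err("%s: no rmap for %p\n", __func__, spte);
		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
		return;
	}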
On Mon, Dec 12, 2022, David Matlack wrote:
> On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > > On Fri, Dec 09, 2022, David Matlack wrote:
> > > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > > Deprecate BUG() in pte_list_remove() in the shadow MMU to avoid crashing a
> > > > > physical machine. There are several reasons and motivations to do so:
> > > > >
> > > > > An MMU bug is difficult to discover due to various race conditions and
> > > > > corner cases, and is thus extremely hard to debug. The situation gets much
> > > > > worse when it triggers the shutdown of a host. A host machine crash may
> > > > > eliminate everything, including the potential clues for debugging.
> > > > >
> > > > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > > > no longer appropriate, as host reliability is the top priority. Crashing
> > > > > the physical machine is almost never a good option, as it kills innocent
> > > > > VMs and causes a service outage of a larger scope. Even worse, if an
> > > > > attacker can reliably trigger this code by diverting the control flow or
> > > > > corrupting memory, then this becomes a VM-of-death attack. This is a huge
> > > > > attack vector for cloud providers, as the death of one single host machine
> > > > > is not the end of the story. Without manual intervention, a failed cloud
> > > > > job may be dispatched to other hosts and continue crashing hosts until all
> > > > > of them are dead.
> > > >
> > > > My only concern with using KVM_BUG() is whether the machine can keep
> > > > running correctly after this warning has been hit. In other words, are
> > > > we sure the damage is contained to just this VM?
> >
> > Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> > when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> > splats where the reserved bits appear to be set by stack overflow, that has a much,
> > much higher probability of being a symptom of data corruption.
> >
> > That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> > potential data corruption in this case. KVM arguably should kill the VM if
> > get_mmio_spte() fails too.
> >
> > What about explicitly treating both get_mmio_spte() and this as potential data
> > corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
> > in pte_list_remove()?
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2055e04b8f89..1cb69c6d186b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
> >  			pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> >  			       sptes[level], level,
> >  			       get_rsvd_bits(rsvd_check, sptes[level], level));
> > +		KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
> >  	}
> >
> >  	return reserved;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f16c4689322b..5c4a06f66f46 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> >  	unlikely(__ret);					\
> >  })
> >
> > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)			\
> > +({								\
> > +	int __ret = (cond);					\
> > +								\
> > +	if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))		\
> > +		BUG_ON(__ret);					\
> > +	else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))	\
> > +		kvm_vm_bugged(kvm);				\
> > +	unlikely(__ret);					\
> > +})
> > +
> >  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> >  {
> >  #ifdef CONFIG_PROVE_RCU
>
> That sounds reasonable to me.

Actually, I disagree after thinking about it for a while. Since Google turns
on CONFIG_BUG_ON_DATA_CORRUPTION by default, this KVM_BUG_ON_DATA_CORRUPTION()
is literally a BUG_ON(), so there is no point. The purpose of the series is to
save the host from crashing. If we make the above change to get_mmio_spte() as
well, we are introducing more BUG()s, i.e., more chances to crash the host
machine. So I cannot go along with this idea.

In fact, there could be many reasons why rmaps and SPTEs are messed up. Many
times, as illustrated by my example, they are not data corruption. They are
just errors due to code refactoring or some miscalculation in certain tricky
corner cases, e.g., an off-by-one. So blindly treating them as data corruption
is overkill. Compare this with the list traversal bug, which clearly indicates
data corruption.

So, alternatively, I think it would be reasonable to shut down the KVM
instance if we see things messed up in MMIO SPTEs as well, but please do not
crash the host for that.
On Tue, Dec 13, 2022, Mingwei Zhang wrote:
> On Mon, Dec 12, 2022, David Matlack wrote:
> > > What about explicitly treating both get_mmio_spte() and this as potential data
> > > corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
> > > in pte_list_remove()?

Random aside, this would be better:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 99c40617d325..d9c46f2a6748 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4090,7 +4090,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 		return RET_PF_EMULATE;

 	reserved = get_mmio_spte(vcpu, addr, &spte);
-	if (WARN_ON(reserved))
+	if (KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm))
 		return -EINVAL;

 	if (is_mmio_spte(spte)) {

> > >  	}
> > >
> > >  	return reserved;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index f16c4689322b..5c4a06f66f46 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> > >  	unlikely(__ret);					\
> > >  })
> > >
> > > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)			\
> > > +({								\
> > > +	int __ret = (cond);					\
> > > +								\
> > > +	if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))		\
> > > +		BUG_ON(__ret);					\
> > > +	else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))	\
> > > +		kvm_vm_bugged(kvm);				\
> > > +	unlikely(__ret);					\
> > > +})
> > > +
> > >  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> > >  {
> > >  #ifdef CONFIG_PROVE_RCU
> >
> > That sounds reasonable to me.
>
> Actually, I disagree after thinking about it for a while. Since Google turns
> on CONFIG_BUG_ON_DATA_CORRUPTION by default, this KVM_BUG_ON_DATA_CORRUPTION()
> is literally a BUG_ON(), so there is no point. The purpose of the series is
> to save the host from crashing.

Sure, but Google is not the only consumer of KVM. E.g. I'm guessing many distros
use CONFIG_BUG_ON_DATA_CORRUPTION=n.

Any user that doesn't have the infrastructure to capture crash dumps is likely
better served overall by the WARN-and-continue behavior.

> If we make the above change to get_mmio_spte() as well, we are introducing
> more BUG()s, i.e., more chances to crash the host machine. So I cannot go
> along with this idea.

For get_mmio_spte() specifically, I am quite confident that in production kernels,
the benefits of a BUG() far outweigh the risk of unnecessarily panicking the host.
There have been recent-ish bugs that escaped into kernel releases, but the first
one (wrong root level) escaped only because it affected an esoteric, rarely used
configuration (shadow paging with a 32-bit host kernel), and the second one was a
purely theoretical bug fix that was also limited to 32-bit kernels.

  39b4d43e6003 ("KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE")
  2aa078932ff6 ("KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()")

> In fact, there could be many reasons why rmaps and SPTEs are messed up. Many
> times, as illustrated by my example, they are not data corruption. They are
> just errors due to code refactoring or some miscalculation in certain tricky
> corner cases, e.g., an off-by-one. So blindly treating them as data
> corruption is overkill.

For SPTEs in prod, not really. Something has gone really wrong if KVM messes up
a SPTE to the point where _hardware_ unexpectedly observes a reserved bit in a
production build. We've had bugs where we've botched the software sanity checks,
e.g. see commit 6c6ab524cfae ("KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT"),
but except for rarely used setups (see above), unexpected reserved SPTE faults in
production pretty much won't happen unless there's memory corruption or a CPU error.

For RMAPs, I agree it's less clear cut, but at the same time the status quo is to
always BUG(), so from an upstream perspective allowing the admin to avoid the
BUG() is an unqualified improvement.

> Compare this with the list traversal bug, which clearly indicates data
> corruption.

That depends on what is considered "data corruption". In this case,
CONFIG_BUG_ON_DATA_CORRUPTION is about kernel memory structures being corrupted;
it's not referring to generic corruption of user data.

  Select this option if the kernel should BUG when it encounters
  data corruption in kernel memory structures when they get checked
  for validity.

And IMO that definition fits KVM's behavior perfectly. get_mmio_spte() detects
corrupted SPTEs, pte_list_remove() detects corrupted RMAP lists.

If "data corruption" is interpreted to be corruption of user data, then the list
behavior is wildly overstepping. E.g. if a subsystem has a bug where an object is
added to two different lists, list_add() will yell about the issue, but because
all writes to the list_head are aborted with CONFIG_DEBUG_LIST=y, it's entirely
possible for the bug to be benign or limited to the subsystem.

A somewhat contrived example would be if a (complex) allocator and its caller
both add the object to the same list; the second list_add() is problematic and
would trigger BUG() with CONFIG_BUG_ON_DATA_CORRUPTION=y, but with
CONFIG_DEBUG_LIST=y the bug would be completely benign.

> So, alternatively, I think it would be reasonable to shut down the KVM
> instance if we see things messed up in MMIO SPTEs as well, but please do
> not crash the host for that.

Again, this doesn't unconditionally bring down the host. Google is _choosing_ to
BUG() on _potential_ data corruption. I'm open to not tying KVM's behavior to
CONFIG_BUG_ON_DATA_CORRUPTION if there's a good argument for doing so, but if we
(Google) plan on turning on the BUG() behavior for KVM then I don't see the point.

In other words, we should discuss internally whether or not we want/need to push
for a separate control, but from an upstream perspective I think the proposed
behavior is entirely reasonable.
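For reference, the list-debug behavior discussed above works roughly like the sketch below. This is a simplified paraphrase of the sanity check that CONFIG_DEBUG_LIST performs on list_add() (__list_add_valid() in lib/list_debug.c), not the literal kernel source; the function name here is made up for illustration.

/*
 * Simplified paraphrase, not literal kernel code.  With
 * CONFIG_BUG_ON_DATA_CORRUPTION=y the check escalates to BUG() and
 * panics the host; without it, the kernel WARNs, skips the corrupting
 * write, and keeps running -- which is why a double list_add() can end
 * up being completely benign.
 */
static bool list_add_is_valid(struct list_head *new, struct list_head *prev,
			      struct list_head *next)
{
	if (next->prev != prev || prev->next != next ||
	    new == prev || new == next) {
		if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))
			BUG();			/* host panic */
		WARN(1, "list_add corruption\n");
		return false;			/* abort the list write */
	}
	return true;
}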