Message ID | 20230329135401.105592-1-christian.koenig@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/ttm: set TTM allocated pages as reserved | expand |
On 3/29/23 15:54, Christian König wrote: > KVM tries to grab references to pages in VMAs marked with VM_PFNMAP. > This is illegal and can cause data corruption with TTM pages because > only some of them are actually reference counted. I think that you are referring to this: /* Don't set the __GFP_COMP flag for higher order allocations. * Mapping pages directly into an userspace process and calling * put_page() on a TTM allocated page is illegal. */ if (order) gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM; By "directly" I guess you mean without going through remap_pfn_range(). Based on our discussion offlist, it should be possible to remove the get_page/put_page from the path that fills in the KVM page table, but it is difficult to remove it altogether (it requires changing everything to use userspace virtual address). Indeed KVM needs to detect non-reference-counted pages because unfortunately there are cases where people want to map VM_PFNMAP pages into a guest. If it is not enough to have PageReserved set, and there is a better check, KVM can be fixed too. > Mark all pages allocated by TTM as reserved, this way KVM handles the > PFNs like they would point to MMIO space. > > This still results in a warning, but at least no other problem. What warning is it? Paolo > Signed-off-by: Christian König<christian.koenig@amd.com>
On 3/29/23 16:28, Paolo Bonzini wrote: > On 3/29/23 15:54, Christian König wrote: >> KVM tries to grab references to pages in VMAs marked with VM_PFNMAP. >> This is illegal and can cause data corruption with TTM pages because >> only some of them are actually reference counted. After some other offlist discussions, I also would like to understand what you mean by corruption. First, is it a _host_ corruption or a guest corruption/crash? A guest crash would be KVM doing exactly what it's meant to do: it detects the non-reserved, non-refcounted page and refuses to map it into the guest. On the other hand, if it is a host crash, my understanding is that an order>0 allocation leaves the tail pages with a zero reference count (and without a compound_head if, as in this case, __GFP_COMP is unset). If that's correct, more analysis is needed to understand why get_page_unless_zero() isn't rejecting the tail pages. Paolo >> Mark all pages allocated by TTM as reserved, this way KVM handles the >> PFNs like they would point to MMIO space. >> >> This still results in a warning, but at least no other problem. > > What warning is it? > > Paolo > >> Signed-off-by: Christian König<christian.koenig@amd.com> >
Am 29.03.23 um 17:08 schrieb Paolo Bonzini: > On 3/29/23 16:28, Paolo Bonzini wrote: >> On 3/29/23 15:54, Christian König wrote: >>> KVM tries to grab references to pages in VMAs marked with VM_PFNMAP. >>> This is illegal and can cause data corruption with TTM pages because >>> only some of them are actually reference counted. > > After some other offlist discussions, I also would like to understand > what you mean by corruption. I think what was meant here is that only parts of the buffers where updated, but see below. > > First, is it a _host_ corruption or a guest corruption/crash? A guest > crash would be KVM doing exactly what it's meant to do: it detects the > non-reserved, non-refcounted page and refuses to map it into the guest. Yes I think that this is what happens. The use case and mapping is indeed valid as far as I can see, but the handling that KVM does here is really problematic. When the PFN points to an IO address it just works because that isn't backed by struct pages. And when the PFN points to the first page of a higher order allocation it also works, only when the PFN points to a following page kvm_try_get_pfn() fails and causes the problems. What needs to happen instead is that when both hva_to_pfn_fast() and hva_to_pfn_slow() fails you don't try to convert the PFN into a page and so also don't get a reference to the page. This somehow needs to be communicated to the callers of hva_to_pfn() so that kvm_release_pfn() knows what to do. Regards, Christian. > > On the other hand, if it is a host crash, my understanding is that an > order>0 allocation leaves the tail pages with a zero reference count > (and without a compound_head if, as in this case, __GFP_COMP is > unset). If that's correct, more analysis is needed to understand why > get_page_unless_zero() isn't rejecting the tail pages. > > Paolo > > >>> Mark all pages allocated by TTM as reserved, this way KVM handles the >>> PFNs like they would point to MMIO space. >>> >>> This still results in a warning, but at least no other problem. >> >> What warning is it? >> >> Paolo >> >>> Signed-off-by: Christian König<christian.koenig@amd.com> >> >
On 3/29/23 17:29, Christian König wrote: >> First, is it a _host_ corruption or a guest corruption/crash? A guest >> crash would be KVM doing exactly what it's meant to do: it detects the >> non-reserved, non-refcounted page and refuses to map it into the guest. > > Yes I think that this is what happens. Ok I was worried all the time that this was host corruption/crash; which obviously would have been much worse. > The use case and mapping is indeed valid as far as I can see, but > the handling that KVM does here is really problematic. > > What needs to happen instead is that when both hva_to_pfn_fast() and > hva_to_pfn_slow() fails you don't try to convert the PFN into a page and > so also don't get a reference to the page. > > This somehow needs to be communicated to the callers of hva_to_pfn() so > that kvm_release_pfn() knows what to do. There's a bit more complication here: 1) in the guest page fault path we can avoid taking the reference altogether 2) in other MMU-notifier-protected paths, we can also avoid taking the reference but we also must stop using kmap() in virt/kvm/pfncache.c. 3) other uses of kmap() must switch to MMU-notifier protection. If the DRM people are okay with SetPageReserved() as a temporary hack, we can change or remove the WARN in kvm_is_zone_device_page(), since that is what you are referring to in the commit message. Paolo
Am 29.03.23 um 17:51 schrieb Paolo Bonzini: > On 3/29/23 17:29, Christian König wrote: >>> First, is it a _host_ corruption or a guest corruption/crash? A >>> guest crash would be KVM doing exactly what it's meant to do: it >>> detects the non-reserved, non-refcounted page and refuses to map it >>> into the guest. >> >> Yes I think that this is what happens. > > Ok I was worried all the time that this was host corruption/crash; > which obviously would have been much worse. > >> The use case and mapping is indeed valid as far as I can see, but >> the handling that KVM does here is really problematic. >> >> What needs to happen instead is that when both hva_to_pfn_fast() and >> hva_to_pfn_slow() fails you don't try to convert the PFN into a page >> and so also don't get a reference to the page. >> >> This somehow needs to be communicated to the callers of hva_to_pfn() >> so that kvm_release_pfn() knows what to do. > > There's a bit more complication here: > > 1) in the guest page fault path we can avoid taking the reference > altogether > > 2) in other MMU-notifier-protected paths, we can also avoid taking the > reference but we also must stop using kmap() in virt/kvm/pfncache.c. > > 3) other uses of kmap() must switch to MMU-notifier protection. I would rather suggest to return the page additionally to the pfn from hva_to_pfn() when the function was able to grab a reference to it. When the page is then not available you can't call kmap() on that either. > > If the DRM people are okay with SetPageReserved() as a temporary hack, > we can change or remove the WARN in kvm_is_zone_device_page(), since > that is what you are referring to in the commit message. Adding Daniel for additional comments on this, but essentially we have changed quite some infrastructure to make sure that everybody uses VM_PFNMAP to prevent stuff like this from happening. I would really prefer a proper solution in KVM instead. Christian. > > Paolo >
On 3/29/23 18:43, Christian König wrote: >> >> >> 3) other uses of kmap() must switch to MMU-notifier protection. > > I would rather suggest to return the page additionally to the pfn from > hva_to_pfn() when the function was able to grab a reference to it. > > When the page is then not available you can't call kmap() on that either. > >> >> If the DRM people are okay with SetPageReserved() as a temporary hack, >> we can change or remove the WARN in kvm_is_zone_device_page(), since >> that is what you are referring to in the commit message. > > Adding Daniel for additional comments on this, but essentially we have > changed quite some infrastructure to make sure that everybody uses > VM_PFNMAP to prevent stuff like this from happening. > > I would really prefer a proper solution in KVM instead. Hmm... Now that I think about it that would be https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ Time to resurrect that work. Paolo
+David On Wed, Mar 29, 2023, Paolo Bonzini wrote: > On 3/29/23 18:43, Christian K�nig wrote: > > > > > > > > > 3) other uses of kmap() must switch to MMU-notifier protection. > > > > I would rather suggest to return the page additionally to the pfn from > > hva_to_pfn() when the function was able to grab a reference to it. > > > > When the page is then not available you can't call kmap() on that either. > > > > > > > > If the DRM people are okay with SetPageReserved() as a temporary > > > hack, we can change or remove the WARN in kvm_is_zone_device_page(), > > > since that is what you are referring to in the commit message. > > > > Adding Daniel for additional comments on this, but essentially we have > > changed quite some infrastructure to make sure that everybody uses > > VM_PFNMAP to prevent stuff like this from happening. > > > > I would really prefer a proper solution in KVM instead. > > Hmm... Now that I think about it that would be > > https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ > > Time to resurrect that work. Ya. I had previously asked David to first eliminated the usage that isn't protected by mmu_notifiers, but after seeing the resulting complexity, I had a change of heart[2]. Can you weigh in on the latter thread, specifically my proposal of using a module param to let the admin opt-in to "unsafe" kmap usage. [1] https://lore.kernel.org/kvm/Ydhq5aHW+JFo15UF@google.com [2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/
Am 29.03.23 um 19:22 schrieb Sean Christopherson: > +David > > On Wed, Mar 29, 2023, Paolo Bonzini wrote: >> On 3/29/23 18:43, Christian K�nig wrote: >>>> >>>> 3) other uses of kmap() must switch to MMU-notifier protection. >>> I would rather suggest to return the page additionally to the pfn from >>> hva_to_pfn() when the function was able to grab a reference to it. >>> >>> When the page is then not available you can't call kmap() on that either. >>> >>>> If the DRM people are okay with SetPageReserved() as a temporary >>>> hack, we can change or remove the WARN in kvm_is_zone_device_page(), >>>> since that is what you are referring to in the commit message. >>> Adding Daniel for additional comments on this, but essentially we have >>> changed quite some infrastructure to make sure that everybody uses >>> VM_PFNMAP to prevent stuff like this from happening. >>> >>> I would really prefer a proper solution in KVM instead. >> Hmm... Now that I think about it that would be >> >> https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ >> >> Time to resurrect that work. > Ya. I had previously asked David to first eliminated the usage that isn't > protected by mmu_notifiers, but after seeing the resulting complexity, I had a > change of heart[2]. Can you weigh in on the latter thread, specifically my > proposal of using a module param to let the admin opt-in to "unsafe" kmap usage. I don't think that this is something an admin should be able to decide. x86 system are rather grateful, but if you kmap() pages accessed by GPUs on ARM the system might just reboot spontaneously. Robin Murphy <robin.murphy@arm.com> can fill you in on the hw details if needed. Christian. > > [1] https://lore.kernel.org/kvm/Ydhq5aHW+JFo15UF@google.com > [2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/
On Wed, Mar 29, 2023, Christian K�nig wrote: > Am 29.03.23 um 19:22 schrieb Sean Christopherson: > > +David > > > > On Wed, Mar 29, 2023, Paolo Bonzini wrote: > > > On 3/29/23 18:43, Christian K�nig wrote: > > > > > > > > > > 3) other uses of kmap() must switch to MMU-notifier protection. > > > > I would rather suggest to return the page additionally to the pfn from > > > > hva_to_pfn() when the function was able to grab a reference to it. > > > > > > > > When the page is then not available you can't call kmap() on that either. > > > > > > > > > If the DRM people are okay with SetPageReserved() as a temporary > > > > > hack, we can change or remove the WARN in kvm_is_zone_device_page(), > > > > > since that is what you are referring to in the commit message. > > > > Adding Daniel for additional comments on this, but essentially we have > > > > changed quite some infrastructure to make sure that everybody uses > > > > VM_PFNMAP to prevent stuff like this from happening. > > > > > > > > I would really prefer a proper solution in KVM instead. > > > Hmm... Now that I think about it that would be > > > > > > https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ > > > > > > Time to resurrect that work. > > Ya. I had previously asked David to first eliminated the usage that isn't > > protected by mmu_notifiers, but after seeing the resulting complexity, I had a > > change of heart[2]. Can you weigh in on the latter thread, specifically my > > proposal of using a module param to let the admin opt-in to "unsafe" kmap usage. > > I don't think that this is something an admin should be able to decide. Why not? Assuming the admin has CAP_SYS_ADMIN, they can reboot the host in a myriad of ways. The idea with the KVM module param is to allow curated setups where the userspace VMM is trusted to a large extent, e.g. AWS' Nitro, to opt-in to the unsafe behavior. I.e. by enabling the flag, the admin is acknowledging that bad things can happen if untrusted/compromised userspace gets ahold of /dev/kvm. > x86 system are rather grateful, but if you kmap() pages accessed by GPUs on > ARM the system might just reboot spontaneously. FWIW, the dangers of an unsafe kmap() are arguably far worse than spontaneous reboots, e.g. there's potential for use-after-free and possibly even write access to arbitrary memory.
Am 29.03.23 um 19:39 schrieb Sean Christopherson: > On Wed, Mar 29, 2023, Christian K�nig wrote: >> Am 29.03.23 um 19:22 schrieb Sean Christopherson: >>> +David >>> >>> On Wed, Mar 29, 2023, Paolo Bonzini wrote: >>>> On 3/29/23 18:43, Christian K�nig wrote: >>>>>> 3) other uses of kmap() must switch to MMU-notifier protection. >>>>> I would rather suggest to return the page additionally to the pfn from >>>>> hva_to_pfn() when the function was able to grab a reference to it. >>>>> >>>>> When the page is then not available you can't call kmap() on that either. >>>>> >>>>>> If the DRM people are okay with SetPageReserved() as a temporary >>>>>> hack, we can change or remove the WARN in kvm_is_zone_device_page(), >>>>>> since that is what you are referring to in the commit message. >>>>> Adding Daniel for additional comments on this, but essentially we have >>>>> changed quite some infrastructure to make sure that everybody uses >>>>> VM_PFNMAP to prevent stuff like this from happening. >>>>> >>>>> I would really prefer a proper solution in KVM instead. >>>> Hmm... Now that I think about it that would be >>>> >>>> https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ >>>> >>>> Time to resurrect that work. >>> Ya. I had previously asked David to first eliminated the usage that isn't >>> protected by mmu_notifiers, but after seeing the resulting complexity, I had a >>> change of heart[2]. Can you weigh in on the latter thread, specifically my >>> proposal of using a module param to let the admin opt-in to "unsafe" kmap usage. >> I don't think that this is something an admin should be able to decide. > Why not? Assuming the admin has CAP_SYS_ADMIN, they can reboot the host in a > myriad of ways. The idea with the KVM module param is to allow curated setups > where the userspace VMM is trusted to a large extent, e.g. AWS' Nitro, to opt-in > to the unsafe behavior. I.e. by enabling the flag, the admin is acknowledging > that bad things can happen if untrusted/compromised userspace gets ahold of > /dev/kvm. Well exactly that's the point, you don't need untrusted/compromised userspace the system could just go spontaneously into nirvana during normal operation. This would result in very very hard to debug problems since the issues only happen rather rarely. On the other hand why do you need the kmap() in the first place? Regards, Christian. > >> x86 system are rather grateful, but if you kmap() pages accessed by GPUs on >> ARM the system might just reboot spontaneously. > FWIW, the dangers of an unsafe kmap() are arguably far worse than spontaneous > reboots, e.g. there's potential for use-after-free and possibly even write access > to arbitrary memory.
On Wed, Mar 29, 2023, Christian K�nig wrote: > Am 29.03.23 um 19:39 schrieb Sean Christopherson: > > On Wed, Mar 29, 2023, Christian K�nig wrote: > > > Am 29.03.23 um 19:22 schrieb Sean Christopherson: > > > > +David > > > > > > > > On Wed, Mar 29, 2023, Paolo Bonzini wrote: > > > > > On 3/29/23 18:43, Christian K�nig wrote: > > > > > > > 3) other uses of kmap() must switch to MMU-notifier protection. > > > > > > I would rather suggest to return the page additionally to the pfn from > > > > > > hva_to_pfn() when the function was able to grab a reference to it. > > > > > > > > > > > > When the page is then not available you can't call kmap() on that either. > > > > > > > > > > > > > If the DRM people are okay with SetPageReserved() as a temporary > > > > > > > hack, we can change or remove the WARN in kvm_is_zone_device_page(), > > > > > > > since that is what you are referring to in the commit message. > > > > > > Adding Daniel for additional comments on this, but essentially we have > > > > > > changed quite some infrastructure to make sure that everybody uses > > > > > > VM_PFNMAP to prevent stuff like this from happening. > > > > > > > > > > > > I would really prefer a proper solution in KVM instead. > > > > > Hmm... Now that I think about it that would be > > > > > > > > > > https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ > > > > > > > > > > Time to resurrect that work. > > > > Ya. I had previously asked David to first eliminated the usage that isn't > > > > protected by mmu_notifiers, but after seeing the resulting complexity, I had a > > > > change of heart[2]. Can you weigh in on the latter thread, specifically my > > > > proposal of using a module param to let the admin opt-in to "unsafe" kmap usage. > > > I don't think that this is something an admin should be able to decide. > > Why not? Assuming the admin has CAP_SYS_ADMIN, they can reboot the host in a > > myriad of ways. The idea with the KVM module param is to allow curated setups > > where the userspace VMM is trusted to a large extent, e.g. AWS' Nitro, to opt-in > > to the unsafe behavior. I.e. by enabling the flag, the admin is acknowledging > > that bad things can happen if untrusted/compromised userspace gets ahold of > > /dev/kvm. > > Well exactly that's the point, you don't need untrusted/compromised > userspace the system could just go spontaneously into nirvana during normal > operation. The proposal is that the module param is off by default, and the expectation is that it would be turned on only by very specific setups. I would not object to making it even more difficult to enable, burying it behind a Kconfig, but IMO that is unnecessary. > This would result in very very hard to debug problems since the issues only > happen rather rarely. > > On the other hand why do you need the kmap() in the first place? The Nitro setup manages guest memory in userspace, and hides that memory from the kernel, e.g. to avoid struct page overhead.. To enable KVM to access guest memory in select flows, e.g. for paravirtualization, without taking on too much complexity, KVM supports kmap() of that memory. For some uses, e.g. nested virtualization, switching to uaccess might be feasible, but the paravirt stuff, especially Xen support, would likely be insanely complex to do without kmap().
On Thu, Mar 30, 2023 at 2:22 AM Sean Christopherson <seanjc@google.com> wrote: > > +David > > On Wed, Mar 29, 2023, Paolo Bonzini wrote: > > On 3/29/23 18:43, Christian K�nig wrote: > > > > > > > > > > > > 3) other uses of kmap() must switch to MMU-notifier protection. > > > > > > I would rather suggest to return the page additionally to the pfn from > > > hva_to_pfn() when the function was able to grab a reference to it. > > > > > > When the page is then not available you can't call kmap() on that either. > > > > > > > > > > > If the DRM people are okay with SetPageReserved() as a temporary > > > > hack, we can change or remove the WARN in kvm_is_zone_device_page(), > > > > since that is what you are referring to in the commit message. > > > > > > Adding Daniel for additional comments on this, but essentially we have > > > changed quite some infrastructure to make sure that everybody uses > > > VM_PFNMAP to prevent stuff like this from happening. > > > > > > I would really prefer a proper solution in KVM instead. > > > > Hmm... Now that I think about it that would be > > > > https://lore.kernel.org/kvm/Yc4H+dGfK83BaGpC@google.com/t/ > > > > Time to resurrect that work. > > Ya. I had previously asked David to first eliminated the usage that isn't > protected by mmu_notifiers, but after seeing the resulting complexity, I had a > change of heart[2]. Can you weigh in on the latter thread, specifically my > proposal of using a module param to let the admin opt-in to "unsafe" kmap usage. > > [1] https://lore.kernel.org/kvm/Ydhq5aHW+JFo15UF@google.com > [2] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com/ I just finished rebasing my patch series and sent out v6: https://lore.kernel.org/all/20230330085802.2414466-1-stevensd@google.com/ -David
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index aa116a7bbae3..c665a8bf366a 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -82,6 +82,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags, unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS; struct ttm_pool_dma *dma; struct page *p; + unsigned int i; void *vaddr; /* Don't set the __GFP_COMP flag for higher order allocations. @@ -94,38 +95,43 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags, if (!pool->use_dma_alloc) { p = alloc_pages(gfp_flags, order); - if (p) - p->private = order; - return p; - } + if (!p) + return NULL; - dma = kmalloc(sizeof(*dma), GFP_KERNEL); - if (!dma) - return NULL; + p->private = order; + } else { - if (order) - attr |= DMA_ATTR_NO_WARN; + dma = kmalloc(sizeof(*dma), GFP_KERNEL); + if (!dma) + return NULL; - vaddr = dma_alloc_attrs(pool->dev, (1ULL << order) * PAGE_SIZE, - &dma->addr, gfp_flags, attr); - if (!vaddr) - goto error_free; + if (order) + attr |= DMA_ATTR_NO_WARN; - /* TODO: This is an illegal abuse of the DMA API, but we need to rework - * TTM page fault handling and extend the DMA API to clean this up. - */ - if (is_vmalloc_addr(vaddr)) - p = vmalloc_to_page(vaddr); - else - p = virt_to_page(vaddr); + vaddr = dma_alloc_attrs(pool->dev, (1ULL << order) * PAGE_SIZE, + &dma->addr, gfp_flags, attr); + if (!vaddr) { + kfree(dma); + return NULL; + } - dma->vaddr = (unsigned long)vaddr | order; - p->private = (unsigned long)dma; - return p; + /* TODO: This is an illegal abuse of the DMA API, but we need + * to rework TTM page fault handling and extend the DMA API to + * clean this up. + */ + if (is_vmalloc_addr(vaddr)) + p = vmalloc_to_page(vaddr); + else + p = virt_to_page(vaddr); -error_free: - kfree(dma); - return NULL; + dma->vaddr = (unsigned long)vaddr | order; + p->private = (unsigned long)dma; + } + + for (i = 0; i < (1 << order); ++i) + SetPageReserved(&p[i]); + + return p; } /* Reset the caching and pages of size 1 << order */ @@ -134,6 +140,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching, { unsigned long attr = DMA_ATTR_FORCE_CONTIGUOUS; struct ttm_pool_dma *dma; + unsigned int i; void *vaddr; #ifdef CONFIG_X86 @@ -144,6 +151,9 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching, set_pages_wb(p, 1 << order); #endif + for (i = 0; i < (1 << order); ++i) + ClearPageReserved(&p[i]); + if (!pool || !pool->use_dma_alloc) { __free_pages(p, order); return;
KVM tries to grab references to pages in VMAs marked with VM_PFNMAP. This is illegal and can cause data corruption with TTM pages because only some of them are actually reference counted. Mark all pages allocated by TTM as reserved, this way KVM handles the PFNs like they would point to MMIO space. This still results in a warning, but at least no other problem. Signed-off-by: Christian König <christian.koenig@amd.com> --- drivers/gpu/drm/ttm/ttm_pool.c | 62 ++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 26 deletions(-)