Message ID | 20230511145110.27707-1-yi.l.liu@intel.com (mailing list archive) |
---|---|
Series | Add Intel VT-d nested translation |
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, May 11, 2023 10:51 PM
>
> The first Intel platform supporting nested translation is Sapphire
> Rapids which, unfortunately, has a hardware errata [2] requiring special
> treatment. This errata happens when a stage-1 page table page (either
> level) is located in a stage-2 read-only region. In that case the IOMMU
> hardware may ignore the stage-2 RO permission and still set the A/D bit
> in stage-1 page table entries during page table walking.
>
> A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
> this errata to userspace. With that restriction the user should either
> disable nested translation to favor RO stage-2 mappings or ensure no
> RO stage-2 mapping to enable nested translation.
>
> Intel-iommu driver is armed with necessary checks to prevent such mix
> in patch10 of this series.
>
> Qemu currently does add RO mappings though. The vfio agent in Qemu
> simply maps all valid regions in the GPA address space which certainly
> includes RO regions e.g. vbios.
>
> In reality we don't know a usage relying on DMA reads from the BIOS
> region. Hence finding a way to allow user opt-out RO mappings in
> Qemu might be an acceptable tradeoff. But how to achieve it cleanly
> needs more discussion in Qemu community. For now we just hacked Qemu
> to test.
>

Hi, Alex,

Want to touch base on your thoughts about this errata before we
actually go to discuss how to handle it in Qemu.

Overall it affects all Sapphire Rapids platforms. Fully disabling nested
translation in the kernel just for this rare vulnerability sounds an overkill.

So we decide to enforce the exclusive check (RO in stage-2 vs. nesting)
in the kernel and expose the restriction to userspace so the VMM can
choose which one to enable based on its own requirement.

At least this looks a reasonable tradeoff to some proprietary VMMs
which never adds RO mappings in stage-2 today.

But we do want to get Qemu support nested translation on those
platform as the widely-used reference VMM!

Do you see any major oversight before pursuing such change in Qemu
e.g. having a way for the user to opt-out adding RO mappings in stage-2?
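For illustration only, the exclusive check and the VMM-side opt-out described above can be modelled roughly as follows. This is a sketch under assumptions, not code from the series: only the flag name IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 comes from the cover letter, while its bit position, both function names, and the (gpa, size, writable) region tuples are hypothetical.

```python
# Assumption: the bit position is illustrative; only the flag name is taken
# from the patch description.
IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 = 1 << 0


def kernel_allows(hw_flags, nesting, mappings):
    """Hypothetical model of the exclusive check (RO in stage-2 vs. nesting).

    mappings is a list of (gpa, size, writable) tuples. On affected hardware,
    nesting and read-only stage-2 mappings must not be combined.
    """
    affected = bool(hw_flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17)
    has_ro = any(not writable for _gpa, _size, writable in mappings)
    return not (affected and nesting and has_ro)


def vmm_stage2_plan(hw_flags, want_nesting, regions):
    """Hypothetical VMM policy: drop RO regions when nesting is wanted on
    affected hardware, otherwise map every region unchanged."""
    affected = bool(hw_flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17)
    if affected and want_nesting:
        return [r for r in regions if r[2]]
    return list(regions)
```

With guest RAM mapped writable and a vbios region mapped RO, the plan for an affected platform keeps only the RAM region, and that plan then passes the modelled check. A VMM that prefers to keep its RO mappings would call the same helper with want_nesting set to False.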
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, May 26, 2023 2:07 AM
>
> On Wed, 24 May 2023 08:59:43 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Thursday, May 11, 2023 10:51 PM
> > >
> > > The first Intel platform supporting nested translation is Sapphire
> > > Rapids which, unfortunately, has a hardware errata [2] requiring special
> > > treatment. This errata happens when a stage-1 page table page (either
> > > level) is located in a stage-2 read-only region. In that case the IOMMU
> > > hardware may ignore the stage-2 RO permission and still set the A/D bit
> > > in stage-1 page table entries during page table walking.
> > >
> > > A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
> > > this errata to userspace. With that restriction the user should either
> > > disable nested translation to favor RO stage-2 mappings or ensure no
> > > RO stage-2 mapping to enable nested translation.
> > >
> > > Intel-iommu driver is armed with necessary checks to prevent such mix
> > > in patch10 of this series.
> > >
> > > Qemu currently does add RO mappings though. The vfio agent in Qemu
> > > simply maps all valid regions in the GPA address space which certainly
> > > includes RO regions e.g. vbios.
> > >
> > > In reality we don't know a usage relying on DMA reads from the BIOS
> > > region. Hence finding a way to allow user opt-out RO mappings in
> > > Qemu might be an acceptable tradeoff. But how to achieve it cleanly
> > > needs more discussion in Qemu community. For now we just hacked Qemu
> > > to test.
> > >
> >
> > Hi, Alex,
> >
> > Want to touch base on your thoughts about this errata before we
> > actually go to discuss how to handle it in Qemu.
> >
> > Overall it affects all Sapphire Rapids platforms. Fully disabling nested
> > translation in the kernel just for this rare vulnerability sounds an overkill.
> >
> > So we decide to enforce the exclusive check (RO in stage-2 vs. nesting)
> > in the kernel and expose the restriction to userspace so the VMM can
> > choose which one to enable based on its own requirement.
> >
> > At least this looks a reasonable tradeoff to some proprietary VMMs
> > which never adds RO mappings in stage-2 today.
> >
> > But we do want to get Qemu support nested translation on those
> > platform as the widely-used reference VMM!
> >
> > Do you see any major oversight before pursuing such change in Qemu
> > e.g. having a way for the user to opt-out adding RO mappings in stage-2?
>
On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:

> At least this looks a reasonable tradeoff to some proprietary VMMs
> which never adds RO mappings in stage-2 today.

What is the reason for the RO anyhow?

Would it be so bad if it was DMA mapped as RW due to the errata?

Jason
On Mon, 29 May 2023 15:43:02 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
>
> > At least this looks a reasonable tradeoff to some proprietary VMMs
> > which never adds RO mappings in stage-2 today.
>
> What is the reason for the RO anyhow?
>
> Would it be so bad if it was DMA mapped as RW due to the errata?

What if it's the zero page? Thanks,

Alex
On Mon, May 29, 2023 at 06:16:44PM -0600, Alex Williamson wrote:
> On Mon, 29 May 2023 15:43:02 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
> >
> > > At least this looks a reasonable tradeoff to some proprietary VMMs
> > > which never adds RO mappings in stage-2 today.
> >
> > What is the reason for the RO anyhow?
> >
> > Would it be so bad if it was DMA mapped as RW due to the errata?
>
> What if it's the zero page? Thanks,

GUP doesn't return the zero page if FOLL_WRITE is specified.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, May 30, 2023 2:43 AM
>
> On Wed, May 24, 2023 at 08:59:43AM +0000, Tian, Kevin wrote:
>
> > At least this looks a reasonable tradeoff to some proprietary VMMs
> > which never adds RO mappings in stage-2 today.
>
> What is the reason for the RO anyhow?

vfio simply follows the permission in the CPU address space. vBIOS
regions are marked as RO there, hence the RO permission is also carried
over to the vfio mappings.

>
> Would it be so bad if it was DMA mapped as RW due to the errata?
>

Think of a scenario where the vbios memory is shared by multiple qemu
instances then RW allows a malicious VM to modify the shared content
then potentially attacking other VMs.

Skipping the mapping is safest in this regard.
On Wed, Jun 14, 2023 at 08:07:30AM +0000, Tian, Kevin wrote:

> think of a scenario where the vbios memory is shared by multiple qemu
> instances then RW allows a malicious VM to modify the shared content
> then potentially attacking other VMs.

qemu would have to map the vbios as MAP_PRIVATE WRITE before the iommu
side could map it writable, so this is not a real worry.

Jason
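The MAP_PRIVATE point can be checked from userspace. The snippet below is a small Linux/Unix-only sketch, not Qemu code: a temporary file stands in for a shared vbios image, and the function name and payload are made up for the example. It maps the file MAP_PRIVATE with write permission, writes through the mapping, and confirms that the file contents are unchanged, because the write only touches a private copy-on-write page.

```python
import mmap
import os
import tempfile


def private_write_leaves_file_intact(payload: bytes = b"vbios-image") -> bool:
    """Write through a MAP_PRIVATE mapping and report whether the backing
    file kept its original contents while the mapping saw the change."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.close(fd)
        # Open read-only: a private writable mapping does not need write
        # access to the underlying file.
        with open(path, "rb") as f:
            m = mmap.mmap(
                f.fileno(),
                len(payload),
                flags=mmap.MAP_PRIVATE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE,
            )
            m[0:1] = b"X"  # triggers copy-on-write of the page
            changed_in_mapping = m[:] != payload
            m.close()
        with open(path, "rb") as f:
            on_disk = f.read()
        return changed_in_mapping and on_disk == payload
    finally:
        os.unlink(path)
```

Any other process mapping the same file continues to see the original bytes, which is why a writable private mapping does not by itself expose shared content to modification.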
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 14, 2023 7:53 PM
>
> On Wed, Jun 14, 2023 at 08:07:30AM +0000, Tian, Kevin wrote:
>
> > think of a scenario where the vbios memory is shared by multiple qemu
> > instances then RW allows a malicious VM to modify the shared content
> > then potentially attacking other VMs.
>
> qemu would have to map the vbios as MAP_PRIVATE WRITE before the iommu
> side could map it writable, so this is not a real worry.
>

Makes sense. But IMHO it's still safer to reduce the permission (RO->NP)
than to increase it (RO->RW) when faithfully emulating bare metal
behavior is impossible, especially when there is no real usage counting
on it.