Message ID: 166182871735.3518559.8884121293045337358.stgit@omen
State: New, archived
Series: vfio/type1: Unpin zero pages
On Mon, Aug 29, 2022, Alex Williamson wrote: > There's currently a reference count leak on the zero page. We increment > the reference via pin_user_pages_remote(), but the page is later handled > as an invalid/reserved page, therefore it's not accounted against the > user and not unpinned by our put_pfn(). Heh, kvm_pfn_to_refcounted_page() all over again. is_zone_device_page() is the other known case where a PageReserved page is refcounted. But as KVM's comment calls out, KVM's list was built through trial and error.
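For orientation, a minimal sketch of the two "PageReserved yet refcounted" cases named in this message, using existing kernel helpers; the wrapper function itself is made up for illustration and is not from the thread:

#include <linux/mm.h>
#include <linux/memremap.h>
#include <linux/pgtable.h>

/*
 * Illustrative only: reserved pages whose refcount is nevertheless taken
 * by get/pin_user_pages(), per the cases mentioned in this thread (the
 * shared zero page and ZONE_DEVICE pages). Not an exhaustive list -- as
 * noted above, KVM's equivalent list was built through trial and error.
 */
static bool pfn_is_refcounted_reserved(unsigned long pfn)
{
	return is_zero_pfn(pfn) ||
	       (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)));
}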
On 30.08.22 05:05, Alex Williamson wrote: > There's currently a reference count leak on the zero page. We increment > the reference via pin_user_pages_remote(), but the page is later handled > as an invalid/reserved page, therefore it's not accounted against the > user and not unpinned by our put_pfn(). > > Introducing special zero page handling in put_pfn() would resolve the > leak, but without accounting of the zero page, a single user could > still create enough mappings to generate a reference count overflow. > > The zero page is always resident, so for our purposes there's no reason > to keep it pinned. Therefore, add a loop to walk pages returned from > pin_user_pages_remote() and unpin any zero pages. > > Cc: David Hildenbrand <david@redhat.com> > Cc: stable@vger.kernel.org > Reported-by: Luboslav Pivarc <lpivarc@redhat.com> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > drivers/vfio/vfio_iommu_type1.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index db516c90a977..8706482665d1 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr, > ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM, > pages, NULL, NULL); > if (ret > 0) { > + int i; > + > + /* > + * The zero page is always resident, we don't need to pin it > + * and it falls into our invalid/reserved test so we don't > + * unpin in put_pfn(). Unpin all zero pages in the batch here. > + */ > + for (i = 0 ; i < ret; i++) { > + if (unlikely(is_zero_pfn(page_to_pfn(pages[i])))) > + unpin_user_page(pages[i]); > + } > + > *pfn = page_to_pfn(pages[0]); > goto done; > } > > As discussed offline, for the shared zeropage (that's not even refcounted when mapped into a process), this makes perfect sense to me. Good question raised by Sean if ZONE_DEVICE pages might similarly be problematic. But for them, we cannot simply always unpin here. Reviewed-by: David Hildenbrand <david@redhat.com>
On Tue, 30 Aug 2022 09:59:33 +0200 David Hildenbrand <david@redhat.com> wrote: > On 30.08.22 05:05, Alex Williamson wrote: > > There's currently a reference count leak on the zero page. We increment > > the reference via pin_user_pages_remote(), but the page is later handled > > as an invalid/reserved page, therefore it's not accounted against the > > user and not unpinned by our put_pfn(). > > > > Introducing special zero page handling in put_pfn() would resolve the > > leak, but without accounting of the zero page, a single user could > > still create enough mappings to generate a reference count overflow. > > > > The zero page is always resident, so for our purposes there's no reason > > to keep it pinned. Therefore, add a loop to walk pages returned from > > pin_user_pages_remote() and unpin any zero pages. > > > > Cc: David Hildenbrand <david@redhat.com> > > Cc: stable@vger.kernel.org > > Reported-by: Luboslav Pivarc <lpivarc@redhat.com> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > drivers/vfio/vfio_iommu_type1.c | 12 ++++++++++++ > > 1 file changed, 12 insertions(+) > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > index db516c90a977..8706482665d1 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr, > > ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM, > > pages, NULL, NULL); > > if (ret > 0) { > > + int i; > > + > > + /* > > + * The zero page is always resident, we don't need to pin it > > + * and it falls into our invalid/reserved test so we don't > > + * unpin in put_pfn(). Unpin all zero pages in the batch here. > > + */ > > + for (i = 0 ; i < ret; i++) { > > + if (unlikely(is_zero_pfn(page_to_pfn(pages[i])))) > > + unpin_user_page(pages[i]); > > + } > > + > > *pfn = page_to_pfn(pages[0]); > > goto done; > > } > > > > > > As discussed offline, for the shared zeropage (that's not even > refcounted when mapped into a process), this makes perfect sense to me. > > Good question raised by Sean if ZONE_DEVICE pages might similarly be > problematic. But for them, we cannot simply always unpin here. What sort of VM mapping would give me ZONE_DEVICE pages? Thanks, Alex
On 30.08.22 17:11, Alex Williamson wrote: > On Tue, 30 Aug 2022 09:59:33 +0200 > David Hildenbrand <david@redhat.com> wrote: > >> On 30.08.22 05:05, Alex Williamson wrote: >>> There's currently a reference count leak on the zero page. We increment >>> the reference via pin_user_pages_remote(), but the page is later handled >>> as an invalid/reserved page, therefore it's not accounted against the >>> user and not unpinned by our put_pfn(). >>> >>> Introducing special zero page handling in put_pfn() would resolve the >>> leak, but without accounting of the zero page, a single user could >>> still create enough mappings to generate a reference count overflow. >>> >>> The zero page is always resident, so for our purposes there's no reason >>> to keep it pinned. Therefore, add a loop to walk pages returned from >>> pin_user_pages_remote() and unpin any zero pages. >>> >>> Cc: David Hildenbrand <david@redhat.com> >>> Cc: stable@vger.kernel.org >>> Reported-by: Luboslav Pivarc <lpivarc@redhat.com> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>> --- >>> drivers/vfio/vfio_iommu_type1.c | 12 ++++++++++++ >>> 1 file changed, 12 insertions(+) >>> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c >>> index db516c90a977..8706482665d1 100644 >>> --- a/drivers/vfio/vfio_iommu_type1.c >>> +++ b/drivers/vfio/vfio_iommu_type1.c >>> @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr, >>> ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM, >>> pages, NULL, NULL); >>> if (ret > 0) { >>> + int i; >>> + >>> + /* >>> + * The zero page is always resident, we don't need to pin it >>> + * and it falls into our invalid/reserved test so we don't >>> + * unpin in put_pfn(). Unpin all zero pages in the batch here. >>> + */ >>> + for (i = 0 ; i < ret; i++) { >>> + if (unlikely(is_zero_pfn(page_to_pfn(pages[i])))) >>> + unpin_user_page(pages[i]); >>> + } >>> + >>> *pfn = page_to_pfn(pages[0]); >>> goto done; >>> } >>> >>> >> >> As discussed offline, for the shared zeropage (that's not even >> refcounted when mapped into a process), this makes perfect sense to me. >> >> Good question raised by Sean if ZONE_DEVICE pages might similarly be >> problematic. But for them, we cannot simply always unpin here. > > What sort of VM mapping would give me ZONE_DEVICE pages? Thanks, I think one approach is mmap'ing a devdax device. To test without actual NVDIMM hardware, there are ways to simulate it even on bare metal using the "memmap=" kernel parameter. https://nvdimm.wiki.kernel.org/ Alternatively, you can use an emulated nvdimm device under QEMU -- but then you'd have to run VFIO inside the VM. I know (that you know) that there are ways to get that working, but it certainly requires more effort :) ... let me know if you need any tips&tricks.
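To make the devdax route concrete, a hedged userspace sketch (not from the thread): a devdax namespace shows up as a character device that can be mmap'ed directly; the device path, size, and alignment below are assumptions and depend on how the (real or memmap=-simulated) namespace was configured.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch: map a devdax device to get a VA range backed by ZONE_DEVICE
 * pages, which could then be handed to VFIO_IOMMU_MAP_DMA for testing.
 * "/dev/dax0.0" is a placeholder; devdax typically requires the mapping
 * size/offset to honor the namespace alignment (often 2MiB).
 */
static void *map_devdax(size_t size)
{
	int fd = open("/dev/dax0.0", O_RDWR);

	if (fd < 0)
		return MAP_FAILED;
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}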
On Tue, Aug 30, 2022 at 05:43:43PM +0200, David Hildenbrand wrote: > I think one approach is mmap'ing a devdax device. To test without actual > NVDIMM hardware, there are ways to simulate it even on bare metal using > the "memmap=" kernel parameter. > > https://nvdimm.wiki.kernel.org/ > > Alternatively, you can use an emulated nvdimm device under QEMU -- but > then you'd have to run VFIO inside the VM. I know (that you know) that > there are ways to get that working, but it certainly requires more effort :) > > ... let me know if you need any tips&tricks. Currently pin_user_pages will not return ZONE_DEVICE pages anyhow, so it is not relevant to this snippet. You might see them on that follow_pfn flow though :\ Jason
Hi, Alex, > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Tuesday, August 30, 2022 11:06 AM > > There's currently a reference count leak on the zero page. We increment > the reference via pin_user_pages_remote(), but the page is later handled > as an invalid/reserved page, therefore it's not accounted against the > user and not unpinned by our put_pfn(). > > Introducing special zero page handling in put_pfn() would resolve the > leak, but without accounting of the zero page, a single user could > still create enough mappings to generate a reference count overflow. > > The zero page is always resident, so for our purposes there's no reason > to keep it pinned. Therefore, add a loop to walk pages returned from > pin_user_pages_remote() and unpin any zero pages. > We found an interesting issue on zero page and wonder whether we should instead find a way to not use zero page in vfio pinning path. The observation - the 'pc.bios' region (0xfffc0000) is always mapped RO to zero page in the IOMMU page table even after the mapping in the CPU page table has been changed after Qemu loads the guest bios image into that region (which is mmap'ed as RW). In reality this may not cause real problem as I don't expect any sane usage would want to DMA read from the bios region. This is probably the reason why nobody ever notes it. But in concept it is incorrect. Fixing Qemu to update/setup the VFIO mapping after loading the bios image could mitigate this problem. But we never document such ABI restriction on RO mappings and in concept the pinning semantics should apply to all paths (RO in DMA and RW in CPU) which the application uses to access the pinned memory hence the sequence shouldn't matter from user p.o.v And old Qemu/VMM still have this issue. Having a notifier to implicitly fix the IOMMU mapping within the kernel violates the semantics of pinning, and makes vfio page accounting even more tricky. So I wonder instead of continuing to fix trickiness around the zero page whether it is a better idea to pursue allocating a normal page from the beginning for pinned RO mappings? Thanks Kevin
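For reference, a hedged sketch of the QEMU-side mitigation mentioned above -- re-establishing the VFIO DMA mapping after the BIOS image has been written into the mmap'ed region -- using the standard type1 ioctls. The function and its parameters are illustrative placeholders, with error handling trimmed:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Illustrative only: unmap and re-map the ROM's IOVA range after the
 * image is loaded, so the IOMMU page table stops pointing at the
 * shared zero page. container_fd/vaddr/iova/size are placeholders.
 */
static int vfio_remap_rom(int container_fd, __u64 vaddr, __u64 iova, __u64 size)
{
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova = iova,
		.size = size,
	};
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ,	/* ROM: read-only for DMA */
		.vaddr = vaddr,
		.iova = iova,
		.size = size,
	};

	if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap) < 0)
		return -1;
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}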
On 02.09.22 10:24, Tian, Kevin wrote: > Hi, Alex, > >> From: Alex Williamson <alex.williamson@redhat.com> >> Sent: Tuesday, August 30, 2022 11:06 AM >> >> There's currently a reference count leak on the zero page. We increment >> the reference via pin_user_pages_remote(), but the page is later handled >> as an invalid/reserved page, therefore it's not accounted against the >> user and not unpinned by our put_pfn(). >> >> Introducing special zero page handling in put_pfn() would resolve the >> leak, but without accounting of the zero page, a single user could >> still create enough mappings to generate a reference count overflow. >> >> The zero page is always resident, so for our purposes there's no reason >> to keep it pinned. Therefore, add a loop to walk pages returned from >> pin_user_pages_remote() and unpin any zero pages. >> > > We found an interesting issue on zero page and wonder whether we > should instead find a way to not use zero page in vfio pinning path. > > The observation - the 'pc.bios' region (0xfffc0000) is always mapped > RO to zero page in the IOMMU page table even after the mapping in > the CPU page table has been changed after Qemu loads the guest > bios image into that region (which is mmap'ed as RW). > > In reality this may not cause real problem as I don't expect any sane > usage would want to DMA read from the bios region. This is probably > the reason why nobody ever notes it. > > But in concept it is incorrect. > > Fixing Qemu to update/setup the VFIO mapping after loading the bios > image could mitigate this problem. But we never document such ABI > restriction on RO mappings and in concept the pinning semantics > should apply to all paths (RO in DMA and RW in CPU) which the > application uses to access the pinned memory hence the sequence > shouldn't matter from user p.o.v > > And old Qemu/VMM still have this issue. > > Having a notifier to implicitly fix the IOMMU mapping within the > kernel violates the semantics of pinning, and makes vfio page > accounting even more tricky. > > So I wonder instead of continuing to fix trickiness around the zero > page whether it is a better idea to pursue allocating a normal > page from the beginning for pinned RO mappings? That's precisely what I am working. For example, that's required to get rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA: See https://lore.kernel.org/all/5593cbb7-eb29-82f0-490e-dd72ceafff9b@redhat.com/ for some more details. The concerns I discussed with Peter and Alex offline for VFIO is that we might end up charging more anon pages with that change to the MEMLOCK limit of the user and might degrade existing setups. I do wonder if that's a real issue, though. One approach would be to warn the VFIO users and allow for slightly exceeding the MEMLOCK limit for a while. Of course, that only works if we assume that such pinned zeropages are only extremely rarely longterm-pinned for a single VM instance by VFIO.
> From: David Hildenbrand <david@redhat.com> > Sent: Friday, September 2, 2022 4:32 PM > > On 02.09.22 10:24, Tian, Kevin wrote: > > Hi, Alex, > > > >> From: Alex Williamson <alex.williamson@redhat.com> > >> Sent: Tuesday, August 30, 2022 11:06 AM > >> > >> There's currently a reference count leak on the zero page. We increment > >> the reference via pin_user_pages_remote(), but the page is later handled > >> as an invalid/reserved page, therefore it's not accounted against the > >> user and not unpinned by our put_pfn(). > >> > >> Introducing special zero page handling in put_pfn() would resolve the > >> leak, but without accounting of the zero page, a single user could > >> still create enough mappings to generate a reference count overflow. > >> > >> The zero page is always resident, so for our purposes there's no reason > >> to keep it pinned. Therefore, add a loop to walk pages returned from > >> pin_user_pages_remote() and unpin any zero pages. > >> > > > > We found an interesting issue on zero page and wonder whether we > > should instead find a way to not use zero page in vfio pinning path. > > > > The observation - the 'pc.bios' region (0xfffc0000) is always mapped > > RO to zero page in the IOMMU page table even after the mapping in > > the CPU page table has been changed after Qemu loads the guest > > bios image into that region (which is mmap'ed as RW). > > > > In reality this may not cause real problem as I don't expect any sane > > usage would want to DMA read from the bios region. This is probably > > the reason why nobody ever notes it. > > > > But in concept it is incorrect. > > > > Fixing Qemu to update/setup the VFIO mapping after loading the bios > > image could mitigate this problem. But we never document such ABI > > restriction on RO mappings and in concept the pinning semantics > > should apply to all paths (RO in DMA and RW in CPU) which the > > application uses to access the pinned memory hence the sequence > > shouldn't matter from user p.o.v > > > > And old Qemu/VMM still have this issue. > > > > Having a notifier to implicitly fix the IOMMU mapping within the > > kernel violates the semantics of pinning, and makes vfio page > > accounting even more tricky. > > > > So I wonder instead of continuing to fix trickiness around the zero > > page whether it is a better idea to pursue allocating a normal > > page from the beginning for pinned RO mappings? > > That's precisely what I am working. For example, that's required to get > rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA: > > See > https://lore.kernel.org/all/5593cbb7-eb29-82f0-490e- > dd72ceafff9b@redhat.com/ > for some more details. Yes, this is exactly what I'm looking for! > > > The concerns I discussed with Peter and Alex offline for VFIO is that we > might end up charging more anon pages with that change to the MEMLOCK > limit of the user and might degrade existing setups. > > I do wonder if that's a real issue, though. One approach would be to > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit > for a while. Of course, that only works if we assume that such pinned > zeropages are only extremely rarely longterm-pinned for a single VM > instance by VFIO. > For now I believe native VFIO applications like dpdk don't use RO mapping. For Qemu the only relevant usage is for virtual bios which is in 100's kilo- or several mega-bytes. And for future Qemu it may implement an option to skip adding RO mappings if MEMLOCK limit becomes a concern. 
So far I haven't seen a real VFIO usage requiring such RO mappings in DMA path, probably until cross-VM p2p where the exporter VM wants to restrict the access permission on the certain MMIO range accessed by the importer VM. But in that case it's not about memory hence no impact on MEMLOCK limit.
On Fri, Sep 02, 2022 at 10:32:01AM +0200, David Hildenbrand wrote: > > So I wonder instead of continuing to fix trickiness around the zero > > page whether it is a better idea to pursue allocating a normal > > page from the beginning for pinned RO mappings? > > That's precisely what I am working. For example, that's required to get > rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA: And all these issues are exactly why RDMA uses FOLL_FORCE and it is, IMHO, a simple bug that VFIO does not. > I do wonder if that's a real issue, though. One approach would be to > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit > for a while. Of course, that only works if we assume that such pinned > zeropages are only extremely rarely longterm-pinned for a single VM > instance by VFIO. I'm confused, doesn't vfio increment the memlock for every page of VA it pins? Why would it matter if the page was COW'd or not? It is already accounted for today as though it was a unique page. IOW if we add FOLL_FORCE it won't change the value of the memlock. Jason
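For context, a hedged sketch of the gup-flags pattern being credited to RDMA here (roughly the shape of ib_umem_get()'s flag selection at the time; the function and parameter names are illustrative):

#include <linux/mm.h>

/*
 * Sketch of the RDMA-style R/O pin referenced above: always request
 * write access, and add FOLL_FORCE so the write fault is permitted on
 * mappings the caller can't itself write (e.g. read-only private
 * mappings). Breaking COW up front means the pin never lands on the
 * shared zero page, at the cost of allocating a real page.
 */
static int pin_range_rdma_style(unsigned long start, int npages,
				bool writable, struct page **pages)
{
	unsigned int gup_flags = FOLL_WRITE | FOLL_LONGTERM;

	if (!writable)
		gup_flags |= FOLL_FORCE;

	return pin_user_pages_fast(start, npages, gup_flags, pages);
}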
On 07.09.22 01:30, Jason Gunthorpe wrote: > On Fri, Sep 02, 2022 at 10:32:01AM +0200, David Hildenbrand wrote: > >>> So I wonder instead of continuing to fix trickiness around the zero >>> page whether it is a better idea to pursue allocating a normal >>> page from the beginning for pinned RO mappings? >> >> That's precisely what I am working. For example, that's required to get >> rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA: > > And all these issues are exactly why RDMA uses FOLL_FORCE and it is, > IMHO, a simple bug that VFIO does not. I consider the BUG to be that our longterm page pinning behaves the way it currently does, not that we're not using the FOLL_FORCE flag here. But it doesn't matter, I'm working on fixing/cleaning it up. > >> I do wonder if that's a real issue, though. One approach would be to >> warn the VFIO users and allow for slightly exceeding the MEMLOCK limit >> for a while. Of course, that only works if we assume that such pinned >> zeropages are only extremely rarely longterm-pinned for a single VM >> instance by VFIO. > > I'm confused, doesn't vfio increment the memlock for every page of VA > it pins? Why would it matter if the page was COW'd or not? It is > already accounted for today as though it was a unique page. > > IOW if we add FOLL_FORCE it won't change the value of the memlock. I only briefly skimmed over the code Alex might be able to provide more details and correct me if I'm wrong: vfio_pin_pages_remote() contains a comment: "Reserved pages aren't counted against the user, externally pinned pages are already counted against the user." is_invalid_reserved_pfn() should return "true" for the shared zeropage and prevent us from accounting it via vfio_lock_acct(). Otherwise, vfio_find_vpfn() seems to be in place to avoid double-accounting pages.
On Wed, Sep 07, 2022 at 11:00:21AM +0200, David Hildenbrand wrote: > > > I do wonder if that's a real issue, though. One approach would be to > > > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit > > > for a while. Of course, that only works if we assume that such pinned > > > zeropages are only extremely rarely longterm-pinned for a single VM > > > instance by VFIO. > > > > I'm confused, doesn't vfio increment the memlock for every page of VA > > it pins? Why would it matter if the page was COW'd or not? It is > > already accounted for today as though it was a unique page. > > > > IOW if we add FOLL_FORCE it won't change the value of the memlock. > > I only briefly skimmed over the code Alex might be able to provide more > details and correct me if I'm wrong: > > vfio_pin_pages_remote() contains a comment: > > "Reserved pages aren't counted against the user, externally pinned pages are > already counted against the user." > > is_invalid_reserved_pfn() should return "true" for the shared zeropage and > prevent us from accounting it via vfio_lock_acct(). Otherwise, > vfio_find_vpfn() seems to be in place to avoid double-accounting pages. is_invalid_reserved_pfn() is supposed to return 'true' for PFNs that cannot be returned from pin_user_pages(): /* * Some mappings aren't backed by a struct page, for example an mmap'd * MMIO range for our own or another device. These use a different * pfn conversion and shouldn't be tracked as locked pages. * For compound pages, any driver that sets the reserved bit in head * page needs to set the reserved bit in all subpages to be safe. */ static bool is_invalid_reserved_pfn(unsigned long pfn) What it is talking about by 'different pfn conversion' is the follow_fault_pfn() path, not the PUP path. So, it is some way for VFIO to keep track of when a pfn was returned by PUP vs follow_fault_pfn(), because it treats those two paths quite differently. I lost track of what the original cause of this bug is - however AFAIK pin_user_pages() used to succeed when the zero page is mapped. No other PUP user call this follow_fault_pfn() hacky path, and we expect things like O_DIRECT to work properly even when reading from VA that has the zero page mapped. So, if we go back far enough in the git history we will find a case where PUP is returning something for the zero page, and that something caused is_invalid_reserved_pfn() == false since VFIO did work at some point. IHMO we should simply go back to the historical behavior - make is_invalid_reserved_pfn() check for the zero_pfn and return false. Meaning that PUP returned it. Jason
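For clarity, a sketch of the change being suggested here, applied to VFIO's existing helper (the unmodified part of the body is paraphrased from vfio_iommu_type1.c; the zero-page check is the proposed addition, not merged code):

/*
 * Sketch of the suggestion above: treat the zero page as a normal
 * PUP-returned page so it is accounted via vfio_lock_acct() and
 * released again through put_pfn(), instead of being classified as
 * invalid/reserved.
 */
static bool is_invalid_reserved_pfn(unsigned long pfn)
{
	if (is_zero_pfn(pfn))
		return false;	/* pin_user_pages() did return this one */

	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn));

	return true;
}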
On Wed, 7 Sep 2022 09:48:59 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Wed, Sep 07, 2022 at 11:00:21AM +0200, David Hildenbrand wrote: > > > > I do wonder if that's a real issue, though. One approach would be to > > > > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit > > > > for a while. Of course, that only works if we assume that such pinned > > > > zeropages are only extremely rarely longterm-pinned for a single VM > > > > instance by VFIO. > > > > > > I'm confused, doesn't vfio increment the memlock for every page of VA > > > it pins? Why would it matter if the page was COW'd or not? It is > > > already accounted for today as though it was a unique page. > > > > > > IOW if we add FOLL_FORCE it won't change the value of the memlock. > > > > I only briefly skimmed over the code Alex might be able to provide more > > details and correct me if I'm wrong: > > > > vfio_pin_pages_remote() contains a comment: > > > > "Reserved pages aren't counted against the user, externally pinned pages are > > already counted against the user." > > > > is_invalid_reserved_pfn() should return "true" for the shared zeropage and > > prevent us from accounting it via vfio_lock_acct(). Otherwise, > > vfio_find_vpfn() seems to be in place to avoid double-accounting pages. > > is_invalid_reserved_pfn() is supposed to return 'true' for PFNs that > cannot be returned from pin_user_pages(): > > /* > * Some mappings aren't backed by a struct page, for example an mmap'd > * MMIO range for our own or another device. These use a different > * pfn conversion and shouldn't be tracked as locked pages. > * For compound pages, any driver that sets the reserved bit in head > * page needs to set the reserved bit in all subpages to be safe. > */ > static bool is_invalid_reserved_pfn(unsigned long pfn) > > What it is talking about by 'different pfn conversion' is the > follow_fault_pfn() path, not the PUP path. > > So, it is some way for VFIO to keep track of when a pfn was returned > by PUP vs follow_fault_pfn(), because it treats those two paths quite > differently. > > I lost track of what the original cause of this bug is - however AFAIK > pin_user_pages() used to succeed when the zero page is mapped. It does currently, modulo getting broken occasionally. > No other PUP user call this follow_fault_pfn() hacky path, and we > expect things like O_DIRECT to work properly even when reading from VA > that has the zero page mapped. zero page shouldn't take that path, we get the pages via PUP. > So, if we go back far enough in the git history we will find a case > where PUP is returning something for the zero page, and that something > caused is_invalid_reserved_pfn() == false since VFIO did work at some > point. Can we assume that? It takes a while for a refcount leak on the zero page to cause an overflow. My assumption is that it's never worked, we pin zero pages, don't account them against the locked memory limits because our is_invalid_reserved_pfn() test returns true, and therefore we don't unpin them. > IHMO we should simply go back to the historical behavior - make > is_invalid_reserved_pfn() check for the zero_pfn and return > false. Meaning that PUP returned it. We've never explicitly tested for zero_pfn and as David notes, accounting the zero page against the user's locked memory limits has user visible consequences. VMs that worked with a specific locked memory limit may no longer work. 
Logically, this seems to be the one case of duplicate accounting that we get right relative to the user's locked memory limit and the current implementation of pinning the zero page. We're not locking any resources that aren't effectively already locked. Thanks, Alex
On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote: > > So, if we go back far enough in the git history we will find a case > > where PUP is returning something for the zero page, and that something > > caused is_invalid_reserved_pfn() == false since VFIO did work at some > > point. > > Can we assume that? It takes a while for a refcount leak on the zero > page to cause an overflow. My assumption is that it's never worked, we > pin zero pages, don't account them against the locked memory limits > because our is_invalid_reserved_pfn() test returns true, and therefore > we don't unpin them. Oh, you think it has been buggy forever? That is not great.. > > IHMO we should simply go back to the historical behavior - make > > is_invalid_reserved_pfn() check for the zero_pfn and return > > false. Meaning that PUP returned it. > > We've never explicitly tested for zero_pfn and as David notes, > accounting the zero page against the user's locked memory limits has > user visible consequences. VMs that worked with a specific locked > memory limit may no longer work. Yes, but the question is if that is a strict ABI we have to preserve, because if you take that view it also means because VFIO has this historical bug that David can't fix the FOLL_FORCE issue either. If the view holds for memlock then it should hold for cgroups also. This means the kernel can never change anything about GFP_KERNEL_ACCOUNT allocations because it might impact userspace having set a tight limit there. It means we can't fix the bug that VFIO is using the wrong storage for memlock. It means qemu can't change anything about how it sets up this memory, ie Kevin's idea to change the ordering. On the other hand the "abi break" we are talking about is that a user might have to increase a configured limit in a config file after a kernel upgrade. IDK what consensus exists here, I've never heard of anyone saying these limits are a strict ABI like this.. I think at least for cgroup that would be so invasive as to immediately be off the table. Jason
On Wed, 7 Sep 2022 13:40:52 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote: > > > > So, if we go back far enough in the git history we will find a case > > > where PUP is returning something for the zero page, and that something > > > caused is_invalid_reserved_pfn() == false since VFIO did work at some > > > point. > > > > Can we assume that? It takes a while for a refcount leak on the zero > > page to cause an overflow. My assumption is that it's never worked, we > > pin zero pages, don't account them against the locked memory limits > > because our is_invalid_reserved_pfn() test returns true, and therefore > > we don't unpin them. > > Oh, you think it has been buggy forever? That is not great.. > > > > IHMO we should simply go back to the historical behavior - make > > > is_invalid_reserved_pfn() check for the zero_pfn and return > > > false. Meaning that PUP returned it. > > > > We've never explicitly tested for zero_pfn and as David notes, > > accounting the zero page against the user's locked memory limits has > > user visible consequences. VMs that worked with a specific locked > > memory limit may no longer work. > > Yes, but the question is if that is a strict ABI we have to preserve, > because if you take that view it also means because VFIO has this > historical bug that David can't fix the FOLL_FORCE issue either. > > If the view holds for memlock then it should hold for cgroups > also. This means the kernel can never change anything about > GFP_KERNEL_ACCOUNT allocations because it might impact userspace > having set a tight limit there. > > It means we can't fix the bug that VFIO is using the wrong storage for > memlock. > > It means qemu can't change anything about how it sets up this memory, > ie Kevin's idea to change the ordering. > > On the other hand the "abi break" we are talking about is that a user > might have to increase a configured limit in a config file after a > kernel upgrade. > > IDK what consensus exists here, I've never heard of anyone saying > these limits are a strict ABI like this.. I think at least for cgroup > that would be so invasive as to immediately be off the table. I thought we'd already agreed that we were stuck with locked_vm for type1 and any compatibility mode of type1 due to this. Native iommufd support can do the right thing since userspace will need to account for various new usage models anyway. I've raised the issue with David for the zero page accounting, but I don't know what the solution is. libvirt automatically adds a 1GB fudge factor to the VM locked memory limits to account for things like ROM mappings, or at least the non-zeropage backed portion of those ROMs. I think that most management tools have adopted similar, so the majority of users shouldn't notice. However this won't cover all users, so we certainly risk breaking userspace if we introduce hard page accounting of zero pages. I think David suggested possibly allowing some degree of exceeding locked memory limits for zero page COWs. Potentially type1 could do this as well; special case handling with some heuristically determined, module parameter defined limit. We might also consider whether we could just ignore zero page mappings, maybe with a optional "strict" mode module option to generate an errno on such mappings. Thanks, Alex
On 07.09.22 20:56, Alex Williamson wrote: > On Wed, 7 Sep 2022 13:40:52 -0300 > Jason Gunthorpe <jgg@ziepe.ca> wrote: > >> On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote: >> >>>> So, if we go back far enough in the git history we will find a case >>>> where PUP is returning something for the zero page, and that something >>>> caused is_invalid_reserved_pfn() == false since VFIO did work at some >>>> point. >>> >>> Can we assume that? It takes a while for a refcount leak on the zero >>> page to cause an overflow. My assumption is that it's never worked, we >>> pin zero pages, don't account them against the locked memory limits >>> because our is_invalid_reserved_pfn() test returns true, and therefore >>> we don't unpin them. >> >> Oh, you think it has been buggy forever? That is not great.. >> >>>> IHMO we should simply go back to the historical behavior - make >>>> is_invalid_reserved_pfn() check for the zero_pfn and return >>>> false. Meaning that PUP returned it. >>> >>> We've never explicitly tested for zero_pfn and as David notes, >>> accounting the zero page against the user's locked memory limits has >>> user visible consequences. VMs that worked with a specific locked >>> memory limit may no longer work. >> >> Yes, but the question is if that is a strict ABI we have to preserve, >> because if you take that view it also means because VFIO has this >> historical bug that David can't fix the FOLL_FORCE issue either. >> >> If the view holds for memlock then it should hold for cgroups >> also. This means the kernel can never change anything about >> GFP_KERNEL_ACCOUNT allocations because it might impact userspace >> having set a tight limit there. >> >> It means we can't fix the bug that VFIO is using the wrong storage for >> memlock. >> >> It means qemu can't change anything about how it sets up this memory, >> ie Kevin's idea to change the ordering. >> >> On the other hand the "abi break" we are talking about is that a user >> might have to increase a configured limit in a config file after a >> kernel upgrade. >> >> IDK what consensus exists here, I've never heard of anyone saying >> these limits are a strict ABI like this.. I think at least for cgroup >> that would be so invasive as to immediately be off the table. > > I thought we'd already agreed that we were stuck with locked_vm for > type1 and any compatibility mode of type1 due to this. Native iommufd > support can do the right thing since userspace will need to account for > various new usage models anyway. > > I've raised the issue with David for the zero page accounting, but I > don't know what the solution is. libvirt automatically adds a 1GB > fudge factor to the VM locked memory limits to account for things like > ROM mappings, or at least the non-zeropage backed portion of those > ROMs. I think that most management tools have adopted similar, so the > majority of users shouldn't notice. However this won't cover all > users, so we certainly risk breaking userspace if we introduce hard > page accounting of zero pages. > > I think David suggested possibly allowing some degree of exceeding > locked memory limits for zero page COWs. Potentially type1 could do > this as well; special case handling with some heuristically determined, > module parameter defined limit. We might also consider whether we > could just ignore zero page mappings, maybe with a optional "strict" > mode module option to generate an errno on such mappings. 
Thanks, So far I played with the ideas a) allow slightly exceeding the limit and warn b) weird vfio kernel parameter to control the zeropage behavior (old vs. new) I certainly have in mind that we might need some toggle for vfio.
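As an illustration of idea (b), a minimal sketch of what such a toggle could look like in vfio_iommu_type1; the parameter name, default, and wording are made up and not an actual proposal from this thread:

#include <linux/module.h>

/*
 * Hypothetical knob: let an admin fall back to the old "zero pages are
 * not accounted" behavior if stricter accounting breaks an existing
 * locked-memory limit.
 */
static bool account_zeropages = true;
module_param(account_zeropages, bool, 0444);
MODULE_PARM_DESC(account_zeropages,
		 "Account pinned zero pages against the locked memory limit (default: on)");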
On Wed, Sep 07, 2022 at 12:56:27PM -0600, Alex Williamson wrote: > I thought we'd already agreed that we were stuck with locked_vm for > type1 and any compatibility mode of type1 due to this. Native iommufd > support can do the right thing since userspace will need to account for > various new usage models anyway. We did, that was for the iommufd situation (which will also hit the same zeropage issue, sigh) - this discussion is about fixing a bug in vfio and what many consider a bug in GUP. My point is I'm still not convinced we can really consider these limits as ABI because it opens a Pandora's box of kernel limitations. > I've raised the issue with David for the zero page accounting, but I > don't know what the solution is. libvirt automatically adds a 1GB > fudge factor to the VM locked memory limits to account for things like > ROM mappings, or at least the non-zeropage backed portion of those > ROMs. I think that most management tools have adopted similar, so the > majority of users shouldn't notice. However this won't cover all > users, so we certainly risk breaking userspace if we introduce hard > page accounting of zero pages. It sounds like things will be fine. 1GB fudge is pretty big. For things like this ABI compat is not about absolute compatibility in the face of any userspace, but a real-world compatibility "does something that actually exists break?" So I would be happier if we had an actual deployed thing that breaks.. I would be inclined to go with the simple fix and rely on the fudge. If someone does come with an actual break then let's do one of the work arounds. Given the whole thing is ostensibly for security it is better to keep it simple and sane than to poke it full of holes. > module parameter defined limit. We might also consider whether we > could just ignore zero page mappings, maybe with a optional "strict" > mode module option to generate an errno on such mappings. Thanks, Once GUP is fixed vfio won't see the zero pages anymore :( That really limits the choices for a work around :( Jason
On Wed, 7 Sep 2022 16:55:50 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Wed, Sep 07, 2022 at 12:56:27PM -0600, Alex Williamson wrote: > > > I thought we'd already agreed that we were stuck with locked_vm for > > type1 and any compatibility mode of type1 due to this. Native iommufd > > support can do the right thing since userspace will need to account for > > various new usage models anyway. > > We did, that was for the iommufd situation (which will also hit the > same zeropage issue, sigh) - this discussion is about fixing a bug in > vfio and what many consider a bug in GUP. > > My point is I'm still not convinced we can really consider these > limits as ABI because it opens a Pandora's box of kernel limitations. > > > I've raised the issue with David for the zero page accounting, but I > > don't know what the solution is. libvirt automatically adds a 1GB > > fudge factor to the VM locked memory limits to account for things like > > ROM mappings, or at least the non-zeropage backed portion of those > > ROMs. I think that most management tools have adopted similar, so the > > majority of users shouldn't notice. However this won't cover all > > users, so we certainly risk breaking userspace if we introduce hard > > page accounting of zero pages. > > It sounds like things will be fine. 1GB fudge is pretty big. > > For things like this ABI compat is not about absolute compatibility in > the face of any userspace, but a real-world compatibility "does > something that actually exists break?" Magic 8 ball says "Cannot predict now." Unfortunately there's a lot of roll-your-own scripting that goes on in this space; many users reject the overhead of things like libvirt, let alone deeper management stacks. Private clouds have constraints that might also generate custom solutions. I can't predict the degree to which libvirt is a canonical example. > So I would be happier if we had an actual deployed thing that breaks.. > I would be inclined to go with the simple fix and rely on the > fudge. If someone does come with an actual break then let's do one of > the work arounds. We should probably have a workaround in our pocket for such a case. Also, I want to clarify, is this a recommendation relative to the stable patch proposed here, or only once we get rid of shared zero page pinning? We can't simply do accounting on the shared zero page since a single user can overflow the refcount. > Given the whole thing is ostensibly for security it is better to keep > it simple and sane than to poke it full of holes. > > > module parameter defined limit. We might also consider whether we > > could just ignore zero page mappings, maybe with a optional "strict" > > mode module option to generate an errno on such mappings. Thanks, > > Once GUP is fixed vfio won't see the zero pages anymore :( That really > limits the choices for a work around :( I was afraid of that. Thanks, Alex
On Wed, Sep 07, 2022 at 02:24:16PM -0600, Alex Williamson wrote: > Also, I want to clarify, is this a recommendation relative to the > stable patch proposed here, or only once we get rid of shared zero page > pinning? We can't simply do accounting on the shared zero page since a > single user can overflow the refcount. Yes, here I would account properly in a way that keeps working for future GUP changes because if something goes wrong with this simple patch it has a simple fix. Trialing it will get some good data to inform what David's patch should do. Overall have the feeling that a small group of people might grumble that their limits break, but with a limit adjustment they can probably trivially move on. It would be very interesting to see if someone feels like the issue is important enough to try and get something changed. You could also fix it by just using FOLL_FORCE (like RDMA/io_uring does), which fixes the larger issue Kevin noted that the ROM doesn't become visible to DMA. Jason
On Wed, 7 Sep 2022 20:07:02 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Wed, Sep 07, 2022 at 02:24:16PM -0600, Alex Williamson wrote: > > > Also, I want to clarify, is this a recommendation relative to the > > stable patch proposed here, or only once we get rid of shared zero page > > pinning? We can't simply do accounting on the shared zero page since a > > single user can overflow the refcount. > > Yes, here I would account properly in a way that keeps working for > future GUP changes because if something goes wrong with this simple > patch it has a simple fix. > > Trialing it will get some good data to inform what David's patch > should do. > > Overall have the feeling that a small group of people might grumble > that their limits break, but with a limit adjustment they can probably > trivially move on. It would be very interesting to see if someone > feels like the issue is important enough to try and get something > changed. > > You could also fix it by just using FOLL_FORCE (like RDMA/io_uring > does), which fixes the larger issue Kevin noted that the ROM doesn't > become visible to DMA. That's only a theoretical problem, I suspect there are absolutely zero cases where this is an actual problem. Doing anything other than simply fixing the leak for stable seems reckless, we're not actually consuming resources that need to be accounted until David's changes come through, and we risk breaking users on a broad scale. IMO, the fix proposed here is the correct first step and we can start experimenting with accounting the zero page moving forward. Thanks, Alex
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index db516c90a977..8706482665d1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
 				    pages, NULL, NULL);
 	if (ret > 0) {
+		int i;
+
+		/*
+		 * The zero page is always resident, we don't need to pin it
+		 * and it falls into our invalid/reserved test so we don't
+		 * unpin in put_pfn(). Unpin all zero pages in the batch here.
+		 */
+		for (i = 0 ; i < ret; i++) {
+			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
+				unpin_user_page(pages[i]);
+		}
+
 		*pfn = page_to_pfn(pages[0]);
 		goto done;
 	}
There's currently a reference count leak on the zero page. We increment the reference via pin_user_pages_remote(), but the page is later handled as an invalid/reserved page, therefore it's not accounted against the user and not unpinned by our put_pfn().

Introducing special zero page handling in put_pfn() would resolve the leak, but without accounting of the zero page, a single user could still create enough mappings to generate a reference count overflow.

The zero page is always resident, so for our purposes there's no reason to keep it pinned. Therefore, add a loop to walk pages returned from pin_user_pages_remote() and unpin any zero pages.

Cc: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
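For completeness, a sketch of the alternative the commit message rejects -- special-casing the zero page in put_pfn() -- which would stop the leak but, without accounting, would still let a single user overflow the zero page's refcount. The existing put_pfn() body is paraphrased from vfio_iommu_type1.c; the zero-page branch is hypothetical:

static int put_pfn(unsigned long pfn, int prot)
{
	/*
	 * Hypothetical zero-page special case the commit message rejects:
	 * drop the pin, but return 0 since the page was never counted
	 * against the user's locked memory limit.
	 */
	if (is_zero_pfn(pfn)) {
		unpin_user_page(pfn_to_page(pfn));
		return 0;
	}

	if (!is_invalid_reserved_pfn(pfn)) {
		struct page *page = pfn_to_page(pfn);

		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
		return 1;
	}
	return 0;
}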