vfio/type1: Unpin zero pages

Message ID 166182871735.3518559.8884121293045337358.stgit@omen (mailing list archive)
State New, archived
Series vfio/type1: Unpin zero pages

Commit Message

Alex Williamson Aug. 30, 2022, 3:05 a.m. UTC
There's currently a reference count leak on the zero page.  We increment
the reference via pin_user_pages_remote(), but the page is later handled
as an invalid/reserved page, therefore it's not accounted against the
user and not unpinned by our put_pfn().

Introducing special zero page handling in put_pfn() would resolve the
leak, but without accounting of the zero page, a single user could
still create enough mappings to generate a reference count overflow.

The zero page is always resident, so for our purposes there's no reason
to keep it pinned.  Therefore, add a loop to walk pages returned from
pin_user_pages_remote() and unpin any zero pages.

Cc: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Sean Christopherson Aug. 30, 2022, 5:34 a.m. UTC | #1
On Mon, Aug 29, 2022, Alex Williamson wrote:
> There's currently a reference count leak on the zero page.  We increment
> the reference via pin_user_pages_remote(), but the page is later handled
> as an invalid/reserved page, therefore it's not accounted against the
> user and not unpinned by our put_pfn().

Heh, kvm_pfn_to_refcounted_page() all over again.  is_zone_device_page() is the
other known case where a PageReserved page is refcounted.  But as KVM's comment
calls out, KVM's list was built through trial and error.
David Hildenbrand Aug. 30, 2022, 7:59 a.m. UTC | #2
On 30.08.22 05:05, Alex Williamson wrote:
> There's currently a reference count leak on the zero page.  We increment
> the reference via pin_user_pages_remote(), but the page is later handled
> as an invalid/reserved page, therefore it's not accounted against the
> user and not unpinned by our put_pfn().
> 
> Introducing special zero page handling in put_pfn() would resolve the
> leak, but without accounting of the zero page, a single user could
> still create enough mappings to generate a reference count overflow.
> 
> The zero page is always resident, so for our purposes there's no reason
> to keep it pinned.  Therefore, add a loop to walk pages returned from
> pin_user_pages_remote() and unpin any zero pages.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: stable@vger.kernel.org
> Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index db516c90a977..8706482665d1 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>  	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
>  				    pages, NULL, NULL);
>  	if (ret > 0) {
> +		int i;
> +
> +		/*
> +		 * The zero page is always resident, we don't need to pin it
> +		 * and it falls into our invalid/reserved test so we don't
> +		 * unpin in put_pfn().  Unpin all zero pages in the batch here.
> +		 */
> +		for (i = 0 ; i < ret; i++) {
> +			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
> +				unpin_user_page(pages[i]);
> +		}
> +
>  		*pfn = page_to_pfn(pages[0]);
>  		goto done;
>  	}
> 
> 

As discussed offline, for the shared zeropage (that's not even
refcounted when mapped into a process), this makes perfect sense to me.

Good question raised by Sean if ZONE_DEVICE pages might similarly be
problematic. But for them, we cannot simply always unpin here.

Reviewed-by: David Hildenbrand <david@redhat.com>
Alex Williamson Aug. 30, 2022, 3:11 p.m. UTC | #3
On Tue, 30 Aug 2022 09:59:33 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 30.08.22 05:05, Alex Williamson wrote:
> > There's currently a reference count leak on the zero page.  We increment
> > the reference via pin_user_pages_remote(), but the page is later handled
> > as an invalid/reserved page, therefore it's not accounted against the
> > user and not unpinned by our put_pfn().
> > 
> > Introducing special zero page handling in put_pfn() would resolve the
> > leak, but without accounting of the zero page, a single user could
> > still create enough mappings to generate a reference count overflow.
> > 
> > The zero page is always resident, so for our purposes there's no reason
> > to keep it pinned.  Therefore, add a loop to walk pages returned from
> > pin_user_pages_remote() and unpin any zero pages.
> > 
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: stable@vger.kernel.org
> > Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c |   12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index db516c90a977..8706482665d1 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >  	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
> >  				    pages, NULL, NULL);
> >  	if (ret > 0) {
> > +		int i;
> > +
> > +		/*
> > +		 * The zero page is always resident, we don't need to pin it
> > +		 * and it falls into our invalid/reserved test so we don't
> > +		 * unpin in put_pfn().  Unpin all zero pages in the batch here.
> > +		 */
> > +		for (i = 0 ; i < ret; i++) {
> > +			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
> > +				unpin_user_page(pages[i]);
> > +		}
> > +
> >  		*pfn = page_to_pfn(pages[0]);
> >  		goto done;
> >  	}
> > 
> >   
> 
> As discussed offline, for the shared zeropage (that's not even
> refcounted when mapped into a process), this makes perfect sense to me.
> 
> Good question raised by Sean if ZONE_DEVICE pages might similarly be
> problematic. But for them, we cannot simply always unpin here.

What sort of VM mapping would give me ZONE_DEVICE pages?  Thanks,

Alex
David Hildenbrand Aug. 30, 2022, 3:43 p.m. UTC | #4
On 30.08.22 17:11, Alex Williamson wrote:
> On Tue, 30 Aug 2022 09:59:33 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 30.08.22 05:05, Alex Williamson wrote:
>>> There's currently a reference count leak on the zero page.  We increment
>>> the reference via pin_user_pages_remote(), but the page is later handled
>>> as an invalid/reserved page, therefore it's not accounted against the
>>> user and not unpinned by our put_pfn().
>>>
>>> Introducing special zero page handling in put_pfn() would resolve the
>>> leak, but without accounting of the zero page, a single user could
>>> still create enough mappings to generate a reference count overflow.
>>>
>>> The zero page is always resident, so for our purposes there's no reason
>>> to keep it pinned.  Therefore, add a loop to walk pages returned from
>>> pin_user_pages_remote() and unpin any zero pages.
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: stable@vger.kernel.org
>>> Reported-by: Luboslav Pivarc <lpivarc@redhat.com>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c |   12 ++++++++++++
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index db516c90a977..8706482665d1 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -558,6 +558,18 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>>>  	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
>>>  				    pages, NULL, NULL);
>>>  	if (ret > 0) {
>>> +		int i;
>>> +
>>> +		/*
>>> +		 * The zero page is always resident, we don't need to pin it
>>> +		 * and it falls into our invalid/reserved test so we don't
>>> +		 * unpin in put_pfn().  Unpin all zero pages in the batch here.
>>> +		 */
>>> +		for (i = 0 ; i < ret; i++) {
>>> +			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
>>> +				unpin_user_page(pages[i]);
>>> +		}
>>> +
>>>  		*pfn = page_to_pfn(pages[0]);
>>>  		goto done;
>>>  	}
>>>
>>>   
>>
>> As discussed offline, for the shared zeropage (that's not even
>> refcounted when mapped into a process), this makes perfect sense to me.
>>
>> Good question raised by Sean if ZONE_DEVICE pages might similarly be
>> problematic. But for them, we cannot simply always unpin here.
> 
> What sort of VM mapping would give me ZONE_DEVICE pages?  Thanks,

I think one approach is mmap'ing a devdax device. To test without actual
NVDIMM hardware, there are ways to simulate it even on bare metal using
the "memmap=" kernel parameter.

https://nvdimm.wiki.kernel.org/

Alternatively, you can use an emulated nvdimm device under QEMU -- but
then you'd have to run VFIO inside the VM. I know (that you know) that
there are ways to get that working, but it certainly requires more effort :)

... let me know if you need any tips&tricks.
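
As an aside, a rough userspace sketch of the devdax approach (illustrative
only, not from this thread; it assumes /dev/dax0.0 exists, e.g. after booting
with "memmap=4G!12G" and putting the namespace into devdax mode, and that
"container" is an already configured type1 container fd):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static int map_devdax(int container, __u64 iova, size_t size)
{
	struct vfio_iommu_type1_dma_map map;
	void *vaddr;
	int fd = open("/dev/dax0.0", O_RDWR);

	if (fd < 0)
		return -1;

	/* devdax wants mappings aligned to the device alignment (often 2M) */
	vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (vaddr == MAP_FAILED)
		return -1;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (__u64)(unsigned long)vaddr;
	map.iova  = iova;
	map.size  = size;

	/* vaddr_get_pfns() would then see ZONE_DEVICE pages for this range */
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}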
Jason Gunthorpe Sept. 2, 2022, 12:53 a.m. UTC | #5
On Tue, Aug 30, 2022 at 05:43:43PM +0200, David Hildenbrand wrote:

> I think one approach is mmap'ing a devdax device. To test without actual
> NVDIMM hardware, there are ways to simulate it even on bare metal using
> the "memmap=" kernel parameter.
> 
> https://nvdimm.wiki.kernel.org/
> 
> Alternatively, you can use an emulated nvdimm device under QEMU -- but
> then you'd have to run VFIO inside the VM. I know (that you know) that
> there are ways to get that working, but it certainly requires more effort :)
> 
> ... let me know if you need any tips&tricks.

currently pin_user_pages will not return ZONE_DEVICE pages anyhow, so
it is not relevant to this snippet

You might see them on that follow_pfn flow though :\

Jason
Tian, Kevin Sept. 2, 2022, 8:24 a.m. UTC | #6
Hi, Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, August 30, 2022 11:06 AM
> 
> There's currently a reference count leak on the zero page.  We increment
> the reference via pin_user_pages_remote(), but the page is later handled
> as an invalid/reserved page, therefore it's not accounted against the
> user and not unpinned by our put_pfn().
> 
> Introducing special zero page handling in put_pfn() would resolve the
> leak, but without accounting of the zero page, a single user could
> still create enough mappings to generate a reference count overflow.
> 
> The zero page is always resident, so for our purposes there's no reason
> to keep it pinned.  Therefore, add a loop to walk pages returned from
> pin_user_pages_remote() and unpin any zero pages.
> 

We found an interesting issue on zero page and wonder whether we
should instead find a way to not use zero page in vfio pinning path.

The observation - the 'pc.bios' region (0xfffc0000) is always mapped
RO to zero page in the IOMMU page table even after the mapping in
the CPU page table has been changed after Qemu loads the guest
bios image into that region (which is mmap'ed as RW).

In reality this may not cause a real problem as I don't expect any sane
usage would want to DMA read from the bios region. This is probably
the reason why nobody has ever noticed it.

But in concept it is incorrect.
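
A hypothetical userspace sequence illustrating the above (names and the exact
flags are illustrative, not actual Qemu code):

static void demo_stale_ro_mapping(int container, const void *bios_image,
				  size_t rom_size)
{
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	char *rom;

	/* ROM backing is anonymous memory; reads fault in the shared zero page */
	rom = mmap(NULL, rom_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* The RO DMA mapping is established before the image is loaded, so the
	 * longterm pin grabs the zero page and the IOMMU PTEs point at it. */
	map.vaddr = (__u64)(unsigned long)rom;
	map.iova  = 0xfffc0000;
	map.size  = rom_size;
	map.flags = VFIO_DMA_MAP_FLAG_READ;
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* Loading the image afterwards COWs in new pages on the CPU side only;
	 * the RO IOMMU mapping still references the zero page. */
	memcpy(rom, bios_image, rom_size);
}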

Fixing Qemu to update/setup the VFIO mapping after loading the bios
image could mitigate this problem. But we never documented such an ABI
restriction on RO mappings, and in concept the pinning semantics
should apply to all paths (RO in DMA and RW in CPU) which the
application uses to access the pinned memory, hence the sequence
shouldn't matter from the user's p.o.v.

And old Qemu/VMM still have this issue.

Having a notifier to implicitly fix the IOMMU mapping within the
kernel violates the semantics of pinning, and makes vfio page
accounting even more tricky.

So I wonder instead of continuing to fix trickiness around the zero
page whether it is a better idea to pursue allocating a normal
page from the beginning for pinned RO mappings?

Thanks
Kevin
David Hildenbrand Sept. 2, 2022, 8:32 a.m. UTC | #7
On 02.09.22 10:24, Tian, Kevin wrote:
> Hi, Alex,
> 
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Tuesday, August 30, 2022 11:06 AM
>>
>> There's currently a reference count leak on the zero page.  We increment
>> the reference via pin_user_pages_remote(), but the page is later handled
>> as an invalid/reserved page, therefore it's not accounted against the
>> user and not unpinned by our put_pfn().
>>
>> Introducing special zero page handling in put_pfn() would resolve the
>> leak, but without accounting of the zero page, a single user could
>> still create enough mappings to generate a reference count overflow.
>>
>> The zero page is always resident, so for our purposes there's no reason
>> to keep it pinned.  Therefore, add a loop to walk pages returned from
>> pin_user_pages_remote() and unpin any zero pages.
>>
> 
> We found an interesting issue on zero page and wonder whether we
> should instead find a way to not use zero page in vfio pinning path.
> 
> The observation - the 'pc.bios' region (0xfffc0000) is always mapped
> RO to zero page in the IOMMU page table even after the mapping in
> the CPU page table has been changed after Qemu loads the guest
> bios image into that region (which is mmap'ed as RW).
> 
> In reality this may not cause real problem as I don't expect any sane
> usage would want to DMA read from the bios region. This is probably
> the reason why nobody ever notes it.
> 
> But in concept it is incorrect.
> 
> Fixing Qemu to update/setup the VFIO mapping after loading the bios
> image could mitigate this problem. But we never document such ABI
> restriction on RO mappings and in concept the pinning semantics
> should apply to all paths (RO in DMA and RW in CPU) which the
> application uses to access the pinned memory hence the sequence
> shouldn't matter from user p.o.v
> 
> And old Qemu/VMM still have this issue.
> 
> Having a notifier to implicitly fix the IOMMU mapping within the
> kernel violates the semantics of pinning, and makes vfio page
> accounting even more tricky.
> 
> So I wonder instead of continuing to fix trickiness around the zero
> page whether it is a better idea to pursue allocating a normal
> page from the beginning for pinned RO mappings?

That's precisely what I am working on. For example, that's required to get
rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA:

See
https://lore.kernel.org/all/5593cbb7-eb29-82f0-490e-dd72ceafff9b@redhat.com/
for some more details.


The concern I discussed with Peter and Alex offline for VFIO is that we
might end up charging more anon pages against the user's MEMLOCK limit
with that change and might degrade existing setups.

I do wonder if that's a real issue, though. One approach would be to
warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
for a while. Of course, that only works if we assume that such pinned
zeropages are only extremely rarely longterm-pinned for a single VM
instance by VFIO.
Tian, Kevin Sept. 2, 2022, 9:09 a.m. UTC | #8
> From: David Hildenbrand <david@redhat.com>
> Sent: Friday, September 2, 2022 4:32 PM
> 
> On 02.09.22 10:24, Tian, Kevin wrote:
> > Hi, Alex,
> >
> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> Sent: Tuesday, August 30, 2022 11:06 AM
> >>
> >> There's currently a reference count leak on the zero page.  We increment
> >> the reference via pin_user_pages_remote(), but the page is later handled
> >> as an invalid/reserved page, therefore it's not accounted against the
> >> user and not unpinned by our put_pfn().
> >>
> >> Introducing special zero page handling in put_pfn() would resolve the
> >> leak, but without accounting of the zero page, a single user could
> >> still create enough mappings to generate a reference count overflow.
> >>
> >> The zero page is always resident, so for our purposes there's no reason
> >> to keep it pinned.  Therefore, add a loop to walk pages returned from
> >> pin_user_pages_remote() and unpin any zero pages.
> >>
> >
> > We found an interesting issue on zero page and wonder whether we
> > should instead find a way to not use zero page in vfio pinning path.
> >
> > The observation - the 'pc.bios' region (0xfffc0000) is always mapped
> > RO to zero page in the IOMMU page table even after the mapping in
> > the CPU page table has been changed after Qemu loads the guest
> > bios image into that region (which is mmap'ed as RW).
> >
> > In reality this may not cause real problem as I don't expect any sane
> > usage would want to DMA read from the bios region. This is probably
> > the reason why nobody ever notes it.
> >
> > But in concept it is incorrect.
> >
> > Fixing Qemu to update/setup the VFIO mapping after loading the bios
> > image could mitigate this problem. But we never document such ABI
> > restriction on RO mappings and in concept the pinning semantics
> > should apply to all paths (RO in DMA and RW in CPU) which the
> > application uses to access the pinned memory hence the sequence
> > shouldn't matter from user p.o.v
> >
> > And old Qemu/VMM still have this issue.
> >
> > Having a notifier to implicitly fix the IOMMU mapping within the
> > kernel violates the semantics of pinning, and makes vfio page
> > accounting even more tricky.
> >
> > So I wonder instead of continuing to fix trickiness around the zero
> > page whether it is a better idea to pursue allocating a normal
> > page from the beginning for pinned RO mappings?
> 
> That's precisely what I am working. For example, that's required to get
> rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA:
> 
> See
> https://lore.kernel.org/all/5593cbb7-eb29-82f0-490e-
> dd72ceafff9b@redhat.com/
> for some more details.

Yes, this is exactly what I'm looking for!

> 
> 
> The concerns I discussed with Peter and Alex offline for VFIO is that we
> might end up charging more anon pages with that change to the MEMLOCK
> limit of the user and might degrade existing setups.
> 
> I do wonder if that's a real issue, though. One approach would be to
> warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
> for a while. Of course, that only works if we assume that such pinned
> zeropages are only extremely rarely longterm-pinned for a single VM
> instance by VFIO.
> 

For now I believe native VFIO applications like dpdk don't use RO mappings.
For Qemu the only relevant usage is for the virtual bios, which is in the
hundreds of kilobytes to a few megabytes.

And future Qemu may implement an option to skip adding RO
mappings if the MEMLOCK limit becomes a concern. So far I haven't seen
a real VFIO usage requiring such RO mappings in the DMA path, probably
not until cross-VM p2p where the exporter VM wants to restrict the access
permissions on certain MMIO ranges accessed by the importer VM.
But in that case it's not about memory, hence no impact on the MEMLOCK
limit.
Jason Gunthorpe Sept. 6, 2022, 11:30 p.m. UTC | #9
On Fri, Sep 02, 2022 at 10:32:01AM +0200, David Hildenbrand wrote:

> > So I wonder instead of continuing to fix trickiness around the zero
> > page whether it is a better idea to pursue allocating a normal
> > page from the beginning for pinned RO mappings?
> 
> That's precisely what I am working. For example, that's required to get
> rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA:

And all these issues are exactly why RDMA uses FOLL_FORCE and it is,
IMHO, a simple bug that VFIO does not.

> I do wonder if that's a real issue, though. One approach would be to
> warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
> for a while. Of course, that only works if we assume that such pinned
> zeropages are only extremely rarely longterm-pinned for a single VM
> instance by VFIO.

I'm confused, doesn't vfio increment the memlock for every page of VA
it pins? Why would it matter if the page was COW'd or not? It is
already accounted for today as though it was a unique page.

IOW if we add FOLL_FORCE it won't change the value of the memlock.

Jason
David Hildenbrand Sept. 7, 2022, 9 a.m. UTC | #10
On 07.09.22 01:30, Jason Gunthorpe wrote:
> On Fri, Sep 02, 2022 at 10:32:01AM +0200, David Hildenbrand wrote:
> 
>>> So I wonder instead of continuing to fix trickiness around the zero
>>> page whether it is a better idea to pursue allocating a normal
>>> page from the beginning for pinned RO mappings?
>>
>> That's precisely what I am working. For example, that's required to get
>> rid of FOLL_FORCE|FOLL_WRITE for taking a R/O pin as done by RDMA:
> 
> And all these issues are exactly why RDMA uses FOLL_FORCE and it is,
> IMHO, a simple bug that VFIO does not.

I consider the BUG to be that our longterm page pinning behaves the way it
currently does, not that we're not using the FOLL_FORCE flag here.

But it doesn't matter, I'm working on fixing/cleaning it up.

> 
>> I do wonder if that's a real issue, though. One approach would be to
>> warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
>> for a while. Of course, that only works if we assume that such pinned
>> zeropages are only extremely rarely longterm-pinned for a single VM
>> instance by VFIO.
> 
> I'm confused, doesn't vfio increment the memlock for every page of VA
> it pins? Why would it matter if the page was COW'd or not? It is
> already accounted for today as though it was a unique page.
> 
> IOW if we add FOLL_FORCE it won't change the value of the memlock.

I only briefly skimmed over the code; Alex might be able to provide more
details and correct me if I'm wrong:

vfio_pin_pages_remote() contains a comment:

"Reserved pages aren't counted against the user, externally pinned pages 
are already counted against the user."

is_invalid_reserved_pfn() should return "true" for the shared zeropage 
and prevent us from accounting it via vfio_lock_acct(). Otherwise, 
vfio_find_vpfn() seems to be in place to avoid double-accounting pages.
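
Roughly, the path being referred to looks like this (paraphrased, not
verbatim kernel code):

	/*
	 * In vfio_pin_pages_remote(): reserved/invalid pfns and pfns already
	 * tracked as externally pinned are not charged to locked memory.
	 */
	if (!rsvd && !vfio_find_vpfn(dma, iova))
		lock_acct++;

Since is_invalid_reserved_pfn() currently returns true for the shared
zeropage, it is neither accounted here nor unpinned later by put_pfn().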
Jason Gunthorpe Sept. 7, 2022, 12:48 p.m. UTC | #11
On Wed, Sep 07, 2022 at 11:00:21AM +0200, David Hildenbrand wrote:
> > > I do wonder if that's a real issue, though. One approach would be to
> > > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
> > > for a while. Of course, that only works if we assume that such pinned
> > > zeropages are only extremely rarely longterm-pinned for a single VM
> > > instance by VFIO.
> > 
> > I'm confused, doesn't vfio increment the memlock for every page of VA
> > it pins? Why would it matter if the page was COW'd or not? It is
> > already accounted for today as though it was a unique page.
> > 
> > IOW if we add FOLL_FORCE it won't change the value of the memlock.
> 
> I only briefly skimmed over the code Alex might be able to provide more
> details and correct me if I'm wrong:
> 
> vfio_pin_pages_remote() contains a comment:
> 
> "Reserved pages aren't counted against the user, externally pinned pages are
> already counted against the user."
> 
> is_invalid_reserved_pfn() should return "true" for the shared zeropage and
> prevent us from accounting it via vfio_lock_acct(). Otherwise,
> vfio_find_vpfn() seems to be in place to avoid double-accounting pages.

is_invalid_reserved_pfn() is supposed to return 'true' for PFNs that
cannot be returned from pin_user_pages():

/*
 * Some mappings aren't backed by a struct page, for example an mmap'd
 * MMIO range for our own or another device.  These use a different
 * pfn conversion and shouldn't be tracked as locked pages.
 * For compound pages, any driver that sets the reserved bit in head
 * page needs to set the reserved bit in all subpages to be safe.
 */
static bool is_invalid_reserved_pfn(unsigned long pfn)

What it is talking about by 'different pfn conversion' is the
follow_fault_pfn() path, not the PUP path.

So, it is some way for VFIO to keep track of when a pfn was returned
by PUP vs follow_fault_pfn(), because it treats those two paths quite
differently.

I lost track of what the original cause of this bug is - however AFAIK
pin_user_pages() used to succeed when the zero page is mapped.

No other PUP user calls this follow_fault_pfn() hacky path, and we
expect things like O_DIRECT to work properly even when reading from VA
that has the zero page mapped.

So, if we go back far enough in the git history we will find a case
where PUP is returning something for the zero page, and that something
caused is_invalid_reserved_pfn() == false since VFIO did work at some
point.

IMHO we should simply go back to the historical behavior - make
is_invalid_reserved_pfn() check for the zero_pfn and return
false. Meaning that PUP returned it.
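
Something along these lines, presumably (a sketch of that suggestion, not a
posted patch; the pfn_valid()/PageReserved() part is the existing logic):

static bool is_invalid_reserved_pfn(unsigned long pfn)
{
	/*
	 * pin_user_pages() returns the shared zero page, so treat it as a
	 * normal page: account it and unpin it through put_pfn().
	 */
	if (is_zero_pfn(pfn))
		return false;

	if (pfn_valid(pfn))
		return PageReserved(pfn_to_page(pfn));

	return true;
}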

Jason
Alex Williamson Sept. 7, 2022, 3:55 p.m. UTC | #12
On Wed, 7 Sep 2022 09:48:59 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Sep 07, 2022 at 11:00:21AM +0200, David Hildenbrand wrote:
> > > > I do wonder if that's a real issue, though. One approach would be to
> > > > warn the VFIO users and allow for slightly exceeding the MEMLOCK limit
> > > > for a while. Of course, that only works if we assume that such pinned
> > > > zeropages are only extremely rarely longterm-pinned for a single VM
> > > > instance by VFIO.  
> > > 
> > > I'm confused, doesn't vfio increment the memlock for every page of VA
> > > it pins? Why would it matter if the page was COW'd or not? It is
> > > already accounted for today as though it was a unique page.
> > > 
> > > IOW if we add FOLL_FORCE it won't change the value of the memlock.  
> > 
> > I only briefly skimmed over the code Alex might be able to provide more
> > details and correct me if I'm wrong:
> > 
> > vfio_pin_pages_remote() contains a comment:
> > 
> > "Reserved pages aren't counted against the user, externally pinned pages are
> > already counted against the user."
> > 
> > is_invalid_reserved_pfn() should return "true" for the shared zeropage and
> > prevent us from accounting it via vfio_lock_acct(). Otherwise,
> > vfio_find_vpfn() seems to be in place to avoid double-accounting pages.  
> 
> is_invalid_reserved_pfn() is supposed to return 'true' for PFNs that
> cannot be returned from pin_user_pages():
> 
> /*
>  * Some mappings aren't backed by a struct page, for example an mmap'd
>  * MMIO range for our own or another device.  These use a different
>  * pfn conversion and shouldn't be tracked as locked pages.
>  * For compound pages, any driver that sets the reserved bit in head
>  * page needs to set the reserved bit in all subpages to be safe.
>  */
> static bool is_invalid_reserved_pfn(unsigned long pfn)
> 
> What it is talking about by 'different pfn conversion' is the
> follow_fault_pfn() path, not the PUP path.
> 
> So, it is some way for VFIO to keep track of when a pfn was returned
> by PUP vs follow_fault_pfn(), because it treats those two paths quite
> differently.
> 
> I lost track of what the original cause of this bug is - however AFAIK
> pin_user_pages() used to succeed when the zero page is mapped.

It does currently, modulo getting broken occasionally.

> No other PUP user call this follow_fault_pfn() hacky path, and we
> expect things like O_DIRECT to work properly even when reading from VA
> that has the zero page mapped.

zero page shouldn't take that path, we get the pages via PUP.

> So, if we go back far enough in the git history we will find a case
> where PUP is returning something for the zero page, and that something
> caused is_invalid_reserved_pfn() == false since VFIO did work at some
> point.

Can we assume that?  It takes a while for a refcount leak on the zero
page to cause an overflow.  My assumption is that it's never worked, we
pin zero pages, don't account them against the locked memory limits
because our is_invalid_reserved_pfn() test returns true, and therefore
we don't unpin them.

> IHMO we should simply go back to the historical behavior - make
> is_invalid_reserved_pfn() check for the zero_pfn and return
> false. Meaning that PUP returned it.

We've never explicitly tested for zero_pfn and as David notes,
accounting the zero page against the user's locked memory limits has
user visible consequences.  VMs that worked with a specific locked
memory limit may no longer work.  Logically, this seems to be the one
case of duplicate accounting that we get right relative to the user's
locked memory limit and the current implementation of pinning the zero
page.  We're not locking any resources that aren't effectively already
locked.  Thanks,

Alex
Jason Gunthorpe Sept. 7, 2022, 4:40 p.m. UTC | #13
On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote:

> > So, if we go back far enough in the git history we will find a case
> > where PUP is returning something for the zero page, and that something
> > caused is_invalid_reserved_pfn() == false since VFIO did work at some
> > point.
> 
> Can we assume that?  It takes a while for a refcount leak on the zero
> page to cause an overflow.  My assumption is that it's never worked, we
> pin zero pages, don't account them against the locked memory limits
> because our is_invalid_reserved_pfn() test returns true, and therefore
> we don't unpin them.

Oh, you think it has been buggy forever? That is not great..
 
> > IHMO we should simply go back to the historical behavior - make
> > is_invalid_reserved_pfn() check for the zero_pfn and return
> > false. Meaning that PUP returned it.
> 
> We've never explicitly tested for zero_pfn and as David notes,
> accounting the zero page against the user's locked memory limits has
> user visible consequences.  VMs that worked with a specific locked
> memory limit may no longer work.  

Yes, but the question is whether that is a strict ABI we have to preserve,
because if you take that view it also means that, because VFIO has this
historical bug, David can't fix the FOLL_FORCE issue either.

If the view holds for memlock then it should hold for cgroups
also. This means the kernel can never change anything about
GFP_KERNEL_ACCOUNT allocations because it might impact userspace
having set a tight limit there.

It means we can't fix the bug that VFIO is using the wrong storage for
memlock.

It means qemu can't change anything about how it sets up this memory,
ie Kevin's idea to change the ordering.

On the other hand the "abi break" we are talking about is that a user
might have to increase a configured limit in a config file after a
kernel upgrade.

IDK what consensus exists here, I've never heard of anyone saying
these limits are a strict ABI like this.. I think at least for cgroup
that would be so invasive as to immediately be off the table.

Jason
Alex Williamson Sept. 7, 2022, 6:56 p.m. UTC | #14
On Wed, 7 Sep 2022 13:40:52 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote:
> 
> > > So, if we go back far enough in the git history we will find a case
> > > where PUP is returning something for the zero page, and that something
> > > caused is_invalid_reserved_pfn() == false since VFIO did work at some
> > > point.  
> > 
> > Can we assume that?  It takes a while for a refcount leak on the zero
> > page to cause an overflow.  My assumption is that it's never worked, we
> > pin zero pages, don't account them against the locked memory limits
> > because our is_invalid_reserved_pfn() test returns true, and therefore
> > we don't unpin them.  
> 
> Oh, you think it has been buggy forever? That is not great..
>  
> > > IHMO we should simply go back to the historical behavior - make
> > > is_invalid_reserved_pfn() check for the zero_pfn and return
> > > false. Meaning that PUP returned it.  
> > 
> > We've never explicitly tested for zero_pfn and as David notes,
> > accounting the zero page against the user's locked memory limits has
> > user visible consequences.  VMs that worked with a specific locked
> > memory limit may no longer work.    
> 
> Yes, but the question is if that is a strict ABI we have to preserve,
> because if you take that view it also means because VFIO has this
> historical bug that David can't fix the FOLL_FORCE issue either.
> 
> If the view holds for memlock then it should hold for cgroups
> also. This means the kernel can never change anything about
> GFP_KERNEL_ACCOUNT allocations because it might impact userspace
> having set a tight limit there.
> 
> It means we can't fix the bug that VFIO is using the wrong storage for
> memlock.
> 
> It means qemu can't change anything about how it sets up this memory,
> ie Kevin's idea to change the ordering.
> 
> On the other hand the "abi break" we are talking about is that a user
> might have to increase a configured limit in a config file after a
> kernel upgrade.
> 
> IDK what consensus exists here, I've never heard of anyone saying
> these limits are a strict ABI like this.. I think at least for cgroup
> that would be so invasive as to immediately be off the table.

I thought we'd already agreed that we were stuck with locked_vm for
type1 and any compatibility mode of type1 due to this.  Native iommufd
support can do the right thing since userspace will need to account for
various new usage models anyway.

I've raised the issue with David for the zero page accounting, but I
don't know what the solution is.  libvirt automatically adds a 1GB
fudge factor to the VM locked memory limits to account for things like
ROM mappings, or at least the non-zeropage backed portion of those
ROMs.  I think that most management tools have adopted similar, so the
majority of users shouldn't notice.  However this won't cover all
users, so we certainly risk breaking userspace if we introduce hard
page accounting of zero pages.

I think David suggested possibly allowing some degree of exceeding
locked memory limits for zero page COWs.  Potentially type1 could do
this as well; special case handling with some heuristically determined,
module parameter defined limit.  We might also consider whether we
could just ignore zero page mappings, maybe with an optional "strict"
mode module option to generate an errno on such mappings.  Thanks,

Alex
David Hildenbrand Sept. 7, 2022, 7:14 p.m. UTC | #15
On 07.09.22 20:56, Alex Williamson wrote:
> On Wed, 7 Sep 2022 13:40:52 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
>> On Wed, Sep 07, 2022 at 09:55:52AM -0600, Alex Williamson wrote:
>>
>>>> So, if we go back far enough in the git history we will find a case
>>>> where PUP is returning something for the zero page, and that something
>>>> caused is_invalid_reserved_pfn() == false since VFIO did work at some
>>>> point.
>>>
>>> Can we assume that?  It takes a while for a refcount leak on the zero
>>> page to cause an overflow.  My assumption is that it's never worked, we
>>> pin zero pages, don't account them against the locked memory limits
>>> because our is_invalid_reserved_pfn() test returns true, and therefore
>>> we don't unpin them.
>>
>> Oh, you think it has been buggy forever? That is not great..
>>   
>>>> IHMO we should simply go back to the historical behavior - make
>>>> is_invalid_reserved_pfn() check for the zero_pfn and return
>>>> false. Meaning that PUP returned it.
>>>
>>> We've never explicitly tested for zero_pfn and as David notes,
>>> accounting the zero page against the user's locked memory limits has
>>> user visible consequences.  VMs that worked with a specific locked
>>> memory limit may no longer work.
>>
>> Yes, but the question is if that is a strict ABI we have to preserve,
>> because if you take that view it also means because VFIO has this
>> historical bug that David can't fix the FOLL_FORCE issue either.
>>
>> If the view holds for memlock then it should hold for cgroups
>> also. This means the kernel can never change anything about
>> GFP_KERNEL_ACCOUNT allocations because it might impact userspace
>> having set a tight limit there.
>>
>> It means we can't fix the bug that VFIO is using the wrong storage for
>> memlock.
>>
>> It means qemu can't change anything about how it sets up this memory,
>> ie Kevin's idea to change the ordering.
>>
>> On the other hand the "abi break" we are talking about is that a user
>> might have to increase a configured limit in a config file after a
>> kernel upgrade.
>>
>> IDK what consensus exists here, I've never heard of anyone saying
>> these limits are a strict ABI like this.. I think at least for cgroup
>> that would be so invasive as to immediately be off the table.
> 
> I thought we'd already agreed that we were stuck with locked_vm for
> type1 and any compatibility mode of type1 due to this.  Native iommufd
> support can do the right thing since userspace will need to account for
> various new usage models anyway.
> 
> I've raised the issue with David for the zero page accounting, but I
> don't know what the solution is.  libvirt automatically adds a 1GB
> fudge factor to the VM locked memory limits to account for things like
> ROM mappings, or at least the non-zeropage backed portion of those
> ROMs.  I think that most management tools have adopted similar, so the
> majority of users shouldn't notice.  However this won't cover all
> users, so we certainly risk breaking userspace if we introduce hard
> page accounting of zero pages.
> 
> I think David suggested possibly allowing some degree of exceeding
> locked memory limits for zero page COWs.  Potentially type1 could do
> this as well; special case handling with some heuristically determined,
> module parameter defined limit.  We might also consider whether we
> could just ignore zero page mappings, maybe with a optional "strict"
> mode module option to generate an errno on such mappings.  Thanks,

So far I have played with these ideas:

a) allow slightly exceeding the limit and warn

b) weird vfio kernel parameter to control the zeropage behavior (old vs.
    new)

I certainly have in mind that we might need some toggle for vfio.
Jason Gunthorpe Sept. 7, 2022, 7:55 p.m. UTC | #16
On Wed, Sep 07, 2022 at 12:56:27PM -0600, Alex Williamson wrote:

> I thought we'd already agreed that we were stuck with locked_vm for
> type1 and any compatibility mode of type1 due to this.  Native iommufd
> support can do the right thing since userspace will need to account for
> various new usage models anyway.

We did, that was for the iommufd situation (which will also hit the
same zeropage issue, sigh) - this discussion is about fixing a bug in
vfio and what many consider a bug in GUP.

My point is I'm still not convinced we can really consider these
limits as ABI because it opens a Pandora's box of kernel limitations.

> I've raised the issue with David for the zero page accounting, but I
> don't know what the solution is.  libvirt automatically adds a 1GB
> fudge factor to the VM locked memory limits to account for things like
> ROM mappings, or at least the non-zeropage backed portion of those
> ROMs.  I think that most management tools have adopted similar, so the
> majority of users shouldn't notice.  However this won't cover all
> users, so we certainly risk breaking userspace if we introduce hard
> page accounting of zero pages.

It sounds like things will be fine. 1GB fudge is pretty big.

For things like this ABI compat is not about absolute compatibility in
the face of any userspace, but about real-world compatibility: "does
something that actually exists break?"

So I would be happier if we had an actual deployed thing that breaks..
I would be inclined to go with the simple fix and rely on the
fudge. If someone does come with an actual break then let's do one of
the workarounds.

Given the whole thing is ostensibly for security it is better to keep
it simple and sane than to poke it full of holes.

> module parameter defined limit.  We might also consider whether we
> could just ignore zero page mappings, maybe with a optional "strict"
> mode module option to generate an errno on such mappings.  Thanks,

Once GUP is fixed vfio won't see the zero pages anymore :( That really
limits the choices for a work around :(

Jason
Alex Williamson Sept. 7, 2022, 8:24 p.m. UTC | #17
On Wed, 7 Sep 2022 16:55:50 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Sep 07, 2022 at 12:56:27PM -0600, Alex Williamson wrote:
> 
> > I thought we'd already agreed that we were stuck with locked_vm for
> > type1 and any compatibility mode of type1 due to this.  Native iommufd
> > support can do the right thing since userspace will need to account for
> > various new usage models anyway.  
> 
> We did, that was for the iommufd situation (which will also hit the
> same zeropage issue, sigh) - this discussion is about fixing a bug in
> vfio and what many consider a bug in GUP.
> 
> My point is I'm still not convinced we can really consider these
> limits as ABI because it opens a pandoras box of kernel limitations.
> 
> > I've raised the issue with David for the zero page accounting, but I
> > don't know what the solution is.  libvirt automatically adds a 1GB
> > fudge factor to the VM locked memory limits to account for things like
> > ROM mappings, or at least the non-zeropage backed portion of those
> > ROMs.  I think that most management tools have adopted similar, so the
> > majority of users shouldn't notice.  However this won't cover all
> > users, so we certainly risk breaking userspace if we introduce hard
> > page accounting of zero pages.  
> 
> It sounds like things will be fine. 1GB fudge is pretty big.
> 
> For things like this ABI compat is not about absolute compatability in
> the face of any userspace, but a real-world compatibility "does
> something that actually exists break?"

Magic 8 ball says "Cannot predict now."  Unfortunately there's a lot of
roll-your-own scripting that goes on in this space, many users reject
the overhead of things like libvirt, let alone deeper management
stacks.  Private clouds have constraints that might also generate custom
solutions.  I can't predict the degree to which libvirt is a canonical
example.
 
> So I would be happier if we had an actual deployed thing that breaks..
> I would be inclined to go with the simple fix and rely on the
> fudge. If someone does come with an actual break then lets do one of
> the work arounds.

We should probably have a workaround in our pocket for such a case.

Also, I want to clarify, is this a recommendation relative to the
stable patch proposed here, or only once we get rid of shared zero page
pinning?  We can't simply do accounting on the shared zero page since a
single user can overflow the refcount.

> Given the whole thing is obstensibly for security it is better to keep
> it simple and sane then to poke it full of holes.
> 
> > module parameter defined limit.  We might also consider whether we
> > could just ignore zero page mappings, maybe with a optional "strict"
> > mode module option to generate an errno on such mappings.  Thanks,  
> 
> Once GUP is fixed vfio won't see the zero pages anymore :( That really
> limits the choices for a work around :(

I was afraid of that.  Thanks,

Alex
Jason Gunthorpe Sept. 7, 2022, 11:07 p.m. UTC | #18
On Wed, Sep 07, 2022 at 02:24:16PM -0600, Alex Williamson wrote:

> Also, I want to clarify, is this a recommendation relative to the
> stable patch proposed here, or only once we get rid of shared zero page
> pinning?  We can't simply do accounting on the shared zero page since a
> single user can overflow the refcount.

Yes, here I would account properly in a way that keeps working for
future GUP changes because if something goes wrong with this simple
patch it has a simple fix.

Trialing it will get some good data to inform what David's patch
should do.

Overall I have the feeling that a small group of people might grumble
that their limits break, but with a limit adjustment they can probably
trivially move on. It would be very interesting to see if someone
feels like the issue is important enough to try and get something
changed.

You could also fix it by just using FOLL_FORCE (like RDMA/io_uring
does), which fixes the larger issue Kevin noted: that the ROM doesn't
become visible to DMA.
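
For reference, a sketch of what that could look like in vaddr_get_pfns()
(illustrative, mirroring the RDMA pattern David mentioned; not a tested patch):

	/*
	 * Force a writable fault even for read-only mappings so COW is broken
	 * up front and the shared zero page is never longterm-pinned.
	 */
	ret = pin_user_pages_remote(mm, vaddr, npages,
				    flags | FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM,
				    pages, NULL, NULL);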

Jason
Alex Williamson Sept. 8, 2022, 1:10 a.m. UTC | #19
On Wed, 7 Sep 2022 20:07:02 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Sep 07, 2022 at 02:24:16PM -0600, Alex Williamson wrote:
> 
> > Also, I want to clarify, is this a recommendation relative to the
> > stable patch proposed here, or only once we get rid of shared zero page
> > pinning?  We can't simply do accounting on the shared zero page since a
> > single user can overflow the refcount.  
> 
> Yes, here I would account properly in a way that keeps working for
> future GUP changes because if something goes wrong with this simple
> patch it has a simple fix.
> 
> Trialing it will get some good data to inform what David's patch
> should do.
> 
> Overall have the feeling that a small group of people might grumble
> that their limits break, but with a limit adjustment they can probably
> trivially move on. It would be very interesting to see if someone
> feels like the issue is important enough to try and get something
> changed.
> 
> You could also fix it by just using FOLL_FORCE (like RDMA/io_uring
> does), which fixes the larger issue Kevin noted that the ROM doesn't
> become visible to DMA.

That's only a theoretical problem; I suspect there are absolutely zero
cases where this is an actual problem.  Doing anything other than
simply fixing the leak for stable seems reckless; we're not actually
consuming resources that need to be accounted until David's changes
come through, and we risk breaking users on a broad scale.  IMO, the
fix proposed here is the correct first step and we can start
experimenting with accounting the zero page moving forward.  Thanks,

Alex

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index db516c90a977..8706482665d1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -558,6 +558,18 @@  static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
 				    pages, NULL, NULL);
 	if (ret > 0) {
+		int i;
+
+		/*
+		 * The zero page is always resident, we don't need to pin it
+		 * and it falls into our invalid/reserved test so we don't
+		 * unpin in put_pfn().  Unpin all zero pages in the batch here.
+		 */
+		for (i = 0 ; i < ret; i++) {
+			if (unlikely(is_zero_pfn(page_to_pfn(pages[i]))))
+				unpin_user_page(pages[i]);
+		}
+
 		*pfn = page_to_pfn(pages[0]);
 		goto done;
 	}