mbox series

[0/2] KVM: do not assume PTE is writable after follow_pfn

Message ID 20210205103259.42866-1-pbonzini@redhat.com (mailing list archive)
Headers show
Series KVM: do not assume PTE is writable after follow_pfn | expand

Message

Paolo Bonzini Feb. 5, 2021, 10:32 a.m. UTC
This series is the first step towards fixing KVM's usage of follow_pfn.
The immediate fix here is that KVM is not checking the writability of
the PFN, which actually dates back to way before the introduction of
follow_pfn in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults
before giving up", 2016-07-05).  There are more changes needed to
invalidate gfn-to-pfn caches from MMU notifiers, but this issue will
be tackled later.

A more fundamental issue however is that the follow_pfn function is
basically impossible to use correctly.  Almost all users for example
are assuming that the page is writable; KVM was not alone in this
mistake.  follow_pte, despite not being exported for modules, is a
far saner API.  Therefore, patch 1 simplifies follow_pte a bit and
makes it available to modules.

Please review and possibly ack for inclusion in the KVM tree,
thanks!

Paolo


Paolo Bonzini (2):
  mm: provide a sane PTE walking API for modules
  KVM: do not assume PTE is writable after follow_pfn

 arch/s390/pci/pci_mmio.c |  2 +-
 fs/dax.c                 |  5 +++--
 include/linux/mm.h       |  6 ++++--
 mm/memory.c              | 35 ++++++++++++++++++++++++++++++-----
 virt/kvm/kvm_main.c      | 15 ++++++++++++---
 5 files changed, 50 insertions(+), 13 deletions(-)

Comments

Peter Xu Feb. 5, 2021, 6:14 p.m. UTC | #1
On Fri, Feb 05, 2021 at 05:32:57AM -0500, Paolo Bonzini wrote:
> This series is the first step towards fixing KVM's usage of follow_pfn.
> The immediate fix here is that KVM is not checking the writability of
> the PFN, which actually dates back to way before the introduction of
> follow_pfn in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults
> before giving up", 2016-07-05).  There are more changes needed to
> invalidate gfn-to-pfn caches from MMU notifiers, but this issue will
> be tackled later.
> 
> A more fundamental issue however is that the follow_pfn function is
> basically impossible to use correctly.  Almost all users for example
> are assuming that the page is writable; KVM was not alone in this
> mistake.  follow_pte, despite not being exported for modules, is a
> far saner API.  Therefore, patch 1 simplifies follow_pte a bit and
> makes it available to modules.
> 
> Please review and possibly ack for inclusion in the KVM tree,
> thanks!

FWIW, the patches look correct to me (if with patch 2 report fixed):

Reviewed-by: Peter Xu <peterx@redhat.com>

But I do have a question on why dax as the only user needs to pass in the
notifier to follow_pte() for initialization.

Indeed there're a difference on start/end init of the notifier depending on
whether it's a huge pmd but since there's the pmdp passed over too so I assume
the caller should know how to init the notifier anyways.

The thing is at least in current code we could send meaningless notifiers,
e.g., in follow_pte():

	if (range) {
		mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
					address & PAGE_MASK,
					(address & PAGE_MASK) + PAGE_SIZE);
		mmu_notifier_invalidate_range_start(range);
	}
	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
	if (!pte_present(*ptep))
		goto unlock;
	*ptepp = ptep;
	return 0;
unlock:
	pte_unmap_unlock(ptep, *ptlp);
	if (range)
		mmu_notifier_invalidate_range_end(range);

The notify could be meaningless if we do the "goto unlock" path.

Ideally it seems we can move the notifier code to caller (as what most mmu
notifier users do) and we can also avoid doing that if follow_pte returned
-EINVAL.

Thanks,
Jason Gunthorpe Feb. 8, 2021, 6:51 p.m. UTC | #2
On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:

> But I do have a question on why dax as the only user needs to pass in the
> notifier to follow_pte() for initialization.

Not sure either, why does DAX opencode something very much like
page_mkclean() with dax_entry_mkclean()?

Also it looks like DAX uses the wrong notifier, it calls
MMU_NOTIFY_CLEAR but page_mkclean_one() uses
MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??

page_mkclean() has some technique to make the notifier have the right
size without becoming entangled in the PTL locks..

Jason
Peter Xu Feb. 8, 2021, 10:02 p.m. UTC | #3
On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> 
> > But I do have a question on why dax as the only user needs to pass in the
> > notifier to follow_pte() for initialization.
> 
> Not sure either, why does DAX opencode something very much like
> page_mkclean() with dax_entry_mkclean()?
> 
> Also it looks like DAX uses the wrong notifier, it calls
> MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> 
> page_mkclean() has some technique to make the notifier have the right
> size without becoming entangled in the PTL locks..

Right.  I guess it's because dax doesn't have "struct page*" on the back, so it
can't directly use page_mkclean().  However the whole logic does look very
similar, so maybe they can be merged in some way.

Thanks,
Jason Gunthorpe Feb. 8, 2021, 11:26 p.m. UTC | #4
On Mon, Feb 08, 2021 at 05:02:59PM -0500, Peter Xu wrote:
> On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> > 
> > > But I do have a question on why dax as the only user needs to pass in the
> > > notifier to follow_pte() for initialization.
> > 
> > Not sure either, why does DAX opencode something very much like
> > page_mkclean() with dax_entry_mkclean()?
> > 
> > Also it looks like DAX uses the wrong notifier, it calls
> > MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> > MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> > 
> > page_mkclean() has some technique to make the notifier have the right
> > size without becoming entangled in the PTL locks..
> 
> Right.  I guess it's because dax doesn't have "struct page*" on the
> back, so it

It doesn't? I thought DAX cases did?

Jason
Peter Xu Feb. 9, 2021, 12:23 a.m. UTC | #5
On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 08, 2021 at 05:02:59PM -0500, Peter Xu wrote:
> > On Mon, Feb 08, 2021 at 02:51:33PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Feb 05, 2021 at 01:14:11PM -0500, Peter Xu wrote:
> > > 
> > > > But I do have a question on why dax as the only user needs to pass in the
> > > > notifier to follow_pte() for initialization.
> > > 
> > > Not sure either, why does DAX opencode something very much like
> > > page_mkclean() with dax_entry_mkclean()?
> > > 
> > > Also it looks like DAX uses the wrong notifier, it calls
> > > MMU_NOTIFY_CLEAR but page_mkclean_one() uses
> > > MMU_NOTIFY_PROTECTION_PAGE for the same PTE modification sequence??
> > > 
> > > page_mkclean() has some technique to make the notifier have the right
> > > size without becoming entangled in the PTL locks..
> > 
> > Right.  I guess it's because dax doesn't have "struct page*" on the
> > back, so it
> 
> It doesn't? I thought DAX cases did?

I'm not familiar with dax at all.. but it seems so: e.g. dax_iomap_pte_fault()
looks like the general fault handler for dax mappings, in which there's calls
to things like vmf_insert_mixed_mkwrite() trying to install ptes with raw pfn.
Or I could also be missing something very important..  Thanks,
Christoph Hellwig Feb. 9, 2021, 8:19 a.m. UTC | #6
On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
> > > page_mkclean() has some technique to make the notifier have the right
> > > size without becoming entangled in the PTL locks..
> > 
> > Right.  I guess it's because dax doesn't have "struct page*" on the
> > back, so it
> 
> It doesn't? I thought DAX cases did?

File system DAX has a struct page, device DAX does not.  Which means
everything using iomap should have a page available, but i'm adding
Dan as he should know the details :)
Joao Martins Feb. 9, 2021, 10:02 a.m. UTC | #7
On 2/9/21 8:19 AM, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
>>>> page_mkclean() has some technique to make the notifier have the right
>>>> size without becoming entangled in the PTL locks..
>>>
>>> Right.  I guess it's because dax doesn't have "struct page*" on the
>>> back, so it
>>
>> It doesn't? I thought DAX cases did?
> 
> File system DAX has a struct page, device DAX does not.  Which means
> everything using iomap should have a page available, but i'm adding
> Dan as he should know the details :)
> 
Both fsdax and device-dax have struct page. It's the pmem block device
that doesn't.