| Message ID | 20170724170616.25810-2-ross.zwisler@linux.intel.com (mailing list archive) |
|---|---|
| State | New, archived |
On Mon, Jul 24, 2017 at 11:06:12AM -0600, Ross Zwisler wrote:
> @@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  	if (!pte)
>  		goto out;
>  	retval = -EBUSY;
> -	if (!pte_none(*pte))
> -		goto out_unlock;
> +	if (!pte_none(*pte)) {
> +		if (mkwrite) {
> +			/*
> +			 * For read faults on private mappings the PFN passed
> +			 * in may not match the PFN we have mapped if the
> +			 * mapped PFN is a writeable COW page. In the mkwrite
> +			 * case we are creating a writable PTE for a shared
> +			 * mapping and we expect the PFNs to match.
> +			 */

Can we? I guess it's up to the filesystem whether it wants to reuse the same spot to write data or not. I think your assumption works for ext4 and xfs. I wouldn't be that sure for btrfs or other filesystems with CoW support.
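For reference, the assumption being questioned is that an mkwrite fault hands insert_pfn() the same PFN that the earlier read fault mapped. A minimal sketch of such a handler, assuming the helper added by this patch; example_dax_mkwrite() itself is hypothetical, not code from the series:

```c
#include <linux/mm.h>
#include <linux/pfn_t.h>

/*
 * Hypothetical sketch (not code from this series) of the mkwrite path
 * the comment describes: on a write fault over an already-mapped
 * read-only PTE in a shared DAX mapping, the filesystem resolves the
 * same file offset to the same block, so the PFN passed in is expected
 * to equal the PFN the earlier read fault installed. The WARN_ON_ONCE
 * in insert_pfn() fires only when that expectation is violated.
 */
static int example_dax_mkwrite(struct vm_fault *vmf, pfn_t pfn)
{
	/* Upgrade the existing PTE in place: young, dirty, writable. */
	return vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
}
```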
On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> I guess it's up to the filesystem whether it wants to reuse the same spot
> to write data or not. I think your assumption works for ext4 and xfs. I
> wouldn't be that sure for btrfs or other filesystems with CoW support.

Or XFS with reflinks, for that matter. Reflink currently can't be combined with DAX, but I had a somewhat working version a few months ago.
On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > I guess it's up to the filesystem whether it wants to reuse the same
> > spot to write data or not. I think your assumption works for ext4 and
> > xfs. I wouldn't be that sure for btrfs or other filesystems with CoW
> > support.
>
> Or XFS with reflinks, for that matter. Reflink currently can't be
> combined with DAX, but I had a somewhat working version a few months ago.

But in cases like COW, when the block mapping changes, the filesystem must run unmap_mapping_range() before installing the new PTE so that all processes mapping this file offset actually refault and see the new mapping. So this would go through the pte_none() case. Am I missing something?

								Honza
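To make the sequence Jan describes concrete, here is a sketch of what a CoW filesystem would do when relocating a block. fs_cow_copy_block() and fs_commit_new_extent() are hypothetical placeholders for filesystem-specific steps; only unmap_mapping_range() is the real kernel API:

```c
#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical fs-specific steps; only unmap_mapping_range() is real. */
int fs_cow_copy_block(struct address_space *mapping, loff_t pos, loff_t len);
int fs_commit_new_extent(struct address_space *mapping, loff_t pos, loff_t len);

/*
 * Sketch of a CoW block relocation: after the data is copied to the
 * new block, every existing PTE for this file range must be zapped so
 * that all processes refault. The refault then finds an empty PTE and
 * goes through the pte_none() path of insert_pfn(), so the mkwrite
 * WARN_ON never sees a stale PFN.
 */
static int example_cow_remap(struct address_space *mapping, loff_t pos,
			     loff_t len)
{
	int ret = fs_cow_copy_block(mapping, pos, len);

	if (ret)
		return ret;

	/* Invalidate all mappings of this range in every address space. */
	unmap_mapping_range(mapping, pos, len, 0);

	return fs_commit_new_extent(mapping, pos, len);
}
```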
On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> But in cases like COW, when the block mapping changes, the filesystem
> must run unmap_mapping_range() before installing the new PTE so that all
> processes mapping this file offset actually refault and see the new
> mapping. So this would go through the pte_none() case. Am I missing
> something?

Yes, for DAX COW mappings we'd probably need something like this, unlike the pagecache COW handling, where only the underlying block changes but not the page.
On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > But in cases like COW, when the block mapping changes, the filesystem
> > must run unmap_mapping_range() before installing the new PTE so that
> > all processes mapping this file offset actually refault and see the
> > new mapping. So this would go through the pte_none() case. Am I
> > missing something?
>
> Yes, for DAX COW mappings we'd probably need something like this, unlike
> the pagecache COW handling, where only the underlying block changes but
> not the page.

Right. So again, nothing where the WARN_ON should trigger. That being said, I don't care about the WARN_ON too deeply, but it can help to catch DAX bugs, so if we can keep it I'd prefer to do so...

								Honza
On Tue, Jul 25, 2017 at 02:50:37PM +0200, Jan Kara wrote:
> Right. So again, nothing where the WARN_ON should trigger.

Yes. I was confused about how COW is handled.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
```diff
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..483e84c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be..b29dd42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn, pgprot_t prot)
+			pfn_t pfn, pgprot_t prot, bool mkwrite)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (!pte)
 		goto out;
 	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
+	if (!pte_none(*pte)) {
+		if (mkwrite) {
+			/*
+			 * For read faults on private mappings the PFN passed
+			 * in may not match the PFN we have mapped if the
+			 * mapped PFN is a writeable COW page. In the mkwrite
+			 * case we are creating a writable PTE for a shared
+			 * mapping and we expect the PFNs to match.
+			 */
+			if (WARN_ON_ONCE(pte_pfn(*pte) != pfn_t_to_pfn(pfn)))
+				goto out_unlock;
+			entry = *pte;
+			goto out_mkwrite;
+		} else
+			goto out_unlock;
+	}
 
 	/* Ok, finally just insert the thing.. */
 	if (pfn_t_devmap(pfn))
 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
 	else
 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+
+out_mkwrite:
+	if (mkwrite) {
+		entry = pte_mkyoung(entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	}
+
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1736,14 +1757,15 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
-	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
+			false);
 
 	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn)
+static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
@@ -1772,10 +1794,24 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		page = pfn_to_page(pfn_t_to_pfn(pfn));
 		return insert_page(vma, addr, page, pgprot);
 	}
-	return insert_pfn(vma, addr, pfn, pgprot);
+	return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
+}
+
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	return __vm_insert_mixed(vma, addr, pfn, false);
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	return __vm_insert_mixed(vma, addr, pfn, true);
+}
+EXPORT_SYMBOL(vm_insert_mixed_mkwrite);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
```
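For context, the intended caller of the new export is the DAX fault path later in this series: write faults switch to the new helper while read faults keep the existing one. A simplified sketch of that choice, assuming a DAX-style handler (example_dax_insert() is hypothetical, not the exact fs/dax.c change):

```c
#include <linux/mm.h>
#include <linux/pfn_t.h>

/*
 * Simplified sketch of how a DAX fault handler is expected to pick
 * between the two helpers: a write fault installs (or upgrades in
 * place) a dirty, writable PTE; a read fault behaves as before.
 */
static int example_dax_insert(struct vm_area_struct *vma, unsigned long vaddr,
			      pfn_t pfn, bool write)
{
	if (write)
		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);

	return vm_insert_mixed(vma, vaddr, pfn);
}
```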