Message ID | 20170724170616.25810-2-ross.zwisler@linux.intel.com (mailing list archive)
---|---
State | Not Applicable
On Mon, Jul 24, 2017 at 11:06:12AM -0600, Ross Zwisler wrote:
> @@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> 	if (!pte)
> 		goto out;
> 	retval = -EBUSY;
> -	if (!pte_none(*pte))
> -		goto out_unlock;
> +	if (!pte_none(*pte)) {
> +		if (mkwrite) {
> +			/*
> +			 * For read faults on private mappings the PFN passed
> +			 * in may not match the PFN we have mapped if the
> +			 * mapped PFN is a writeable COW page.  In the mkwrite
> +			 * case we are creating a writable PTE for a shared
> +			 * mapping and we expect the PFNs to match.
> +			 */

Can we?

I guess it's up to the filesystem whether it wants to reuse the same spot to
write data or not. I think your assumption works for ext4 and xfs. I wouldn't
be that sure for btrfs or other filesystems with CoW support.
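The busy-PTE behavior being discussed can be reduced to a tiny userspace model. This is only an illustrative sketch of the patch's control flow, not kernel code: `struct fake_pte`, `model_insert_pfn()`, and `MODEL_EBUSY` are all made-up names, and a real PTE is of course not a two-field struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of insert_pfn(): a PTE is a PFN plus a writable bit,
 * with pfn == 0 standing in for pte_none(). */

#define MODEL_EBUSY (-16)

struct fake_pte {
	unsigned long pfn;	/* 0 => empty (pte_none) */
	bool writable;
};

static int model_insert_pfn(struct fake_pte *pte, unsigned long pfn,
			    bool mkwrite)
{
	if (pte->pfn != 0) {		/* !pte_none(*pte) */
		if (!mkwrite)
			return MODEL_EBUSY;
		/* mkwrite on a populated PTE: the PFNs are expected to
		 * match -- exactly the assumption questioned above for
		 * filesystems that CoW into a new block. */
		if (pte->pfn != pfn) {
			fprintf(stderr, "WARN: pfn mismatch %lu != %lu\n",
				pte->pfn, pfn);
			return MODEL_EBUSY;
		}
		pte->writable = true;	/* upgrade in place */
		return 0;
	}
	pte->pfn = pfn;			/* empty slot: plain insert */
	pte->writable = mkwrite;
	return 0;
}
```

A CoW filesystem that hands back a different PFN for the same file offset would take the mismatch branch, which is the case the thread is probing at.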
On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> I guess it's up to the filesystem whether it wants to reuse the same spot to
> write data or not. I think your assumption works for ext4 and xfs. I wouldn't
> be that sure for btrfs or other filesystems with CoW support.

Or XFS with reflinks, for that matter. Which currently can't be combined
with DAX, but I had a somewhat working version a few months ago.

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > I guess it's up to the filesystem whether it wants to reuse the same spot
> > to write data or not. I think your assumption works for ext4 and xfs. I
> > wouldn't be that sure for btrfs or other filesystems with CoW support.
>
> Or XFS with reflinks, for that matter. Which currently can't be combined
> with DAX, but I had a somewhat working version a few months ago.

But in cases like COW, when the block mapping changes, the process must run
unmap_mapping_range() before installing the new PTE so that all processes
mapping this file offset actually refault and see the new mapping. So this
would go through the pte_none() case. Am I missing something?

								Honza
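Jan's point can be sketched in the same toy-model style: if the filesystem moves a file offset to a new block (new PFN), it must first do the equivalent of unmap_mapping_range(), so the subsequent refault finds an empty PTE and takes the pte_none() path rather than the mkwrite PFN check. All names here (`model_pte`, `model_fault`, `model_unmap_mapping_range`) are invented for illustration; this is not kernel API.

```c
#include <assert.h>

/* pfn == 0 stands in for pte_none(). */
struct model_pte { unsigned long pfn; };

static void model_unmap_mapping_range(struct model_pte *pte)
{
	pte->pfn = 0;		/* zap the PTE: every mapper must refault */
}

static int model_fault(struct model_pte *pte, unsigned long new_pfn)
{
	if (pte->pfn != 0 && pte->pfn != new_pfn)
		return -1;	/* would trip the patch's WARN_ON_ONCE */
	pte->pfn = new_pfn;	/* empty or matching: insert succeeds */
	return 0;
}
```

Installing a new PFN over a stale mapping fails in the model, while zapping first makes the refault go cleanly through the empty-PTE path, mirroring why the WARN_ON is not expected to fire.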
On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > I guess it's up to the filesystem whether it wants to reuse the same
> > > spot to write data or not. I think your assumption works for ext4 and
> > > xfs. I wouldn't be that sure for btrfs or other filesystems with CoW
> > > support.
> >
> > Or XFS with reflinks, for that matter. Which currently can't be combined
> > with DAX, but I had a somewhat working version a few months ago.
>
> But in cases like COW, when the block mapping changes, the process must run
> unmap_mapping_range() before installing the new PTE so that all processes
> mapping this file offset actually refault and see the new mapping. So this
> would go through the pte_none() case. Am I missing something?

Yes, for DAX COW mappings we'd probably need something like this, unlike
the pagecache COW handling, for which only the underlying block changes,
but not the page.
On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > > I guess it's up to the filesystem whether it wants to reuse the same
> > > > spot to write data or not. I think your assumption works for ext4
> > > > and xfs. I wouldn't be that sure for btrfs or other filesystems with
> > > > CoW support.
> > >
> > > Or XFS with reflinks, for that matter. Which currently can't be
> > > combined with DAX, but I had a somewhat working version a few months
> > > ago.
> >
> > But in cases like COW, when the block mapping changes, the process must
> > run unmap_mapping_range() before installing the new PTE so that all
> > processes mapping this file offset actually refault and see the new
> > mapping. So this would go through the pte_none() case. Am I missing
> > something?
>
> Yes, for DAX COW mappings we'd probably need something like this, unlike
> the pagecache COW handling, for which only the underlying block changes,
> but not the page.

Right. So again, nothing where the WARN_ON should trigger. That being said,
I don't care about the WARN_ON too deeply, but it can help to catch DAX
bugs, so if we can keep it I'd prefer to do so...

								Honza
On Tue, Jul 25, 2017 at 02:50:37PM +0200, Jan Kara wrote:
> On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> > On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > > > I guess it's up to the filesystem whether it wants to reuse the
> > > > > same spot to write data or not. I think your assumption works for
> > > > > ext4 and xfs. I wouldn't be that sure for btrfs or other
> > > > > filesystems with CoW support.
> > > >
> > > > Or XFS with reflinks, for that matter. Which currently can't be
> > > > combined with DAX, but I had a somewhat working version a few months
> > > > ago.
> > >
> > > But in cases like COW, when the block mapping changes, the process
> > > must run unmap_mapping_range() before installing the new PTE so that
> > > all processes mapping this file offset actually refault and see the
> > > new mapping. So this would go through the pte_none() case. Am I
> > > missing something?
> >
> > Yes, for DAX COW mappings we'd probably need something like this, unlike
> > the pagecache COW handling, for which only the underlying block changes,
> > but not the page.
>
> Right. So again, nothing where the WARN_ON should trigger.

Yes. I was confused about how COW is handled.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..483e84c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start,
 			unsigned long len);
 
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be..b29dd42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn, pgprot_t prot)
+			pfn_t pfn, pgprot_t prot, bool mkwrite)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (!pte)
 		goto out;
 	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
+	if (!pte_none(*pte)) {
+		if (mkwrite) {
+			/*
+			 * For read faults on private mappings the PFN passed
+			 * in may not match the PFN we have mapped if the
+			 * mapped PFN is a writeable COW page.  In the mkwrite
+			 * case we are creating a writable PTE for a shared
+			 * mapping and we expect the PFNs to match.
+			 */
+			if (WARN_ON_ONCE(pte_pfn(*pte) != pfn_t_to_pfn(pfn)))
+				goto out_unlock;
+			entry = *pte;
+			goto out_mkwrite;
+		} else
+			goto out_unlock;
+	}
 
 	/* Ok, finally just insert the thing.. */
 	if (pfn_t_devmap(pfn))
 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
 	else
 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+
+out_mkwrite:
+	if (mkwrite) {
+		entry = pte_mkyoung(entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	}
+
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
@@ -1736,14 +1757,15 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
-	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
+			false);
 
 	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn)
+static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
@@ -1772,10 +1794,24 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		page = pfn_to_page(pfn_t_to_pfn(pfn));
 		return insert_page(vma, addr, page, pgprot);
 	}
-	return insert_pfn(vma, addr, pfn, pgprot);
+	return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
+}
+
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	return __vm_insert_mixed(vma, addr, pfn, false);
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	return __vm_insert_mixed(vma, addr, pfn, true);
+}
+EXPORT_SYMBOL(vm_insert_mixed_mkwrite);
+
 /*
  * maps a range of physical memory into the requested pages.  the old
  * mappings are removed. any references to nonexistent pages results
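The overall shape of the patch, two exported entry points funneling into one static helper that differs only in a mkwrite flag, plus the out_mkwrite bit handling (young + dirty + writable), can be sketched as a userspace reduction. The names mirror the patch, but the `struct mpte`, the flag macros, and the `m_`-prefixed functions are invented for illustration; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define F_PRESENT 0x1u
#define F_WRITE   0x2u
#define F_DIRTY   0x4u
#define F_YOUNG   0x8u

/* A PTE reduced to a PFN plus flag bits; !F_PRESENT models pte_none(). */
struct mpte {
	unsigned long pfn;
	unsigned int flags;
};

static int m_insert_pfn(struct mpte *pte, unsigned long pfn, bool mkwrite)
{
	if (pte->flags & F_PRESENT) {
		/* Busy PTE: only the mkwrite path with a matching PFN
		 * proceeds; anything else is the -EBUSY / WARN case. */
		if (!mkwrite || pte->pfn != pfn)
			return -16;	/* models -EBUSY */
	} else {
		pte->pfn = pfn;
		pte->flags = F_PRESENT;
	}
	if (mkwrite)	/* models the out_mkwrite: label */
		pte->flags |= F_YOUNG | F_DIRTY | F_WRITE;
	return 0;
}

/* Models static __vm_insert_mixed(): one helper, one flag. */
static int m_vm_insert_mixed_common(struct mpte *pte, unsigned long pfn,
				    bool mkwrite)
{
	return m_insert_pfn(pte, pfn, mkwrite);
}

int m_vm_insert_mixed(struct mpte *pte, unsigned long pfn)
{
	return m_vm_insert_mixed_common(pte, pfn, false);
}

int m_vm_insert_mixed_mkwrite(struct mpte *pte, unsigned long pfn)
{
	return m_vm_insert_mixed_common(pte, pfn, true);
}
```

The wrapper pattern keeps the old `vm_insert_mixed()` ABI unchanged while letting the DAX write-fault path request an in-place upgrade to a writable, dirty PTE.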