diff mbox

[14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup

Message ID 20151002212137.GB30448@deltatee.com (mailing list archive)
State New, archived
Headers show

Commit Message

Logan Gunthorpe Oct. 2, 2015, 9:21 p.m. UTC
Hi Dan,

We've been doing some experimenting and testing with this patchset.
Specifically, we are trying to use you're ZONE_DEVICE work to enable
peer to peer PCIe transfers. This is actually working pretty well
(though we're still testing and working through some things).

However, we've found a couple of issues:

On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3d6baa7d4534..20097e7b679a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -49,12 +49,16 @@ struct page {
>  					 * updated asynchronously */
>  	union {
>  		struct address_space *mapping;	/* If low bit clear, points to
> -						 * inode address_space, or NULL.
> +						 * inode address_space, unless
> +						 * the page is in ZONE_DEVICE
> +						 * then it points to its parent
> +						 * dev_pagemap, otherwise NULL.
>  						 * If page mapped as anonymous
>  						 * memory, low bit is set, and
>  						 * it points to anon_vma object:
>  						 * see PAGE_MAPPING_ANON below.
>  						 */
> +		struct dev_pagemap *pgmap;
>  		void *s_mem;			/* slab first object */
>  	};


When you add to this union and overide the mapping value, we see bugs
in calls to set_page_dirty when it tries to dereference mapping. I believe
a change to page_mapping is required such as the patch that's at the end of
this email.


> diff --git a/mm/gup.c b/mm/gup.c
> index a798293fc648..1064e9a489a4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -98,7 +98,16 @@ retry:
>  	}
>
>  	page = vm_normal_page(vma, address, pte);
> -	if (unlikely(!page)) {
> +	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
> +		/*
> +		 * Only return device mapping pages in the FOLL_GET case since
> +		 * they are only valid while holding the pgmap reference.
> +		 */
> +		if (get_dev_pagemap(pte_pfn(pte), NULL))
> +			page = pte_page(pte);
> +		else
> +			goto no_page;
> +	} else if (unlikely(!page)) {

I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
create the pagemap (using devm_register_pagemap) then the get_user_pages code
will go into an infinite loop. I'm not really sure if this as an issue or
not but it seems a bit undesirable for a buggy driver to be able to cause this.

My thoughts are that either devm_register_pagemap needs to be done by
devm_memremap_pages so a driver cannot use one without the other,
or the GUP code needs to return EFAULT if no pagemap was registered so
it doesn't loop forever.

Thanks!

Logan



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dan Williams Oct. 2, 2015, 9:53 p.m. UTC | #1
On Fri, Oct 2, 2015 at 2:21 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Dan,
>
> We've been doing some experimenting and testing with this patchset.
> Specifically, we are trying to use you're ZONE_DEVICE work to enable
> peer to peer PCIe transfers. This is actually working pretty well
> (though we're still testing and working through some things).

Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but
the test report is welcome nonetheless.  The definition of dma_addr_t
is the device view of host memory, not necessarily the device view of
a peer device's memory range, so I expect you'll run into issues with
IOMMUs and other parts of the kernel that assume this definition.

>
> However, we've found a couple of issues:
>
> On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3d6baa7d4534..20097e7b679a 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -49,12 +49,16 @@ struct page {
>>                                        * updated asynchronously */
>>       union {
>>               struct address_space *mapping;  /* If low bit clear, points to
>> -                                              * inode address_space, or NULL.
>> +                                              * inode address_space, unless
>> +                                              * the page is in ZONE_DEVICE
>> +                                              * then it points to its parent
>> +                                              * dev_pagemap, otherwise NULL.
>>                                                * If page mapped as anonymous
>>                                                * memory, low bit is set, and
>>                                                * it points to anon_vma object:
>>                                                * see PAGE_MAPPING_ANON below.
>>                                                */
>> +             struct dev_pagemap *pgmap;
>>               void *s_mem;                    /* slab first object */
>>       };
>
>
> When you add to this union and overide the mapping value, we see bugs
> in calls to set_page_dirty when it tries to dereference mapping. I believe
> a change to page_mapping is required such as the patch that's at the end of
> this email.

Yes, this location for dev_pagemap will not work.  I've since moved it
to a union with the lru list_head since ZONE_DEVICE pages memory
should always have an elevated page count and never land on a slab
allocator lru.

>> diff --git a/mm/gup.c b/mm/gup.c
>> index a798293fc648..1064e9a489a4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -98,7 +98,16 @@ retry:
>>       }
>>
>>       page = vm_normal_page(vma, address, pte);
>> -     if (unlikely(!page)) {
>> +     if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
>> +             /*
>> +              * Only return device mapping pages in the FOLL_GET case since
>> +              * they are only valid while holding the pgmap reference.
>> +              */
>> +             if (get_dev_pagemap(pte_pfn(pte), NULL))
>> +                     page = pte_page(pte);
>> +             else
>> +                     goto no_page;
>> +     } else if (unlikely(!page)) {
>
> I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
> create the pagemap (using devm_register_pagemap) then the get_user_pages code
> will go into an infinite loop. I'm not really sure if this as an issue or
> not but it seems a bit undesirable for a buggy driver to be able to cause this.
>
> My thoughts are that either devm_register_pagemap needs to be done by
> devm_memremap_pages so a driver cannot use one without the other,
> or the GUP code needs to return EFAULT if no pagemap was registered so
> it doesn't loop forever.

Exactly, we should fail (-EFAULT)  get_user_pages() in that case since
we don't have a mechanism to pin down the mapping.  I'll track down
what's causing the loop.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Logan Gunthorpe Oct. 2, 2015, 10:14 p.m. UTC | #2
Hi Dan,

Good to know you've already addressed the struct page issue. We'll watch 
out for an updated patchset to try.


On 02/10/15 03:53 PM, Dan Williams wrote:
> Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but
> the test report is welcome nonetheless.  The definition of dma_addr_t
> is the device view of host memory, not necessarily the device view of
> a peer device's memory range, so I expect you'll run into issues with
> IOMMUs and other parts of the kernel that assume this definition.

Yeah, we've actually been doing this with a number of more "hacky" 
techniques for some time. ZONE_DEVICE just provides us with a much 
cleaner way to set this up that doesn't require patching around 
get_user_pages in various places in the kernel.

We've never had any issues with the IOMMU getting in the way (at least 
on Intel x86). My understanding always was that the IOMMU sits between a 
PCI card and main memory; it doesn't get in the way of peer-to-peer 
transfers. Though admittedly, I don't have a complete understanding of 
how the IOMMU works in the kernel. I'm just speaking from experimental 
experience. We've never actually tried this on other architectures.

Thanks,

Logan
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Logan Gunthorpe Oct. 2, 2015, 10:42 p.m. UTC | #3
On 02/10/15 03:53 PM, Dan Williams wrote:
> Yes, this location for dev_pagemap will not work.  I've since moved it
> to a union with the lru list_head since ZONE_DEVICE pages memory
> should always have an elevated page count and never land on a slab
> allocator lru.

Oh, also, I was actually hoping to make use of the lru list_head in the 
future with ZONE_DEVICE memory. One thought I had was once we have a 
PCIe device with a BAR space, we'd then need to have a way of allocating 
these buffers when user space needs them. The simple way I was thinking 
was to just use the lru list head to store lists of used and unused 
pages -- though there are probably other solutions to this that don't 
require using struct pages.

Thanks,

Logan
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dan Williams Oct. 2, 2015, 10:55 p.m. UTC | #4
On Fri, Oct 2, 2015 at 3:42 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 02/10/15 03:53 PM, Dan Williams wrote:
>>
>> Yes, this location for dev_pagemap will not work.  I've since moved it
>> to a union with the lru list_head since ZONE_DEVICE pages memory
>> should always have an elevated page count and never land on a slab
>> allocator lru.
>
>
> Oh, also, I was actually hoping to make use of the lru list_head in the
> future with ZONE_DEVICE memory. One thought I had was once we have a PCIe
> device with a BAR space, we'd then need to have a way of allocating these
> buffers when user space needs them. The simple way I was thinking was to
> just use the lru list head to store lists of used and unused pages -- though
> there are probably other solutions to this that don't require using struct
> pages.
>

The current assumption is the ZONE_DEVICE ranges are being managed by
a physical address allocator.  In the case of persistent memory this
is the block allocator of the filesystem sitting on top of a pmem
block device.  The struct page is really only there to facilitate
in-flight I/O requests.  If it weren't for complexity we'd allocate
them on demand.  So you're "unused" case should be a raw pfn and then
for the time-limited duration it is in use as a struct page it should
hold a reference against the mapping.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/util.c b/mm/util.c
index 68ff8a5..19af683 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -368,6 +368,9 @@  struct address_space *page_mapping(struct page *page)
 		return swap_address_space(entry);
 	}

+	if (unlikely(is_zone_device_page(page)))
+		return NULL;
+
 	mapping = (unsigned long)page->mapping;
 	if (mapping & PAGE_MAPPING_FLAGS)
 		return NULL;