
[4/6] mm: introduce page->dma_pinned_flags, _count

Message ID 20181012060014.10242-5-jhubbard@nvidia.com (mailing list archive)
State New, archived
Series RFC: gup+dma: tracking dma-pinned pages

Commit Message

john.hubbard@gmail.com Oct. 12, 2018, 6 a.m. UTC
From: John Hubbard <jhubbard@nvidia.com>

Add two struct page fields that, combined, are unioned with
struct page->lru. There is no change in the size of
struct page. These new fields are for type safety and clarity.

Also add page flag accessors to test, set and clear the new
page->dma_pinned_flags field.

The page->dma_pinned_count field will be used in upcoming
patches.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm_types.h   | 22 +++++++++++++-----
 include/linux/page-flags.h | 47 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 6 deletions(-)

Comments

Education Directorate Oct. 12, 2018, 10:56 a.m. UTC | #1
On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Add two struct page fields that, combined, are unioned with
> struct page->lru. There is no change in the size of
> struct page. These new fields are for type safety and clarity.
> 
> Also add page flag accessors to test, set and clear the new
> page->dma_pinned_flags field.
> 
> The page->dma_pinned_count field will be used in upcoming
> patches
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm_types.h   | 22 +++++++++++++-----
>  include/linux/page-flags.h | 47 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..017ab82e36ca 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,12 +78,22 @@ struct page {
>  	 */
>  	union {
>  		struct {	/* Page cache and anonymous pages */
> -			/**
> -			 * @lru: Pageout list, eg. active_list protected by
> -			 * zone_lru_lock.  Sometimes used as a generic list
> -			 * by the page owner.
> -			 */
> -			struct list_head lru;
> +			union {
> +				/**
> +				 * @lru: Pageout list, eg. active_list protected
> +				 * by zone_lru_lock.  Sometimes used as a
> +				 * generic list by the page owner.
> +				 */
> +				struct list_head lru;
> +				/* Used by get_user_pages*(). Pages may not be
> +				 * on an LRU while these dma_pinned_* fields
> +				 * are in use.
> +				 */
> +				struct {
> +					unsigned long dma_pinned_flags;
> +					atomic_t      dma_pinned_count;
> +				};
> +			};
>  			/* See page-flags.h for PAGE_MAPPING_FLAGS */
>  			struct address_space *mapping;
>  			pgoff_t index;		/* Our offset within mapping. */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74bee8cecf4c..81ed52c3caae 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -425,6 +425,53 @@ static __always_inline int __PageMovable(struct page *page)
>  				PAGE_MAPPING_MOVABLE;
>  }
>  
> +/*
> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
> + * uses these flags must NOT be on an LRU. That's partly enforced by
> + * ClearPageDmaPinned, which gives the page back to LRU.
> + *
> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
> + * of struct page), and this flag is checked without knowing whether it is a
> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
> + * rather than bit 0.
> + */
> +#define PAGE_DMA_PINNED		0x2
> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
> +

This is really subtle; additional changes to compound_head will need to coordinate
with these flags? Also, doesn't this bit need to be unique across all structs in
the union? I guess that is guaranteed by the fact that page == compound_head(page),
as per your assertion, but I've forgotten why that is true. Could you please
add some commentary on that?
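
For reference, the bit 0 encoding that makes this work comes from
compound_head(), which in this kernel is roughly (simplified):

	static inline struct page *compound_head(struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		/* Bit 0 set: this is a tail page; head - 1 is the head page. */
		if (unlikely(head & 1))
			return (struct page *)(head - 1);
		return page;
	}

page->compound_head aliases the first word of page->lru (and therefore of the
new dma_pinned_flags), which is why bit 0 has to be left alone here.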

> +/*
> + * Because these flags are read outside of a lock, ensure visibility between
> + * different threads, by using READ|WRITE_ONCE.
> + */
> +static __always_inline int PageDmaPinnedFlags(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_FLAGS) != 0;
> +}
> +
> +static __always_inline int PageDmaPinned(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
> +}
> +
> +static __always_inline void SetPageDmaPinned(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));

VM_BUG_ON(!list_empty(&page->lru))

> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
> +}
> +
> +static __always_inline void ClearPageDmaPinned(struct page *page)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
> +
> +	/* This does a WRITE_ONCE to the lru.next, which is also the
> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
> +	 * this provides visibility to other threads.
> +	 */
> +	INIT_LIST_HEAD(&page->lru);

This assumes certain things about list_head; why not use the correct
initialization bits?

> +}
> +
>  #ifdef CONFIG_KSM
>  /*
>   * A KSM page is one of those write-protected "shared pages" or "merged pages"
> -- 
> 2.19.1
>
John Hubbard Oct. 13, 2018, 12:15 a.m. UTC | #2
On 10/12/18 3:56 AM, Balbir Singh wrote:
> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
>> + * uses these flags must NOT be on an LRU. That's partly enforced by
>> + * ClearPageDmaPinned, which gives the page back to LRU.
>> + *
>> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
>> + * of struct page), and this flag is checked without knowing whether it is a
>> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
>> + * rather than bit 0.
>> + */
>> +#define PAGE_DMA_PINNED		0x2
>> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
>> +
> 
> This is really subtle, additional changes to compound_head will need to coordinate
> with these flags? Also doesn't this bit need to be unique across all structs in
> the union? I guess that is guaranteed by the fact that page == compound_head(page)
> as per your assertion, but I've forgotten why that is true. Could you please
> add some commentary on that
> 

Yes, agreed. I've rewritten and augmented that comment block, plus removed the 
PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading 
to even have it). So now it looks like this:

/*
 * Because page->dma_pinned_flags is unioned with page->lru, any page that
 * uses these flags must NOT be on an LRU. That's partly enforced by
 * ClearPageDmaPinned, which gives the page back to LRU.
 *
 * PageDmaPinned is checked without knowing whether it is a tail page or a
 * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
 * bit in the first union of struct page), and instead uses bit 1 (0x2),
 * rather than bit 0.
 *
 * PageDmaPinned can only be used if no other systems are using the same bit
 * across the first struct page union. In this regard, it is similar to
 * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
 * alone, bit 1 is also left alone so far: other union elements (ignoring tail
 * pages) put pointers there, and pointer alignment leaves the lower two bits
 * available.
 *
 * So, constraints include:
 *
 *     -- Only use PageDmaPinned on non-tail pages.
 *     -- Remove the page from any LRU list first.
 */
#define PAGE_DMA_PINNED		0x2

/*
 * Because these flags are read outside of a lock, ensure visibility between
 * different threads, by using READ|WRITE_ONCE.
 */
static __always_inline int PageDmaPinned(struct page *page)
{
	VM_BUG_ON(page != compound_head(page));
	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
}

[...]
>> +static __always_inline void SetPageDmaPinned(struct page *page)
>> +{
>> +	VM_BUG_ON(page != compound_head(page));
> 
> VM_BUG_ON(!list_empty(&page->lru))


There is only one place where we set this flag, and that is when (in patch 6/6)
transitioning from a page that might (or might not) have been
on an LRU. In that case, the calling code has already corrupted page->lru, by
writing to page->dma_pinned_count, which is unioned with page->lru:

		atomic_set(&page->dma_pinned_count, 1);
		SetPageDmaPinned(page);

...so it would be inappropriate to call a list function, such as 
list_empty(), on that field.  Let's just leave it as-is.


> 
>> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
>> +}
>> +
>> +static __always_inline void ClearPageDmaPinned(struct page *page)
>> +{
>> +	VM_BUG_ON(page != compound_head(page));
>> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
>> +
>> +	/* This does a WRITE_ONCE to the lru.next, which is also the
>> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
>> +	 * this provides visibility to other threads.
>> +	 */
>> +	INIT_LIST_HEAD(&page->lru);
> 
> This assumes certain things about list_head, why not use the correct
> initialization bits.
> 

Yes, OK, changed to:

static __always_inline void ClearPageDmaPinned(struct page *page)
{
	VM_BUG_ON(page != compound_head(page));
	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);

	/* Provide visibility to other threads: */
	WRITE_ONCE(page->dma_pinned_flags, 0);

	/*
	 * Safety precaution: restore the list head, before possibly returning
	 * the page to other subsystems.
	 */
	INIT_LIST_HEAD(&page->lru);
}
Dave Chinner Oct. 13, 2018, 3:55 a.m. UTC | #3
On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Add two struct page fields that, combined, are unioned with
> struct page->lru. There is no change in the size of
> struct page. These new fields are for type safety and clarity.
> 
> Also add page flag accessors to test, set and clear the new
> page->dma_pinned_flags field.
> 
> The page->dma_pinned_count field will be used in upcoming
> patches
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm_types.h   | 22 +++++++++++++-----
>  include/linux/page-flags.h | 47 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..017ab82e36ca 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,12 +78,22 @@ struct page {
>  	 */
>  	union {
>  		struct {	/* Page cache and anonymous pages */
> -			/**
> -			 * @lru: Pageout list, eg. active_list protected by
> -			 * zone_lru_lock.  Sometimes used as a generic list
> -			 * by the page owner.
> -			 */
> -			struct list_head lru;
> +			union {
> +				/**
> +				 * @lru: Pageout list, eg. active_list protected
> +				 * by zone_lru_lock.  Sometimes used as a
> +				 * generic list by the page owner.
> +				 */
> +				struct list_head lru;
> +				/* Used by get_user_pages*(). Pages may not be
> +				 * on an LRU while these dma_pinned_* fields
> +				 * are in use.
> +				 */
> +				struct {
> +					unsigned long dma_pinned_flags;
> +					atomic_t      dma_pinned_count;
> +				};
> +			};

Isn't this broken for mapped file-backed pages? i.e. they may be
passed as the user buffer to read/write direct IO and so the pages
passed to gup will be on the active/inactive LRUs. hence I can't see
how you can have dual use of the LRU list head like this....

What am I missing here?

Cheers,

Dave.
John Hubbard Oct. 13, 2018, 7:34 a.m. UTC | #4
On 10/12/18 8:55 PM, Dave Chinner wrote:
> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
[...]
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 5ed8f6292a53..017ab82e36ca 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -78,12 +78,22 @@ struct page {
>>  	 */
>>  	union {
>>  		struct {	/* Page cache and anonymous pages */
>> -			/**
>> -			 * @lru: Pageout list, eg. active_list protected by
>> -			 * zone_lru_lock.  Sometimes used as a generic list
>> -			 * by the page owner.
>> -			 */
>> -			struct list_head lru;
>> +			union {
>> +				/**
>> +				 * @lru: Pageout list, eg. active_list protected
>> +				 * by zone_lru_lock.  Sometimes used as a
>> +				 * generic list by the page owner.
>> +				 */
>> +				struct list_head lru;
>> +				/* Used by get_user_pages*(). Pages may not be
>> +				 * on an LRU while these dma_pinned_* fields
>> +				 * are in use.
>> +				 */
>> +				struct {
>> +					unsigned long dma_pinned_flags;
>> +					atomic_t      dma_pinned_count;
>> +				};
>> +			};
> 
> Isn't this broken for mapped file-backed pages? i.e. they may be
> passed as the user buffer to read/write direct IO and so the pages
> passed to gup will be on the active/inactive LRUs. hence I can't see
> how you can have dual use of the LRU list head like this....
> 
> What am I missing here?

Hi Dave,

In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
unceremoniously rips the pages out of the LRU, as a prerequisite to using
either of the page->dma_pinned_* fields. 

The idea is that LRU is not especially useful for this situation anyway,
so we'll just make it one or the other: either a page is dma-pinned, and
just hanging out doing RDMA most likely (and LRU is less meaningful during that
time), or it's possibly on an LRU list.
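
The transition itself is roughly the following (a simplified sketch, not the
literal patch 6/6 code; reference counting, locking and failure handling are
omitted):

	/* At the end of get_user_pages(), for each page (sketch only): */
	isolate_lru_page(page);			/* take the page off its LRU list */
	atomic_set(&page->dma_pinned_count, 1);	/* page->lru is unusable from here */
	SetPageDmaPinned(page);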
Christoph Hellwig Oct. 13, 2018, 4:47 p.m. UTC | #5
On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> either of the page->dma_pinned_* fields. 
> 
> The idea is that LRU is not especially useful for this situation anyway,
> so we'll just make it one or the other: either a page is dma-pinned, and
> just hanging out doing RDMA most likely (and LRU is less meaningful during that
> time), or it's possibly on an LRU list.

Have you done any benchmarking what this does to direct I/O performance,
especially for small I/O directly to a (fast) block device?
John Hubbard Oct. 13, 2018, 9:19 p.m. UTC | #6
On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>> either of the page->dma_pinned_* fields. 
>>
>> The idea is that LRU is not especially useful for this situation anyway,
>> so we'll just make it one or the other: either a page is dma-pinned, and
>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>> time), or it's possibly on an LRU list.
> 
> Have you done any benchmarking what this does to direct I/O performance,
> especially for small I/O directly to a (fast) block device?

Not yet. I can go do that now. If you have any particular test suites, benchmarks,
or just programs to recommend, please let me know. So far, I see

   tools/testing/selftests/vm/gup_benchmark.c
Dave Chinner Oct. 13, 2018, 11:01 p.m. UTC | #7
On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> On 10/12/18 8:55 PM, Dave Chinner wrote:
> > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> >> From: John Hubbard <jhubbard@nvidia.com>
> [...]
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 5ed8f6292a53..017ab82e36ca 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -78,12 +78,22 @@ struct page {
> >>  	 */
> >>  	union {
> >>  		struct {	/* Page cache and anonymous pages */
> >> -			/**
> >> -			 * @lru: Pageout list, eg. active_list protected by
> >> -			 * zone_lru_lock.  Sometimes used as a generic list
> >> -			 * by the page owner.
> >> -			 */
> >> -			struct list_head lru;
> >> +			union {
> >> +				/**
> >> +				 * @lru: Pageout list, eg. active_list protected
> >> +				 * by zone_lru_lock.  Sometimes used as a
> >> +				 * generic list by the page owner.
> >> +				 */
> >> +				struct list_head lru;
> >> +				/* Used by get_user_pages*(). Pages may not be
> >> +				 * on an LRU while these dma_pinned_* fields
> >> +				 * are in use.
> >> +				 */
> >> +				struct {
> >> +					unsigned long dma_pinned_flags;
> >> +					atomic_t      dma_pinned_count;
> >> +				};
> >> +			};
> > 
> > Isn't this broken for mapped file-backed pages? i.e. they may be
> > passed as the user buffer to read/write direct IO and so the pages
> > passed to gup will be on the active/inactive LRUs. hence I can't see
> > how you can have dual use of the LRU list head like this....
> > 
> > What am I missing here?
> 
> Hi Dave,
> 
> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> either of the page->dma_pinned_* fields. 

How is that safe? If you've ripped the page out of the LRU, it's no
longer being tracked by the page cache aging and reclaim algorithms.
Patch 6 doesn't appear to put these pages back in the LRU, either,
so it looks to me like this just dumps them on the ground after the
gup reference is dropped.  How do we reclaim these page cache pages
when there is memory pressure if they aren't in the LRU?

> The idea is that LRU is not especially useful for this situation anyway,
> so we'll just make it one or the other: either a page is dma-pinned, and
> just hanging out doing RDMA most likely (and LRU is less meaningful during that
> time), or it's possibly on an LRU list.

gup isn't just used for RDMA. It's used by direct IO in far, far
more situations and machines than RDMA is. Please explain why
ripping pages out of the LRU and not putting them back is safe, has
no side effects, doesn't adversely impact page cache reclaim, etc.
Indeed, I'd love to see a description of all the page references and
where they come and go so we know the changes aren't just leaking
these pages until the filesystem invalidates them at unmount.

Maybe I'm not seeing why this is safe yet, but seeing as you haven't
explained why it is safe then, at minimum, the patch descriptions
are incomplete.

Cheers,

Dave.
Jan Kara Oct. 16, 2018, 8:51 a.m. UTC | #8
On Sun 14-10-18 10:01:24, Dave Chinner wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> > On 10/12/18 8:55 PM, Dave Chinner wrote:
> > > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> > >> From: John Hubbard <jhubbard@nvidia.com>
> > [...]
> > >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > >> index 5ed8f6292a53..017ab82e36ca 100644
> > >> --- a/include/linux/mm_types.h
> > >> +++ b/include/linux/mm_types.h
> > >> @@ -78,12 +78,22 @@ struct page {
> > >>  	 */
> > >>  	union {
> > >>  		struct {	/* Page cache and anonymous pages */
> > >> -			/**
> > >> -			 * @lru: Pageout list, eg. active_list protected by
> > >> -			 * zone_lru_lock.  Sometimes used as a generic list
> > >> -			 * by the page owner.
> > >> -			 */
> > >> -			struct list_head lru;
> > >> +			union {
> > >> +				/**
> > >> +				 * @lru: Pageout list, eg. active_list protected
> > >> +				 * by zone_lru_lock.  Sometimes used as a
> > >> +				 * generic list by the page owner.
> > >> +				 */
> > >> +				struct list_head lru;
> > >> +				/* Used by get_user_pages*(). Pages may not be
> > >> +				 * on an LRU while these dma_pinned_* fields
> > >> +				 * are in use.
> > >> +				 */
> > >> +				struct {
> > >> +					unsigned long dma_pinned_flags;
> > >> +					atomic_t      dma_pinned_count;
> > >> +				};
> > >> +			};
> > > 
> > > Isn't this broken for mapped file-backed pages? i.e. they may be
> > > passed as the user buffer to read/write direct IO and so the pages
> > > passed to gup will be on the active/inactive LRUs. hence I can't see
> > > how you can have dual use of the LRU list head like this....
> > > 
> > > What am I missing here?
> > 
> > Hi Dave,
> > 
> > In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> > unceremoniously rips the pages out of the LRU, as a prerequisite to using
> > either of the page->dma_pinned_* fields. 
> 
> How is that safe? If you've ripped the page out of the LRU, it's no
> longer being tracked by the page cache aging and reclaim algorithms.
> Patch 6 doesn't appear to put these pages back in the LRU, either,
> so it looks to me like this just dumps them on the ground after the
> gup reference is dropped.  How do we reclaim these page cache pages
> when there is memory pressure if they aren't in the LRU?

Yeah, that's a bug in patch 6/6 (possibly in ClearPageDmaPinned). It should
return the page to the LRU from put_user_page().

								Honza
John Hubbard Oct. 17, 2018, 1:48 a.m. UTC | #9
On 10/16/18 1:51 AM, Jan Kara wrote:
> On Sun 14-10-18 10:01:24, Dave Chinner wrote:
>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>>> On 10/12/18 8:55 PM, Dave Chinner wrote:
>>>> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>>>>> From: John Hubbard <jhubbard@nvidia.com>
>>> [...]
>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>> index 5ed8f6292a53..017ab82e36ca 100644
>>>>> --- a/include/linux/mm_types.h
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -78,12 +78,22 @@ struct page {
>>>>>  	 */
>>>>>  	union {
>>>>>  		struct {	/* Page cache and anonymous pages */
>>>>> -			/**
>>>>> -			 * @lru: Pageout list, eg. active_list protected by
>>>>> -			 * zone_lru_lock.  Sometimes used as a generic list
>>>>> -			 * by the page owner.
>>>>> -			 */
>>>>> -			struct list_head lru;
>>>>> +			union {
>>>>> +				/**
>>>>> +				 * @lru: Pageout list, eg. active_list protected
>>>>> +				 * by zone_lru_lock.  Sometimes used as a
>>>>> +				 * generic list by the page owner.
>>>>> +				 */
>>>>> +				struct list_head lru;
>>>>> +				/* Used by get_user_pages*(). Pages may not be
>>>>> +				 * on an LRU while these dma_pinned_* fields
>>>>> +				 * are in use.
>>>>> +				 */
>>>>> +				struct {
>>>>> +					unsigned long dma_pinned_flags;
>>>>> +					atomic_t      dma_pinned_count;
>>>>> +				};
>>>>> +			};
>>>>
>>>> Isn't this broken for mapped file-backed pages? i.e. they may be
>>>> passed as the user buffer to read/write direct IO and so the pages
>>>> passed to gup will be on the active/inactive LRUs. hence I can't see
>>>> how you can have dual use of the LRU list head like this....
>>>>
>>>> What am I missing here?
>>>
>>> Hi Dave,
>>>
>>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>>> either of the page->dma_pinned_* fields. 
>>
>> How is that safe? If you've ripped the page out of the LRU, it's no
>> longer being tracked by the page cache aging and reclaim algorithms.
>> Patch 6 doesn't appear to put these pages back in the LRU, either,
>> so it looks to me like this just dumps them on the ground after the
>> gup reference is dropped.  How do we reclaim these page cache pages
>> when there is memory pressure if they aren't in the LRU?
> 
> Yeah, that's a bug in patch 6/6 (possibly in ClearPageDmaPinned). It should
> return the page to the LRU from put_user_page().
> 

Yes. Ugh, the LRU handling in this series is definitely not all there yet:
we'll probably need to track (in the page->dma_pinned_flags) which LRU (if any)
each page was taken from.

It's hard to say exactly what the active/inactive/unevictable list should
be when DMA is done and put_user_page*() is called, because we don't know
if some device read, wrote, or ignored any of those pages. Although if 
put_user_pages_dirty() is called, that's an argument for "active", at least.

And maybe this will all be pointless if the direct IO performance test that
Christoph requested shows that LRU operations are too expensive here, anyway.
I wonder if we should just limit this to 64-bit arches and find a real
page flag...well, let's see what the testing shows first I suppose.
Michal Hocko Oct. 17, 2018, 11:09 a.m. UTC | #10
On Tue 16-10-18 18:48:23, John Hubbard wrote:
[...]
> It's hard to say exactly what the active/inactive/unevictable list should
> be when DMA is done and put_user_page*() is called, because we don't know
> if some device read, wrote, or ignored any of those pages. Although if 
> put_user_pages_dirty() is called, that's an argument for "active", at least.

Any reason to not use putback_lru_page?

Please note I haven't really got through your patches to have a wider
picture of the change so this is just hint for the LRU part of the
issue.
John Hubbard Oct. 18, 2018, 12:03 a.m. UTC | #11
On 10/17/18 4:09 AM, Michal Hocko wrote:
> On Tue 16-10-18 18:48:23, John Hubbard wrote:
> [...]
>> It's hard to say exactly what the active/inactive/unevictable list should
>> be when DMA is done and put_user_page*() is called, because we don't know
>> if some device read, wrote, or ignored any of those pages. Although if 
>> put_user_pages_dirty() is called, that's an argument for "active", at least.
> 
> Any reason to not use putback_lru_page?

That does help with which LRU to use. I guess I'd still need to track whether
a page was on an LRU when get_user_pages() was called, because it seems
that that is not necessarily always the case. And putback_lru_page() definitely
wants to deal with a page that *was* previously on an LRU.
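
So the unpin side might end up as something like this rough sketch (assuming
we record at gup time whether the page came off an LRU; "was_on_lru" below is
hypothetical, not anything in the series yet, and reference counting is
elided):

	/* In put_user_page(), once dma_pinned_count drops to zero (sketch): */
	ClearPageDmaPinned(page);
	if (was_on_lru)			/* hypothetical flag recorded at gup time */
		putback_lru_page(page);	/* return the page to the appropriate LRU */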

> 
> Please note I haven't really got through your patches to have a wider
> picture of the change so this is just hint for the LRU part of the
> issue.
> 

Understood, and the hints are much appreciated.
Michal Hocko Oct. 19, 2018, 8:11 a.m. UTC | #12
On Wed 17-10-18 17:03:03, John Hubbard wrote:
> On 10/17/18 4:09 AM, Michal Hocko wrote:
> > On Tue 16-10-18 18:48:23, John Hubbard wrote:
> > [...]
> >> It's hard to say exactly what the active/inactive/unevictable list should
> >> be when DMA is done and put_user_page*() is called, because we don't know
> >> if some device read, wrote, or ignored any of those pages. Although if 
> >> put_user_pages_dirty() is called, that's an argument for "active", at least.
> > 
> > Any reason to not use putback_lru_page?
> 
> That does help with which LRU to use. I guess I'd still need to track whether
> a page was on an LRU when get_user_pages() was called, because it seems
> that that is not necessarily always the case. And putback_lru_page() definitely
> wants to deal with a page that *was* previously on an LRU.

Well, if you ever g-u-p pages which are never going to go to LRU then
sure (e.g. hugetlb pages).
Education Directorate Oct. 24, 2018, 11 a.m. UTC | #13
On Fri, Oct 12, 2018 at 05:15:51PM -0700, John Hubbard wrote:
> On 10/12/18 3:56 AM, Balbir Singh wrote:
> > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
> >> From: John Hubbard <jhubbard@nvidia.com>
> [...]
> >> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
> >> + * uses these flags must NOT be on an LRU. That's partly enforced by
> >> + * ClearPageDmaPinned, which gives the page back to LRU.
> >> + *
> >> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
> >> + * of struct page), and this flag is checked without knowing whether it is a
> >> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
> >> + * rather than bit 0.
> >> + */
> >> +#define PAGE_DMA_PINNED		0x2
> >> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
> >> +
> > 
> > This is really subtle, additional changes to compound_head will need to coordinate
> > with these flags? Also doesn't this bit need to be unique across all structs in
> > the union? I guess that is guaranteed by the fact that page == compound_head(page)
> > as per your assertion, but I've forgotten why that is true. Could you please
> > add some commentary on that
> > 
> 
> Yes, agreed. I've rewritten and augmented that comment block, plus removed the 
> PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading 
> to even have it). So now it looks like this:
> 
> /*
>  * Because page->dma_pinned_flags is unioned with page->lru, any page that
>  * uses these flags must NOT be on an LRU. That's partly enforced by
>  * ClearPageDmaPinned, which gives the page back to LRU.
>  *
>  * PageDmaPinned is checked without knowing whether it is a tail page or a
>  * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
>  * bit in the first union of struct page), and instead uses bit 1 (0x2),
>  * rather than bit 0.
>  *
>  * PageDmaPinned can only be used if no other systems are using the same bit
>  * across the first struct page union. In this regard, it is similar to
>  * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
>  * alone, bit 1 is also left alone so far: other union elements (ignoring tail
>  * pages) put pointers there, and pointer alignment leaves the lower two bits
>  * available.
>  *
>  * So, constraints include:
>  *
>  *     -- Only use PageDmaPinned on non-tail pages.
>  *     -- Remove the page from any LRU list first.
>  */
> #define PAGE_DMA_PINNED		0x2
> 
> /*
>  * Because these flags are read outside of a lock, ensure visibility between
>  * different threads, by using READ|WRITE_ONCE.
>  */
> static __always_inline int PageDmaPinned(struct page *page)
> {
> 	VM_BUG_ON(page != compound_head(page));
> 	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
> }
> 
> [...]
> >> +static __always_inline void SetPageDmaPinned(struct page *page)
> >> +{
> >> +	VM_BUG_ON(page != compound_head(page));
> > 
> > VM_BUG_ON(!list_empty(&page->lru))
> 
> 
> There is only one place where we set this flag, and that is when (in patch 6/6)
> transitioning from a page that might (or might not) have been
> on an LRU. In that case, the calling code has already corrupted page->lru, by
> writing to page->dma_pinned_count, which is unions with page->lru:
> 
> 		atomic_set(&page->dma_pinned_count, 1);
> 		SetPageDmaPinned(page);
> 
> ...so it would be inappropriate to call a list function, such as 
> list_empty(), on that field.  Let's just leave it as-is.
> 
> 
> > 
> >> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
> >> +}
> >> +
> >> +static __always_inline void ClearPageDmaPinned(struct page *page)
> >> +{
> >> +	VM_BUG_ON(page != compound_head(page));
> >> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
> >> +
> >> +	/* This does a WRITE_ONCE to the lru.next, which is also the
> >> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
> >> +	 * this provides visibility to other threads.
> >> +	 */
> >> +	INIT_LIST_HEAD(&page->lru);
> > 
> > This assumes certain things about list_head, why not use the correct
> > initialization bits.
> > 
> 
> Yes, OK, changed to:
> 
> static __always_inline void ClearPageDmaPinned(struct page *page)
> {
> 	VM_BUG_ON(page != compound_head(page));
> 	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
> 
> 	/* Provide visibility to other threads: */
> 	WRITE_ONCE(page->dma_pinned_flags, 0);
> 
> 	/*
> 	 * Safety precaution: restore the list head, before possibly returning
> 	 * the page to other subsystems.
> 	 */
> 	INIT_LIST_HEAD(&page->lru);
> }
> 
>

Sorry, I've been distracted with other things.

This looks better; do we still need the INIT_LIST_HEAD?

Balbir Singh.
 
> 
> -- 
> thanks,
> John Hubbard
> NVIDIA
John Hubbard Nov. 2, 2018, 11:27 p.m. UTC | #14
On 10/24/18 4:00 AM, Balbir Singh wrote:
> On Fri, Oct 12, 2018 at 05:15:51PM -0700, John Hubbard wrote:
>> On 10/12/18 3:56 AM, Balbir Singh wrote:
>>> On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubbard@gmail.com wrote:
>>>> From: John Hubbard <jhubbard@nvidia.com>
>> [...]
>>>> + * Because page->dma_pinned_flags is unioned with page->lru, any page that
>>>> + * uses these flags must NOT be on an LRU. That's partly enforced by
>>>> + * ClearPageDmaPinned, which gives the page back to LRU.
>>>> + *
>>>> + * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
>>>> + * of struct page), and this flag is checked without knowing whether it is a
>>>> + * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
>>>> + * rather than bit 0.
>>>> + */
>>>> +#define PAGE_DMA_PINNED		0x2
>>>> +#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
>>>> +
>>>
>>> This is really subtle, additional changes to compound_head will need to coordinate
>>> with these flags? Also doesn't this bit need to be unique across all structs in
>>> the union? I guess that is guaranteed by the fact that page == compound_head(page)
>>> as per your assertion, but I've forgotten why that is true. Could you please
>>> add some commentary on that
>>>
>>
>> Yes, agreed. I've rewritten and augmented that comment block, plus removed the 
>> PAGE_DMA_PINNED_FLAGS (there are no more bits available, so it's just misleading 
>> to even have it). So now it looks like this:
>>
>> /*
>>  * Because page->dma_pinned_flags is unioned with page->lru, any page that
>>  * uses these flags must NOT be on an LRU. That's partly enforced by
>>  * ClearPageDmaPinned, which gives the page back to LRU.
>>  *
>>  * PageDmaPinned is checked without knowing whether it is a tail page or a
>>  * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
>>  * bit in the first union of struct page), and instead uses bit 1 (0x2),
>>  * rather than bit 0.
>>  *
>>  * PageDmaPinned can only be used if no other systems are using the same bit
>>  * across the first struct page union. In this regard, it is similar to
>>  * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
>>  * alone, bit 1 is also left alone so far: other union elements (ignoring tail
>>  * pages) put pointers there, and pointer alignment leaves the lower two bits
>>  * available.
>>  *
>>  * So, constraints include:
>>  *
>>  *     -- Only use PageDmaPinned on non-tail pages.
>>  *     -- Remove the page from any LRU list first.
>>  */
>> #define PAGE_DMA_PINNED		0x2
>>
>> /*
>>  * Because these flags are read outside of a lock, ensure visibility between
>>  * different threads, by using READ|WRITE_ONCE.
>>  */
>> static __always_inline int PageDmaPinned(struct page *page)
>> {
>> 	VM_BUG_ON(page != compound_head(page));
>> 	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
>> }
>>
>> [...]
>>>> +static __always_inline void SetPageDmaPinned(struct page *page)
>>>> +{
>>>> +	VM_BUG_ON(page != compound_head(page));
>>>
>>> VM_BUG_ON(!list_empty(&page->lru))
>>
>>
>> There is only one place where we set this flag, and that is when (in patch 6/6)
>> transitioning from a page that might (or might not) have been
>> on an LRU. In that case, the calling code has already corrupted page->lru, by
>> writing to page->dma_pinned_count, which is unions with page->lru:
>>
>> 		atomic_set(&page->dma_pinned_count, 1);
>> 		SetPageDmaPinned(page);
>>
>> ...so it would be inappropriate to call a list function, such as 
>> list_empty(), on that field.  Let's just leave it as-is.
>>
>>
>>>
>>>> +	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
>>>> +}
>>>> +
>>>> +static __always_inline void ClearPageDmaPinned(struct page *page)
>>>> +{
>>>> +	VM_BUG_ON(page != compound_head(page));
>>>> +	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
>>>> +
>>>> +	/* This does a WRITE_ONCE to the lru.next, which is also the
>>>> +	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
>>>> +	 * this provides visibility to other threads.
>>>> +	 */
>>>> +	INIT_LIST_HEAD(&page->lru);
>>>
>>> This assumes certain things about list_head, why not use the correct
>>> initialization bits.
>>>
>>
>> Yes, OK, changed to:
>>
>> static __always_inline void ClearPageDmaPinned(struct page *page)
>> {
>> 	VM_BUG_ON(page != compound_head(page));
>> 	VM_BUG_ON_PAGE(!PageDmaPinned(page), page);
>>
>> 	/* Provide visibility to other threads: */
>> 	WRITE_ONCE(page->dma_pinned_flags, 0);
>>
>> 	/*
>> 	 * Safety precaution: restore the list head, before possibly returning
>> 	 * the page to other subsystems.
>> 	 */
>> 	INIT_LIST_HEAD(&page->lru);
>> }
>>
>>
> 
> Sorry, I've been distracted with other things
> 
> This looks better, do we still need the INIT_LIST_HEAD?
> 

Good point. I guess not. I was getting a little too fancy, and it's better
for ClearPageDmaPinned to be true to its name, and only do that.

(Sorry for the delayed response.)

thanks,
John Hubbard Nov. 5, 2018, 7:10 a.m. UTC | #15
On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>> either of the page->dma_pinned_* fields. 
>>
>> The idea is that LRU is not especially useful for this situation anyway,
>> so we'll just make it one or the other: either a page is dma-pinned, and
>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>> time), or it's possibly on an LRU list.
> 
> Have you done any benchmarking what this does to direct I/O performance,
> especially for small I/O directly to a (fast) block device?
> 

Hi Christoph,

I'm seeing about a 20% slowdown in one case: lots of reads and writes of size 8192 B,
on a fast NVMe device. My put_page() --> put_user_page() conversions are still
incomplete and buggy, but I've got enough of them done to briefly run the test.

One thing that occurs to me is that jumping on and off the LRU takes time, and
if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
know that leaves 32-bit out in the cold, but...maybe use this slower approach
for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything
by 20%. 

Test program is below. I hope I didn't overlook something obvious, but it's 
definitely possible, given my lack of experience with direct IO. 

I'm preparing to send an updated RFC this week that contains the feedback to date,
and also many converted call sites, so that everyone can see what the whole
(proposed) story would look like in its latest incarnation.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

#define BUF_SIZE        4096u
#define FULL_DATA_SIZE  (2 * BUF_SIZE)

void read_from_file(int fd, size_t how_much, char * buf)
{
	size_t bytes_read;

	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
		bytes_read = read(fd, buf, BUF_SIZE);
		if (bytes_read != BUF_SIZE) {
			printf("reading file failed: %m\n");
			exit(3);
		}
	}
}

void seek_to_start(int fd, char *caller)
{
	off_t result = lseek(fd, 0, SEEK_SET);
	if (result == -1) {
		printf("%s: lseek failed: %m\n", caller);
		exit(4);
	}
}

void write_to_file(int fd, size_t how_much, char * buf)
{
	int result;
	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
		result = write(fd, buf, BUF_SIZE);
		if (result < 0) {
			printf("writing file failed: %m\n");
			exit(3);
		}
	}
}

void read_and_write(int fd, size_t how_much, char * buf)
{
	seek_to_start(fd, "About to read");
	read_from_file(fd, how_much, buf);

	memset(buf, 'a', BUF_SIZE);

	seek_to_start(fd, "About to write");
	write_to_file(fd, how_much, buf);
}

int main(int argc, char *argv[])
{
	void *buf;
	/*
	 * O_DIRECT requires at least 512 B alignment, but runs faster
	 * (2.8 sec vs. 3.5 sec) with 4096 B alignment.
	 */
	unsigned align = 4096;
	if (posix_memalign(&buf, align, BUF_SIZE)) {
		printf("posix_memalign failed\n");
		return 5;
	}

	if (argc < 3) {
		printf("Usage: %s <filename> <iterations>\n", argv[0]);
		return 1;
	}
	char *filename = argv[1];
	unsigned iterations = strtoul(argv[2], 0, 0);

	/* Not using O_SYNC for now, anyway. */
	int fd = open(filename, O_DIRECT | O_RDWR);
	if (fd < 0) {
		printf("Failed to open %s: %m\n", filename);
		return 2;
	}

	printf("File: %s, data size: %u, interations: %u\n",
		       filename, FULL_DATA_SIZE, iterations);

	for (int count = 0; count < iterations; count++) {
		read_and_write(fd, FULL_DATA_SIZE, buf);
	}

	close(fd);
	return 0;
}
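
To build and run it, something like this works (the names and iteration count
are just examples; the target file has to exist already, since the open() has
no O_CREAT):

	gcc -O2 -Wall -o dio_test dio_test.c
	./dio_test /mnt/nvme/testfile 100000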


thanks,
Jan Kara Nov. 5, 2018, 9:54 a.m. UTC | #16
On Sun 04-11-18 23:10:12, John Hubbard wrote:
> On 10/13/18 9:47 AM, Christoph Hellwig wrote:
> > On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> >> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
> >> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> >> either of the page->dma_pinned_* fields. 
> >>
> >> The idea is that LRU is not especially useful for this situation anyway,
> >> so we'll just make it one or the other: either a page is dma-pinned, and
> >> just hanging out doing RDMA most likely (and LRU is less meaningful during that
> >> time), or it's possibly on an LRU list.
> > 
> > Have you done any benchmarking what this does to direct I/O performance,
> > especially for small I/O directly to a (fast) block device?
> > 
> 
> Hi Christoph,
> 
> I'm seeing about 20% slower in one case: lots of reads and writes of size 8192 B,
> on a fast NVMe device. My put_page() --> put_user_page() conversions are incomplete 
> and buggy yet, but I've got enough of them done to briefly run the test.
> 
> One thing that occurs to me is that jumping on and off the LRU takes time, and
> if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
> know that leaves 32-bit out in the cold, but...maybe use this slower approach
> for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything
> by 20%. 
> 
> Test program is below. I hope I didn't overlook something obvious, but it's 
> definitely possible, given my lack of experience with direct IO. 
> 
> I'm preparing to send an updated RFC this week, that contains the feedback to date,
> and also many converted call sites as well, so that everyone can see what the whole
> (proposed) story would look like in its latest incarnation.

Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
going to max-out NVME iops by far. Can I suggest you install fio [1] (it
has the advantage that it is pretty much standard for a test like this so
everyone knows what the test does from a glimpse) and run with it something
like the following workfile:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

And see how the numbers with and without your patches compare?
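
Running it is then just a matter of saving the above as e.g. reader.fio and
doing:

	fio reader.fio

either from a directory on the NVMe filesystem (fio will lay out its own 1g
test file there) or with an explicit filename= line added to the job file.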

								Honza

[1] https://github.com/axboe/fio


> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdbool.h>
> #include <string.h>
> 
> const static unsigned BUF_SIZE       = 4096;
> static const unsigned FULL_DATA_SIZE = 2 * BUF_SIZE;
> 
> void read_from_file(int fd, size_t how_much, char * buf)
> {
> 	size_t bytes_read;
> 
> 	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
> 		bytes_read = read(fd, buf, BUF_SIZE);
> 		if (bytes_read != BUF_SIZE) {
> 			printf("reading file failed: %m\n");
> 			exit(3);
> 		}
> 	}
> }
> 
> void seek_to_start(int fd, char *caller)
> {
> 	off_t result = lseek(fd, 0, SEEK_SET);
> 	if (result == -1) {
> 		printf("%s: lseek failed: %m\n", caller);
> 		exit(4);
> 	}
> }
> 
> void write_to_file(int fd, size_t how_much, char * buf)
> {
> 	int result;
> 	for (size_t index = 0; index < how_much; index += BUF_SIZE) {
> 		result = write(fd, buf, BUF_SIZE);
> 		if (result < 0) {
> 			printf("writing file failed: %m\n");
> 			exit(3);
> 		}
> 	}
> }
> 
> void read_and_write(int fd, size_t how_much, char * buf)
> {
> 	seek_to_start(fd, "About to read");
> 	read_from_file(fd, how_much, buf);
> 
> 	memset(buf, 'a', BUF_SIZE);
> 
> 	seek_to_start(fd, "About to write");
> 	write_to_file(fd, how_much, buf);
> }
> 
> int main(int argc, char *argv[])
> {
> 	void *buf;
> 	/*
> 	 * O_DIRECT requires at least 512 B alighnment, but runs faster
> 	 * (2.8 sec, vs. 3.5 sec) with 4096 B alignment.
> 	 */
> 	unsigned align = 4096;
> 	posix_memalign(&buf, align, BUF_SIZE );
> 
> 	if (argc < 3) {
> 		printf("Usage: %s <filename> <iterations>\n", argv[0]);
> 		return 1;
> 	}
> 	char *filename = argv[1];
> 	unsigned iterations = strtoul(argv[2], 0, 0);
> 
> 	/* Not using O_SYNC for now, anyway. */
> 	int fd = open(filename, O_DIRECT | O_RDWR);
> 	if (fd < 0) {
> 		printf("Failed to open %s: %m\n", filename);
> 		return 2;
> 	}
> 
> 	printf("File: %s, data size: %u, interations: %u\n",
> 		       filename, FULL_DATA_SIZE, iterations);
> 
> 	for (int count = 0; count < iterations; count++) {
> 		read_and_write(fd, FULL_DATA_SIZE, buf);
> 	}
> 
> 	close(fd);
> 	return 0;
> }
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
John Hubbard Nov. 6, 2018, 12:26 a.m. UTC | #17
On 11/5/18 1:54 AM, Jan Kara wrote:
> On Sun 04-11-18 23:10:12, John Hubbard wrote:
>> On 10/13/18 9:47 AM, Christoph Hellwig wrote:
>>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>>>> In patch 6/6, pin_page_for_dma(), which is called at the end of get_user_pages(),
>>>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>>>> either of the page->dma_pinned_* fields. 
>>>>
>>>> The idea is that LRU is not especially useful for this situation anyway,
>>>> so we'll just make it one or the other: either a page is dma-pinned, and
>>>> just hanging out doing RDMA most likely (and LRU is less meaningful during that
>>>> time), or it's possibly on an LRU list.
>>>
>>> Have you done any benchmarking what this does to direct I/O performance,
>>> especially for small I/O directly to a (fast) block device?
>>>
>>
>> Hi Christoph,
>>
>> I'm seeing about 20% slower in one case: lots of reads and writes of size 8192 B,
>> on a fast NVMe device. My put_page() --> put_user_page() conversions are incomplete 
>> and buggy yet, but I've got enough of them done to briefly run the test.
>>
>> One thing that occurs to me is that jumping on and off the LRU takes time, and
>> if we limited this to 64-bit platforms, maybe we could use a real page flag? I 
>> know that leaves 32-bit out in the cold, but...maybe use this slower approach
>> for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down anything
>> by 20%. 
>>
>> Test program is below. I hope I didn't overlook something obvious, but it's 
>> definitely possible, given my lack of experience with direct IO. 
>>
>> I'm preparing to send an updated RFC this week, that contains the feedback to date,
>> and also many converted call sites as well, so that everyone can see what the whole
>> (proposed) story would look like in its latest incarnation.
> 
> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> has the advantage that it is pretty much standard for a test like this so
> everyone knows what the test does from a glimpse) and run with it something
> like the following workfile:
> 
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
> 
> And see how the numbers with and without your patches compare?
> 
> 								Honza
> 
> [1] https://github.com/axboe/fio

That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
is approximately 74 MiB/s with my patch (it varies a bit, run to run),
as compared to around 85 without the patch, so still showing about a 20%
performance degradation, assuming I'm reading this correctly.

Raw data follows, using the fio options you listed above:

Baseline (without my patch):
---------------------------- 
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=87.2MiB/s,w=0KiB/s][r=22.3k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1775: Mon Nov  5 12:08:45 2018
   read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(1024MiB/11945msec)
    slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
    clat (usec): min=71, max=13093, avg=2869.40, stdev=1225.23
     lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
    clat percentiles (usec):
     |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
     | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
     | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
     | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
     | 99.99th=[12256]
   bw (  KiB/s): min=80648, max=93288, per=99.80%, avg=87608.57, stdev=3201.35, samples=23
   iops        : min=20162, max=23322, avg=21902.09, stdev=800.37, samples=23
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=88.47%, 10=11.27%, 20=0.25%
  cpu          : usr=2.68%, sys=94.68%, ctx=408, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=85.7MiB/s (89.9MB/s), 85.7MiB/s-85.7MiB/s (89.9MB/s-89.9MB/s), io=1024MiB (1074MB), run=11945-11945msec

Disk stats (read/write):
  nvme0n1: ios=260906/3, merge=0/1, ticks=14618/4, in_queue=17670, util=100.00%

Modified (with my patch):
---------------------------- 

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=74.1MiB/s,w=0KiB/s][r=18.0k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1808: Mon Nov  5 16:11:09 2018
   read: IOPS=18.3k, BW=71.4MiB/s (74.9MB/s)(1024MiB/14334msec)
    slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
    clat (usec): min=31, max=15622, avg=3443.86, stdev=1431.27
     lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
    clat percentiles (usec):
     |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
     | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
     | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
     | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
     | 99.99th=[14484]
   bw (  KiB/s): min=63142, max=77464, per=99.97%, avg=73128.46, stdev=3383.81, samples=28
   iops        : min=15785, max=19366, avg=18281.96, stdev=845.95, samples=28
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=84.77%, 10=14.12%, 20=1.09%
  cpu          : usr=2.20%, sys=95.72%, ctx=360, majf=0, minf=72
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=71.4MiB/s (74.9MB/s), 71.4MiB/s-71.4MiB/s (74.9MB/s-74.9MB/s), io=1024MiB (1074MB), run=14334-14334msec

Disk stats (read/write):
  nvme0n1: ios=258235/3, merge=0/1, ticks=12583/10, in_queue=14779, util=100.00%



thanks,
Dave Chinner Nov. 6, 2018, 2:47 a.m. UTC | #18
On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> On 11/5/18 1:54 AM, Jan Kara wrote:
> > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > has the advantage that it is pretty much standard for a test like this so
> > everyone knows what the test does from a glimpse) and run with it something
> > like the following workfile:
> > 
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> > 
> > And see how the numbers with and without your patches compare?
> > 
> > 								Honza
> > 
> > [1] https://github.com/axboe/fio
> 
> That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
> is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> as compared to around 85 without the patch, so still showing about a 20%
> performance degradation, assuming I'm reading this correctly.
> 
> Raw data follows, using the fio options you listed above:
> 
> Baseline (without my patch):
> ---------------------------- 
....
>      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>     clat percentiles (usec):
>      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>      | 99.99th=[12256]
.....
> Modified (with my patch):
> ---------------------------- 
.....
>      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>     clat percentiles (usec):
>      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>      | 99.99th=[14484]

So it's adding at least 500us of completion latency to every IO?
I'd argue that the IO latency impact is far worse than the 20%
throughput drop.

i.e. You can make up for throughput drops by running a deeper
queue/more dispatch threads, but you can't reduce IO latency at
all...

Cheers,

Dave.
Jan Kara Nov. 6, 2018, 11 a.m. UTC | #19
On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > has the advantage that it is pretty much standard for a test like this so
> > > everyone knows what the test does from a glimpse) and run with it something
> > > like the following workfile:
> > > 
> > > [reader]
> > > direct=1
> > > ioengine=libaio
> > > blocksize=4096
> > > size=1g
> > > numjobs=1
> > > rw=read
> > > iodepth=64
> > > 
> > > And see how the numbers with and without your patches compare?
> > > 
> > > 								Honza
> > > 
> > > [1] https://github.com/axboe/fio
> > 
> > That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
> > is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> > as compared to around 85 without the patch, so still showing about a 20%
> > performance degradation, assuming I'm reading this correctly.
> > 
> > Raw data follows, using the fio options you listed above:
> > 
> > Baseline (without my patch):
> > ---------------------------- 
> ....
> >      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> >     clat percentiles (usec):
> >      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> >      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> >      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> >      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> >      | 99.99th=[12256]
> .....
> > Modified (with my patch):
> > ---------------------------- 
> .....
> >      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> >     clat percentiles (usec):
> >      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> >      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> >      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> >      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> >      | 99.99th=[14484]
> 
> So it's adding at least 500us of completion latency to every IO?
> I'd argue that the IO latency impact is far worse than the a 20%
> throughput drop.

Hum, right. So for each IO we have to remove the page from LRU on submit
and then put it back on IO completion (which is going to race with new
submits so LRU lock contention might be an issue). Spending 500 us on that
is not unthinkable when the lock is contended but it is more expensive than
I'd have thought. John, could you perhaps profile where the time is spent?
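
(For readers following along: here is a minimal sketch of the submit/complete
sequence being discussed, using the PageDmaPinned accessors and dma_pinned_count
field from this patch. The function names and the exact isolation/refcount
handling below are illustrative assumptions, not the code from the series.)

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/swap.h>	/* isolate_lru_page(), putback_lru_page() */

/* Sketch: pin a page for DMA by taking it off its LRU, so that page->lru
 * can be reused for the dma_pinned_* fields. Races between concurrent
 * pinners are ignored here for brevity. */
static int pin_page_for_dma_sketch(struct page *page)
{
	int ret;

	page = compound_head(page);

	if (PageDmaPinned(page)) {
		atomic_inc(&page->dma_pinned_count);
		return 0;
	}

	/* isolate_lru_page() takes the node's LRU spinlock. */
	ret = isolate_lru_page(page);
	if (ret)
		return ret;

	SetPageDmaPinned(page);
	atomic_set(&page->dma_pinned_count, 1);
	return 0;
}

/* Sketch: on the last put, give the page back to the LRU. */
static void put_user_page_sketch(struct page *page)
{
	page = compound_head(page);

	if (atomic_dec_and_test(&page->dma_pinned_count)) {
		ClearPageDmaPinned(page);	/* re-initializes page->lru */
		putback_lru_page(page);		/* gives the page back to the LRU */
	}
	put_page(page);
}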

								Honza
Dave Chinner Nov. 6, 2018, 8:41 p.m. UTC | #20
On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> > On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > > has the advantage that it is pretty much standard for a test like this so
> > > > everyone knows what the test does from a glimpse) and run with it something
> > > > like the following workfile:
> > > > 
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > > 
> > > > And see how the numbers with and without your patches compare?
> > > > 
> > > > 								Honza
> > > > 
> > > > [1] https://github.com/axboe/fio
> > > 
> > > That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
> > > is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> > > as compared to around 85 without the patch, so still showing about a 20%
> > > performance degradation, assuming I'm reading this correctly.
> > > 
> > > Raw data follows, using the fio options you listed above:
> > > 
> > > Baseline (without my patch):
> > > ---------------------------- 
> > ....
> > >      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> > >     clat percentiles (usec):
> > >      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> > >      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> > >      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> > >      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> > >      | 99.99th=[12256]
> > .....
> > > Modified (with my patch):
> > > ---------------------------- 
> > .....
> > >      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> > >     clat percentiles (usec):
> > >      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> > >      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> > >      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> > >      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> > >      | 99.99th=[14484]
> > 
> > So it's adding at least 500us of completion latency to every IO?
> > I'd argue that the IO latency impact is far worse than the 20%
> > throughput drop.
> 
> Hum, right. So for each IO we have to remove the page from LRU on submit

Which costs us less than 10us on average:

	slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
vs
	slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
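
(Worked out from the averages above:)

\[ 52.59\,\mu\mathrm{s} - 44.17\,\mu\mathrm{s} \approx 8.4\,\mu\mathrm{s} \]

of extra submission latency per IO, consistent with the "less than 10us" figure.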

> and then put it back on IO completion (which is going to race with new
> submits so LRU lock contention might be an issue).

Removal has to take the same LRU lock, so I don't think contention
is the problem here. More likely the overhead is in selecting the
LRU to put it on, e.g. list_lru_from_kmem(), which may well be doing
a memcg lookup.
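
(To illustrate the kind of lookup being referred to, here is a rough sketch of
the "which LRU does this page go back on?" step, assuming per-memcg LRUs. The
function name is made up, and the real putback path batches this work via
pagevecs, so this is not the exact mainline code.)

#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/memcontrol.h>

/* Sketch: resolve page -> node -> memcg lruvec, then re-add to the list. */
static void putback_one_page_sketch(struct page *page)
{
	struct pglist_data *pgdat = page_pgdat(page);
	struct lruvec *lruvec;

	spin_lock_irq(&pgdat->lru_lock);
	/* With memcgs enabled, this lookup is where extra cost can hide. */
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	SetPageLRU(page);
	add_page_to_lru_list(page, lruvec, page_lru(page));
	spin_unlock_irq(&pgdat->lru_lock);
	put_page(page);	/* drop the reference taken when the page was isolated */
}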

> Spending 500 us on that
> is not unthinkable when the lock is contended but it is more expensive than
> I'd have thought. John, could you perhaps profile where the time is spent?

That'll tell us for sure :)

Cheers,

Dave.
John Hubbard Nov. 7, 2018, 6:36 a.m. UTC | #21
On 11/6/18 12:41 PM, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
>> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
>>> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
>>>> On 11/5/18 1:54 AM, Jan Kara wrote:
>>>>> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
>>>>> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
>>>>> has the advantage that it is pretty much standard for a test like this so
>>>>> everyone knows what the test does from a glimpse) and run with it something
>>>>> like the following workfile:
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>>
>>>>> And see how the numbers with and without your patches compare?
>>>>>
>>>>> 								Honza
>>>>>
>>>>> [1] https://github.com/axboe/fio
>>>>
>>>> That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
>>>> is approximately 74 MiB/s with my patch (it varies a bit, run to run),
>>>> as compared to around 85 without the patch, so still showing about a 20%
>>>> performance degradation, assuming I'm reading this correctly.
>>>>
>>>> Raw data follows, using the fio options you listed above:
>>>>
>>>> Baseline (without my patch):
>>>> ---------------------------- 
>>> ....
>>>>      lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>>>>      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>>>>      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>>>>      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>>>>      | 99.99th=[12256]
>>> .....
>>>> Modified (with my patch):
>>>> ---------------------------- 
>>> .....
>>>>      lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>>>>      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>>>>      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>>>>      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>>>>      | 99.99th=[14484]
>>>
>>> So it's adding at least 500us of completion latency to every IO?
>>> I'd argue that the IO latency impact is far worse than the 20%
>>> throughput drop.
>>
>> Hum, right. So for each IO we have to remove the page from LRU on submit
> 
> Which costs us less than 10us on average:
> 
> 	slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
> vs
> 	slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
> 
>> and then put it back on IO completion (which is going to race with new
>> submits so LRU lock contention might be an issue).
> 
> Removal has to take the same LRU lock, so I don't think contention
> is the problem here. More likely the overhead is in selecting the
> LRU to put it on, e.g. list_lru_from_kmem(), which may well be doing
> a memcg lookup.
> 
>> Spending 500 us on that
>> is not unthinkable when the lock is contended but it is more expensive than
>> I'd have thought. John, could you perhaps profile where the time is spent?
> 

OK, some updates on that:

1. First of all, I fixed a direct-io-related call site (it was still calling put_page
instead of put_user_page), and that not only got rid of a problem, it also changed the
performance: the impact of the patch is now a bit smaller. (Sorry about that!
I was hoping that I could get away with temporarily ignoring that failure, but no.)
The bandwidth numbers in particular now look much closer to each other. The shape of
that call-site change is sketched just after the raw fio data below.

2. Second, note that these fio results are noisy. The standard deviation is large enough
that some of this could be noise. To highlight that, I did 5 runs each with, and without,
the patch; while there is definitely a performance drop on average, there is also overlap
in the results. In other words, the best "with patch" run is about the same as the worst
"without patch" run.

3. Finally, initial profiling shows that we're adding about 1% total to this
particular test run...I think. I may have to narrow this down some more, but I don't
seem to see any real lock contention. Hints or ideas on measurement methods are
welcome, btw.

    -- 0.59% in put_user_page
    -- 0.2% (or 0.54%, depending on how you read the perf output below) in
       get_user_pages_fast:


          1.36%--iov_iter_get_pages
                    |
                     --1.27%--get_user_pages_fast
                               |
                                --0.54%--pin_page_for_dma

          0.59%--put_user_page

          1.34%     0.03%  fio   [kernel.vmlinux]     [k] _raw_spin_lock
          0.95%     0.55%  fio   [kernel.vmlinux]     [k] do_raw_spin_lock
          0.17%     0.03%  fio   [kernel.vmlinux]     [k] isolate_lru_page
          0.06%     0.00%  fio   [kernel.vmlinux]     [k] putback_lru_page

4. Here's some raw fio data: one run without the patch, and two with the patch:

------------------------------------------------------
WITHOUT the patch:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
   read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
    slat (usec): min=25, max=4870, avg=68.19, stdev=85.21
    clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03
     lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65
    clat percentiles (usec):
     |  1.00th=[ 3687],  5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752],
     | 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785],
     | 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171],
     | 99.99th=[18220]
   bw (  KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37
   iops        : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08%
  cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec

Disk stats (read/write):
  nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00%

------------------------------------------------------
With the patch applied:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov  6 20:03:50 2018
   read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec)
    slat (usec): min=33, max=4365, avg=74.05, stdev=85.79
    clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79
     lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23
    clat percentiles (usec):
     |  1.00th=[ 4047],  5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080],
     | 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146],
     | 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717],
     | 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220],
     | 99.99th=[19006]
   bw (  KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40
   iops        : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53%
  cpu          : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec

Disk stats (read/write):
  nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00%

------------------------------------------------------
OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
as the "without" case:
------------------------------------------------------

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
   read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
    slat (usec): min=30, max=12458, avg=69.71, stdev=88.01
    clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29
     lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06
    clat percentiles (usec):
     |  1.00th=[ 3884],  5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916],
     | 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949],
     | 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848],
     | 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695],
     | 99.99th=[24249]
   bw (  KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38
   iops        : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02%
  cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec

Disk stats (read/write):
  nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00%
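
(As mentioned in point 1 above, here is the shape of the call-site conversion
that was missing: pages obtained via get_user_pages*() have to be released with
put_user_page() instead of put_page(). This is only a sketch of a bio-completion
style loop; dio_release_pages_sketch() is a made-up name, and the actual hunk in
the series may differ.)

#include <linux/bio.h>
#include <linux/mm.h>

/* Sketch: release gup-pinned pages attached to a completed READ bio. */
static void dio_release_pages_sketch(struct bio *bio, bool should_dirty)
{
	struct bio_vec *bvec;
	int i;

	bio_for_each_segment_all(bvec, bio, i) {
		struct page *page = bvec->bv_page;

		if (should_dirty && !PageCompound(page))
			set_page_dirty_lock(page);
		put_user_page(page);	/* was: put_page(page) */
	}
}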


thanks,
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..017ab82e36ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,12 +78,22 @@  struct page {
 	 */
 	union {
 		struct {	/* Page cache and anonymous pages */
-			/**
-			 * @lru: Pageout list, eg. active_list protected by
-			 * zone_lru_lock.  Sometimes used as a generic list
-			 * by the page owner.
-			 */
-			struct list_head lru;
+			union {
+				/**
+				 * @lru: Pageout list, eg. active_list protected
+				 * by zone_lru_lock.  Sometimes used as a
+				 * generic list by the page owner.
+				 */
+				struct list_head lru;
+				/* Used by get_user_pages*(). Pages may not be
+				 * on an LRU while these dma_pinned_* fields
+				 * are in use.
+				 */
+				struct {
+					unsigned long dma_pinned_flags;
+					atomic_t      dma_pinned_count;
+				};
+			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
 			pgoff_t index;		/* Our offset within mapping. */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74bee8cecf4c..81ed52c3caae 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -425,6 +425,53 @@  static __always_inline int __PageMovable(struct page *page)
 				PAGE_MAPPING_MOVABLE;
 }
 
+/*
+ * Because page->dma_pinned_flags is unioned with page->lru, any page that
+ * uses these flags must NOT be on an LRU. That's partly enforced by
+ * ClearPageDmaPinned, which gives the page back to LRU.
+ *
+ * PageDmaPinned also corresponds to PageTail (the 0th bit in the first union
+ * of struct page), and this flag is checked without knowing whether it is a
+ * tail page or a PageDmaPinned page. Therefore, start the flags at bit 1 (0x2),
+ * rather than bit 0.
+ */
+#define PAGE_DMA_PINNED		0x2
+#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
+
+/*
+ * Because these flags are read outside of a lock, ensure visibility between
+ * different threads, by using READ|WRITE_ONCE.
+ */
+static __always_inline int PageDmaPinnedFlags(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_FLAGS) != 0;
+}
+
+static __always_inline int PageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
+}
+
+static __always_inline void SetPageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
+}
+
+static __always_inline void ClearPageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
+
+	/* This does a WRITE_ONCE to the lru.next, which is also the
+	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
+	 * this provides visibility to other threads.
+	 */
+	INIT_LIST_HEAD(&page->lru);
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"