[RFC,2/3] NFS41: send real write size in layoutget

Message ID	1344391392-1948-3-git-send-email-bergwolf@gmail.com (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-nfs-owner@vger.kernel.org>
From: Peng Tao <bergwolf@gmail.com>
To: bharrosh@panasas.com
Cc: linux-nfs@vger.kernel.org, Peng Tao <tao.peng@emc.com>
Subject: [PATCH RFC 2/3] NFS41: send real write size in layoutget
Date: Wed, 8 Aug 2012 10:03:11 +0800
Message-Id: <1344391392-1948-3-git-send-email-bergwolf@gmail.com>
In-Reply-To: <1344391392-1948-1-git-send-email-bergwolf@gmail.com>
References: <1344391392-1948-1-git-send-email-bergwolf@gmail.com>
Sender: linux-nfs-owner@vger.kernel.org
Precedence: bulk

Peng Tao Aug. 8, 2012, 2:03 a.m. UTC

From: Peng Tao <tao.peng@emc.com>

For buffered writes, scan the dirty pages to find the longest contiguous
run of dirty pages. In this case, also allow the layout driver to specify a
maximum layoutget size, which is useful to avoid busy-scanning dirty pages
in the block layout client.

For direct writes, just use dreq->bytes_left.

Signed-off-by: Peng Tao <tao.peng@emc.com>
---
 fs/nfs/direct.c   |    7 ++++++
 fs/nfs/internal.h |    1 +
 fs/nfs/pnfs.c     |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 64 insertions(+), 2 deletions(-)

Trond Myklebust Aug. 8, 2012, 6:50 p.m. UTC | #1

On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
> From: Peng Tao <tao.peng@emc.com>
>
> For buffered write, scan dirty pages to find out longest continuous
> dirty pages. In this case, also allow layout driver to specify a
> maximum layoutget size which is useful to avoid busy scanning dirty pages
> for block layout client.

This is _really_ ugly, and is a source of potential deadlocks if
multiple threads are scanning+locking pages at the same time.

Why not simplify this by dropping the complicated tests for whether or
not the pages are dirty? That gets rid of the deadlock-prone lock_page()
calls, and would enable you to just use radix_tree_next_hole().
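A minimal userspace sketch of the lock-free hole scan suggested here. It is
not kernel code: the bool array stands in for the page cache, and
model_next_hole() and model_contig_bytes() are hypothetical names that only
imitate the idea behind radix_tree_next_hole() (find the first missing index
at or after a start index, within a scan limit). A 4 KiB page size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Userspace stand-in for a page-cache lookup: present[i] is true when
 * page i is cached. Returns the first index >= start that is not
 * present, examining at most max_scan slots.
 */
static size_t model_next_hole(const bool *present, size_t npages,
			      size_t start, size_t max_scan)
{
	size_t idx = start;

	while (max_scan-- > 0 && idx < npages && present[idx])
		idx++;
	return idx;
}

/* Byte length of the cached run beginning at page start (a hint only). */
static uint64_t model_contig_bytes(const bool *present, size_t npages,
				   size_t start, size_t max_scan)
{
	size_t hole = model_next_hole(present, npages, start, max_scan);

	return (uint64_t)(hole - start) << MODEL_PAGE_SHIFT;
}
```

No page is locked and no dirty or writeback state is tested, so the result
can be stale by the time it is used; that is acceptable if the value is only
a sizing hint for layoutget.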

> For direct write, just use dreq->bytes_left.

This makes sense.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

Peng Tao Aug. 9, 2012, 2:24 a.m. UTC | #2

On Thu, Aug 9, 2012 at 2:50 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Wed, 2012-08-08 at 10:03 +0800, Peng Tao wrote:
>> From: Peng Tao <tao.peng@emc.com>
>>
>> For buffered write, scan dirty pages to find out longest continuous
>> dirty pages. In this case, also allow layout driver to specify a
>> maximum layoutget size which is useful to avoid busy scanning dirty pages
>> for block layout client.
>
> This is _really_ ugly, and is a source of potential deadlocks if
> multiple threads are scanning+locking pages at the same time.
Multiple threads won't be able to scan into the same region, as the borders
are protected by the page writeback bit.

>
> Why not simplify this by dropping the complicated tests for whether or
> not the pages are dirty? That gets rid of the deadlock-prone lock_page()
> calls, and would enable you to just use radix_tree_next_hole().
>
OK. I will cook a patch to drop the complicated page dirty/writeback
checks and just use radix_tree_next_hole.

Boaz Harrosh Aug. 12, 2012, 6:30 p.m. UTC | #3

On 08/08/2012 05:03 AM, Peng Tao wrote:

> From: Peng Tao <tao.peng@emc.com>
> 
> For buffered write, scan dirty pages to find out longest continuous
> dirty pages. In this case, also allow layout driver to specify a
> maximum layoutget size which is useful to avoid busy scanning dirty pages
> for block layout client.
> 
> For direct write, just use dreq->bytes_left.
> 
> Signed-off-by: Peng Tao <tao.peng@emc.com>
> ---
>  fs/nfs/direct.c   |    7 ++++++
>  fs/nfs/internal.h |    1 +
>  fs/nfs/pnfs.c     |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 64 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index c39f775..c1899dd 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -46,6 +46,7 @@
>  #include <linux/kref.h>
>  #include <linux/slab.h>
>  #include <linux/task_io_accounting_ops.h>
> +#include <linux/module.h>
>  
>  #include <linux/nfs_fs.h>
>  #include <linux/nfs_page.h>
> @@ -191,6 +192,12 @@ static void nfs_direct_req_release(struct nfs_direct_req *dreq)
>  	kref_put(&dreq->kref, nfs_direct_req_free);
>  }
>  
> +ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq)
> +{
> +	return dreq->bytes_left;
> +}
> +EXPORT_SYMBOL_GPL(nfs_dreq_bytes_left);
> +
>  /*
>   * Collects and returns the final error value/byte-count.
>   */
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index 31fdb03..e68d329 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -464,6 +464,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
>  {
>  	inode_dio_wait(inode);
>  }
> +extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
>  


Why is this an EXPORT_SYMBOL_GPL in the .c file? Why not just an inline
here?

>  /* nfs4proc.c */
>  extern void __nfs4_read_done_cb(struct nfs_read_data *);
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 2e00fea..e61a373 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -29,6 +29,7 @@
>  
>  #include <linux/nfs_fs.h>
>  #include <linux/nfs_page.h>
> +#include <linux/pagevec.h>
>  #include <linux/module.h>
>  #include "internal.h"
>  #include "pnfs.h"
> @@ -1172,19 +1173,72 @@ pnfs_generic_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *r
>  }
>  EXPORT_SYMBOL_GPL(pnfs_generic_pg_init_read);
>  
> +/*
> + * Return the number of contiguous bytes in dirty pages for a given inode
> + * starting at page frame idx.
> + */
> +static u64 pnfs_num_dirty_bytes(struct inode *inode, pgoff_t idx)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	pgoff_t	index;
> +	struct pagevec pvec;
> +	pgoff_t num = 1; /* self */
> +	int i, done = 0;
> +
> +	pagevec_init(&pvec, 0);
> +	idx++; /* self */
> +	while (!done) {
> +		index = idx;
> +		pagevec_lookup_tag(&pvec, mapping, &index,
> +				   PAGECACHE_TAG_DIRTY, (pgoff_t)PAGEVEC_SIZE);
> +		if (pagevec_count(&pvec) == 0)
> +			break;
> +
> +		for (i = 0; i < pagevec_count(&pvec); i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			lock_page(page);
> +			if (unlikely(page->mapping != mapping) ||
> +			    !PageDirty(page) ||
> +			    PageWriteback(page) ||
> +			    page->index != idx) {
> +				done = 1;
> +				unlock_page(page);
> +				break;
> +			}
> +			unlock_page(page);
> +			if (done)
> +				break;
> +			idx++;
> +			num++;
> +		}
> +		pagevec_release(&pvec);
> +	}
> +	return num << PAGE_CACHE_SHIFT;
> +}
> +


Same as what Trond said: radix_tree_next_hole() should be nicer. We never need
to do any locking; this is just a hint, not a transactional guarantee. A best
first-guess approximation is all we need.

Also, you might want to optimize for the most common case of a linear write from
zero. You can do this by comparing i_size / PAGE_SIZE with the number of dirty
pages; if they are the same, you know there are no holes and need not scan.
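The linear-write shortcut can be sketched in userspace as follows. This is an
illustration under assumptions, not the patch's code: nr_dirty is a
hypothetical per-inode count of dirty pages supplied by the caller, the page
size is assumed to be 4 KiB, and the page count is rounded up so that a
partial last page is included (the plain i_size / PAGE_SIZE comparison
mentioned above would round down).

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Number of pages needed to cover a file of isize bytes, rounded up. */
static uint64_t model_pages_for_size(uint64_t isize)
{
	return (isize + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/*
 * True when the dirty-page count equals the number of pages covering
 * the file, in which case the dirty range is assumed to have no holes
 * and the scan can be skipped. nr_dirty is a hypothetical counter.
 */
static bool model_no_holes(uint64_t isize, uint64_t nr_dirty)
{
	return nr_dirty != 0 && nr_dirty == model_pages_for_size(isize);
}
```

When the check succeeds, the caller could request everything from the
current offset up to i_size without walking the page cache at all.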

>  void
> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio,
> +			   struct nfs_page *req)


Nothing changed here, please don't do this

>  {
> +	u64 wb_size;
> +
>  	BUG_ON(pgio->pg_lseg != NULL);
>  
>  	if (req->wb_offset != req->wb_pgbase) {
>  		nfs_pageio_reset_write_mds(pgio);
>  		return;
>  	}
> +
> +	if (pgio->pg_dreq == NULL)
> +		wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
> +	else
> +		wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
> +
>  	pgio->pg_lseg = pnfs_update_layout(pgio->pg_inode,
>  					   req->wb_context,
>  					   req_offset(req),
> -					   req->wb_bytes,
> +					   wb_size?:req->wb_bytes,


wb_size is always set above in the if() or else. No need to check here.

>  					   IOMODE_RW,
>  					   GFP_NOFS);
>  	/* If no lseg, fall back to write through mds */



But No!

Much, much better than last time, thank you for working on this,
but it is not optimal for objects and files
(when "files" supports segments).

For blocks, yes, radix_tree_next_hole() is the perfect fit. But for
objects (and files) it is i_size_read(). The objects/files server usually
determines its topology according to file size, and it does not consume any
more resources whether or not there are holes. (The files example I am
thinking of is CEPH.)

So for objects, the wasting factor is the actual i_size being extended as a
result of layout_get, not the number of pages served. So for us the gain comes
if the client, which has much newer information about i_size, sends it on the
first layout_get, thus extending the file size only once, on the first
layout_get, and not on every layout_get.

So the small change I want is:

+enum pnfs_layout_get_strategy {
+	PLGS_USE_ISIZE,
+	PLGS_SEARCH_FIRST_HOLE,
+	PLGS_ALL_FILE,
+};

-pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req,
+			    enum pnfs_layout_get_strategy plgs)


 ...

+	if (pgio->pg_dreq == NULL) {
+		switch (plgs) {
+		case PLGS_SEARCH_FIRST_HOLE:
+			wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
+			break;
+		case PLGS_USE_ISIZE:
+			wb_size = i_size_read(pgio->pg_inode) - req_offset(req);
+			break;
+		case PLGS_ALL_FILE:
+			wb_size = NFS4_MAX_UINT64;
+			break;
+		default:
+			WARN_ON(1);
+			wb_size = PAGE_SIZE;
+			break;
+		}
+	} else {
+		wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
+	}

The call to pnfs_generic_pg_init_write is always made by the LD's .pg_init
and each LD should pass the proper pnfs_layout_get_strategy flag.
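The strategy selection can be modelled in userspace as a pure function. This
is a sketch under assumptions rather than the proposed kernel change:
model_layoutget_len() is a hypothetical name, MODEL_MAX_UINT64 stands in for
NFS4_MAX_UINT64, contig_bytes is whatever a hole scan returned, and the guard
against an offset at or beyond i_size is an addition that the proposal above
does not spell out.

```c
#include <stdint.h>

#define MODEL_MAX_UINT64 UINT64_MAX /* stands in for NFS4_MAX_UINT64 */
#define MODEL_PAGE_SIZE 4096ULL     /* assumption: 4 KiB pages */

enum pnfs_layout_get_strategy {
	PLGS_USE_ISIZE,
	PLGS_SEARCH_FIRST_HOLE,
	PLGS_ALL_FILE,
};

/* Length to request in layoutget for a buffered write at offset. */
static uint64_t model_layoutget_len(enum pnfs_layout_get_strategy plgs,
				    uint64_t isize, uint64_t offset,
				    uint64_t contig_bytes)
{
	switch (plgs) {
	case PLGS_SEARCH_FIRST_HOLE:
		return contig_bytes;
	case PLGS_USE_ISIZE:
		/* Assumed guard: avoid unsigned underflow past i_size. */
		return isize > offset ? isize - offset : MODEL_PAGE_SIZE;
	case PLGS_ALL_FILE:
		return MODEL_MAX_UINT64;
	}
	return MODEL_PAGE_SIZE; /* unknown strategy: fall back to one page */
}
```

A block-style driver would pass PLGS_SEARCH_FIRST_HOLE and an object-style
driver PLGS_USE_ISIZE; the fallback of one page mirrors the default branch in
the proposal.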

But yes, this is much more like it. Thanks again, nice progress.

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Boaz Harrosh Aug. 12, 2012, 6:40 p.m. UTC | #4

On 08/12/2012 09:30 PM, Boaz Harrosh wrote:

> So for objects the wasting factor is the actual i_size extending as a cause
> of layout_get, and not the number of pages served. So for us the gain is if
> client, that has a much newer information about i_size, sends it on first
> layout_get. Though extending file size only once on first layout_get and
> not on every layout_get.
> 

I want to clarify here. The i_size does not and must not grow as part of
a layout_get. Only a layout_commit might extend i_size.

The "file-size" I meant above is the current maximum size that can be
described by the inode's layout device-map. The device map does grow on
layout_get, both for objects and for, say, a CEPH cluster. If we
send i_size from the client, then we only need to extend the device map once
during the complete writeout (if it needs extending at all).

Thanks
Boaz

Peng Tao Aug. 13, 2012, 6:15 a.m. UTC | #5

On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> Why is this an EXPORT_SYMBOL_GPL at .c file. Why not just an inline
> here ?
>
struct nfs_direct_req is internal to direct.c. Its members cannot be
accessed from outside without an exported API.
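The accessor pattern being described can be sketched in a single userspace
file. This is an illustration only: the model_ names and the count field are
hypothetical, and the real struct nfs_direct_req has many more members. Other
files would see just the forward declaration and the accessor prototype,
which is why an inline in a shared header cannot dereference the struct.

```c
#include <sys/types.h> /* ssize_t */

/* "Header" view: other files see only the type name and the accessor. */
struct model_direct_req;
ssize_t model_dreq_bytes_left(const struct model_direct_req *dreq);

/* "direct.c" view: the full definition lives with the implementation. */
struct model_direct_req {
	ssize_t count;      /* total bytes requested (hypothetical field) */
	ssize_t bytes_left; /* bytes not yet issued */
};

/* The only way for outside code to read bytes_left. */
ssize_t model_dreq_bytes_left(const struct model_direct_req *dreq)
{
	return dreq->bytes_left;
}
```

In the kernel, the export additionally matters when the caller is built into
a different module than the one defining the function.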

> Same as what Trond said. radix_tree_next_hole() should be nicer. We need never
> do any locking this is just an hint, and not a transaction guaranty. Best first
> guess approximation is all we need.
>
> Also you might want to optimize for the most common case of a linear write from
> zero. This you can do by comparing i_size / PAGE_SIZE and the number of dirty
> pages, if they are the same you know there are no holes, and need not scan.
>
Sounds reasonable. Will do it in v2.

> But No!
>
> much much better then last time, thank you for working on this
> but it is not optimum for objects and files
> (when "files" supports segments)
>
> For blocks, Yes radix_tree_next_hole() is the perfect fit. But for
> objects (and files) it is i_size_read(). The objects/files server usually
> determines it's topology according to file-size. And it does not have any
> bigger resources because of holes or no holes. (The files example I think of
> is CEPH)
>
> So for objects the wasting factor is the actual i_size extending as a cause
> of layout_get, and not the number of pages served. So for us the gain is if
> client, that has a much newer information about i_size, sends it on first
> layout_get. Though extending file size only once on first layout_get and
> not on every layout_get.
>
> So the small change I want is:
>
> +enum pnfs_layout_get_strategy {
> +       PLGS_USE_ISIZE,
> +       PLGS_SEARCH_FIRST_HOLE,
> +       PLGS_ALL_FILE,
> +};
>
> -pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> +pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req,
> +                           enum pnfs_layout_get_strategy plgs)
>
>
>  ...
>
> +       if (pgio->pg_dreq == NULL) {
> +               switch (plgs) {
> +               case PLGS_SEARCH_FIRST_HOLE:
> +                       wb_size = pnfs_num_dirty_bytes(pgio->pg_inode, req->wb_index);
> +                       break;
> +               case PLGS_USE_ISIZE:
> +                       wb_size = i_size_read() - req_offset(req);
> +                       break;
> +               case PLGS_ALL_FILE:
> +                       wb_size = NFS4_MAX_UINT64;
> +                       break;
> +               default:
> +                       WARN_ON(1);
> +                       wb_size = PAGE_SIZE;
> +       }
> +       else
> +               wb_size = nfs_dreq_bytes_left(pgio->pg_dreq);
>
> The call to pnfs_generic_pg_init_write is always made by the LD's .pg_init
> and each LD should pass the proper pnfs_layout_get_strategy flag.
>
Will add it in v2. Thanks for reviewing.

Cheers,
Tao

Peng Tao Aug. 13, 2012, 9:44 a.m. UTC | #6

On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> So the small change I want is:
>
> +enum pnfs_layout_get_strategy {
> +       PLGS_USE_ISIZE,
> +       PLGS_SEARCH_FIRST_HOLE,
> +       PLGS_ALL_FILE,
> +};
>
Just a second thought, since each layout driver would use one
strategy, it is more reasonable to set the strategy in
pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
will do it this way.

Boaz Harrosh Aug. 13, 2012, 8:13 p.m. UTC | #7

On 08/13/2012 12:44 PM, Peng Tao wrote:

> On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>> So the small change I want is:
>>
>> +enum pnfs_layout_get_strategy {
>> +       PLGS_USE_ISIZE,
>> +       PLGS_SEARCH_FIRST_HOLE,
>> +       PLGS_ALL_FILE,
>> +};
>>
> Just a second thought, since each layout driver would use one
> strategy, it is more reasonable to set the strategy in
> pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
> will do it this way.
> 


It's fine, as you see fit. I think it's more flexible this way but
both ways will work for now.

Please note that for files, once it supports segments, it will
want to use i_size, like objects.

Thanks
Boaz

Trond Myklebust Aug. 13, 2012, 8:21 p.m. UTC | #8

On Mon, 2012-08-13 at 23:13 +0300, Boaz Harrosh wrote:
> On 08/13/2012 12:44 PM, Peng Tao wrote:
>
> > On Mon, Aug 13, 2012 at 2:30 AM, Boaz Harrosh <bharrosh@panasas.com> wrote:
> >> So the small change I want is:
> >>
> >> +enum pnfs_layout_get_strategy {
> >> +       PLGS_USE_ISIZE,
> >> +       PLGS_SEARCH_FIRST_HOLE,
> >> +       PLGS_ALL_FILE,
> >> +};
> >>
> > Just a second thought, since each layout driver would use one
> > strategy, it is more reasonable to set the strategy in
> > pnfs_curr_ld->flags instead of changing pg_init API to pass it in. I
> > will do it this way.
> >
>
>
> It's fine, as you see fit. I think it's more flexible this way but
> both ways will work for now.
>
> Please note that for files, once it will support segments, it would
> want to use i_size like objects.


Layout-specific flags in the generic code are not acceptable. If the
strategy is layout specific, then there should be no need to pass it
around the generic layers at all. Just do the right thing in the driver
(i.e. pg_init) and we're done.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
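One way to keep layout-specific policy out of the generic code is for each
layout driver's pg_init to compute its own request length and hand a plain
number to the generic helper. The following userspace sketch illustrates that
split under assumptions: struct model_pgio and every model_ function are
hypothetical and only loosely mirror the kernel's pageio descriptor and the
block and object drivers.

```c
#include <stdint.h>

struct model_pgio {
	uint64_t isize;        /* file size known to the client */
	uint64_t contig_bytes; /* cached run length from a hole scan */
	uint64_t lg_len;       /* length that will be sent in layoutget */
};

/* Generic helper: takes a length, knows nothing about layout types. */
static void model_generic_pg_init_write(struct model_pgio *pgio, uint64_t len)
{
	pgio->lg_len = len;
}

/* Block-style driver: ask for the contiguous cached run. */
static void model_bl_pg_init_write(struct model_pgio *pgio, uint64_t offset)
{
	(void)offset; /* the run length already starts at this offset */
	model_generic_pg_init_write(pgio, pgio->contig_bytes);
}

/* Object-style driver: ask for everything from offset up to i_size. */
static void model_objio_pg_init_write(struct model_pgio *pgio, uint64_t offset)
{
	uint64_t len = pgio->isize > offset ? pgio->isize - offset : 0;

	model_generic_pg_init_write(pgio, len);
}
```

With this shape the generic layer needs neither a strategy enum nor a driver
flag; the choice lives entirely in the driver's own callback.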
