[v2,2/4] mm: add file_fdatawait_range and file_write_and_wait
diff mbox

Message ID 1501500421.4663.4.camel@redhat.com
State New
Headers show

Commit Message

Jeff Layton July 31, 2017, 11:27 a.m. UTC
On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > +int file_write_and_wait(struct file *file)
> > > +{
> > > +	int err = 0, err2;
> > > +	struct address_space *mapping = file->f_mapping;
> > > +
> > > +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > +		err = filemap_fdatawrite(mapping);
> > > +		/* See comment of filemap_write_and_wait() */
> > > +		if (err != -EIO) {
> > > +			loff_t i_size = i_size_read(mapping->host);
> > > +
> > > +			if (i_size != 0)
> > > +				__filemap_fdatawait_range(mapping, 0,
> > > +							  i_size - 1);
> > > +		}
> > > +	}
> > 
> > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > range and ignore i_size. It is much easier than trying to wrap your head
> > around possible races with file operations modifying i_size.
> > 
> > 								Honza
> 
> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> as I'm leery of making subtle behavior changes in the actual writeback
> behavior. For example:
> 
> -----------------8<----------------
> static inline int __filemap_fdatawrite(struct address_space *mapping,
>         int sync_mode)
> {
>         return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> }
> 
> int filemap_fdatawrite(struct address_space *mapping)
> {
>         return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> }
> EXPORT_SYMBOL(filemap_fdatawrite);
> -----------------8<----------------
> 
> ...which then sets up the wbc with the right ranges and sync mode and
> kicks off writepages. But then, it does the i_size_read to figure out
> what range it should wait on (with the shortcut for the size == 0 case).
> 
> My assumption was that it was intentionally designed that way, but I'm
> guessing from your comments that it wasn't? If so, then we can turn
> file_write_and_wait a static inline wrapper around
> file_write_and_wait_range.

FWIW, I did a bit of archaeology in the linux-history tree and found
this patch from Marcelo in 2004. Is this optimization still helpful? If
not, then that does simplify the code a bit.

-------------------8<--------------------

[PATCH] small wait_on_page_writeback_range() optimization

filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
parameter.  This is not needed since we know the EOF from the inode.  Use
that instead.

Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
 mm/filemap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Steven Whitehouse July 31, 2017, 11:32 a.m. UTC | #1
Hi,


On 31/07/17 12:27, Jeff Layton wrote:
> On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
>> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
>>> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
>>>> +int file_write_and_wait(struct file *file)
>>>> +{
>>>> +	int err = 0, err2;
>>>> +	struct address_space *mapping = file->f_mapping;
>>>> +
>>>> +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
>>>> +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
>>>> +		err = filemap_fdatawrite(mapping);
>>>> +		/* See comment of filemap_write_and_wait() */
>>>> +		if (err != -EIO) {
>>>> +			loff_t i_size = i_size_read(mapping->host);
>>>> +
>>>> +			if (i_size != 0)
>>>> +				__filemap_fdatawait_range(mapping, 0,
>>>> +							  i_size - 1);
>>>> +		}
>>>> +	}
>>> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
>>> range and ignore i_size. It is much easier than trying to wrap your head
>>> around possible races with file operations modifying i_size.
>>>
>>> 								Honza
>> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
>> as I'm leery of making subtle behavior changes in the actual writeback
>> behavior. For example:
>>
>> -----------------8<----------------
>> static inline int __filemap_fdatawrite(struct address_space *mapping,
>>          int sync_mode)
>> {
>>          return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
>> }
>>
>> int filemap_fdatawrite(struct address_space *mapping)
>> {
>>          return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
>> }
>> EXPORT_SYMBOL(filemap_fdatawrite);
>> -----------------8<----------------
>>
>> ...which then sets up the wbc with the right ranges and sync mode and
>> kicks off writepages. But then, it does the i_size_read to figure out
>> what range it should wait on (with the shortcut for the size == 0 case).
>>
>> My assumption was that it was intentionally designed that way, but I'm
>> guessing from your comments that it wasn't? If so, then we can turn
>> file_write_and_wait a static inline wrapper around
>> file_write_and_wait_range.
> FWIW, I did a bit of archaeology in the linux-history tree and found
> this patch from Marcelo in 2004. Is this optimization still helpful? If
> not, then that does simplify the code a bit.
>
> -------------------8<--------------------
>
> [PATCH] small wait_on_page_writeback_range() optimization
>
> filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> parameter.  This is not needed since we know the EOF from the inode.  Use
> that instead.
>
> Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> ---
>   mm/filemap.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 78e18b7639b6..55fb7b4141e4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
>    */
>   int filemap_fdatawait(struct address_space *mapping)
>   {
> -	return wait_on_page_writeback_range(mapping, 0, -1);
> +	loff_t i_size = i_size_read(mapping->host);
> +
> +	if (i_size == 0)
> +		return 0;
> +
> +	return wait_on_page_writeback_range(mapping, 0,
> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
>   }
>   EXPORT_SYMBOL(filemap_fdatawait);
>

Does this ever get called in cases where we would not hold fs locks? In 
that case we definitely don't want to be relying on i_size,

Steve.
Jeff Layton July 31, 2017, 11:44 a.m. UTC | #2
On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> Hi,
> 
> 
> On 31/07/17 12:27, Jeff Layton wrote:
> > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > +int file_write_and_wait(struct file *file)
> > > > > +{
> > > > > +	int err = 0, err2;
> > > > > +	struct address_space *mapping = file->f_mapping;
> > > > > +
> > > > > +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > +		err = filemap_fdatawrite(mapping);
> > > > > +		/* See comment of filemap_write_and_wait() */
> > > > > +		if (err != -EIO) {
> > > > > +			loff_t i_size = i_size_read(mapping->host);
> > > > > +
> > > > > +			if (i_size != 0)
> > > > > +				__filemap_fdatawait_range(mapping, 0,
> > > > > +							  i_size - 1);
> > > > > +		}
> > > > > +	}
> > > > 
> > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > around possible races with file operations modifying i_size.
> > > > 
> > > > 								Honza
> > > 
> > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > as I'm leery of making subtle behavior changes in the actual writeback
> > > behavior. For example:
> > > 
> > > -----------------8<----------------
> > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > >          int sync_mode)
> > > {
> > >          return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > }
> > > 
> > > int filemap_fdatawrite(struct address_space *mapping)
> > > {
> > >          return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > }
> > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > -----------------8<----------------
> > > 
> > > ...which then sets up the wbc with the right ranges and sync mode and
> > > kicks off writepages. But then, it does the i_size_read to figure out
> > > what range it should wait on (with the shortcut for the size == 0 case).
> > > 
> > > My assumption was that it was intentionally designed that way, but I'm
> > > guessing from your comments that it wasn't? If so, then we can turn
> > > file_write_and_wait a static inline wrapper around
> > > file_write_and_wait_range.
> > 
> > FWIW, I did a bit of archaeology in the linux-history tree and found
> > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > not, then that does simplify the code a bit.
> > 
> > -------------------8<--------------------
> > 
> > [PATCH] small wait_on_page_writeback_range() optimization
> > 
> > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > parameter.  This is not needed since we know the EOF from the inode.  Use
> > that instead.
> > 
> > Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> > Signed-off-by: Andrew Morton <akpm@osdl.org>
> > Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > ---
> >   mm/filemap.c | 8 +++++++-
> >   1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 78e18b7639b6..55fb7b4141e4 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> >    */
> >   int filemap_fdatawait(struct address_space *mapping)
> >   {
> > -	return wait_on_page_writeback_range(mapping, 0, -1);
> > +	loff_t i_size = i_size_read(mapping->host);
> > +
> > +	if (i_size == 0)
> > +		return 0;
> > +
> > +	return wait_on_page_writeback_range(mapping, 0,
> > +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> >   }
> >   EXPORT_SYMBOL(filemap_fdatawait);
> > 
> 
> Does this ever get called in cases where we would not hold fs locks? In 
> that case we definitely don't want to be relying on i_size,
> 
> Steve.
> 

Yes. We can initiate and wait on writeback from any context where you
can sleep, really.

We're just waiting on whole file writeback here, so I don't think
there's anything wrong. As long as the i_size was valid at some point in
time prior to waiting then you're ok.

The question I have is more whether this optimization is still useful. 

What we do now is just walk the radix tree and wait_on_page_writeback
for each page. Do we gain anything by avoiding ranges beyond the current
EOF with the pagecache infrastructure of 2017?
Steven Whitehouse July 31, 2017, 12:05 p.m. UTC | #3
Hi,


On 31/07/17 12:44, Jeff Layton wrote:
> On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
>> Hi,
>>
>>
>> On 31/07/17 12:27, Jeff Layton wrote:
>>> On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
>>>> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
>>>>> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
>>>>>> +int file_write_and_wait(struct file *file)
>>>>>> +{
>>>>>> +	int err = 0, err2;
>>>>>> +	struct address_space *mapping = file->f_mapping;
>>>>>> +
>>>>>> +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
>>>>>> +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
>>>>>> +		err = filemap_fdatawrite(mapping);
>>>>>> +		/* See comment of filemap_write_and_wait() */
>>>>>> +		if (err != -EIO) {
>>>>>> +			loff_t i_size = i_size_read(mapping->host);
>>>>>> +
>>>>>> +			if (i_size != 0)
>>>>>> +				__filemap_fdatawait_range(mapping, 0,
>>>>>> +							  i_size - 1);
>>>>>> +		}
>>>>>> +	}
>>>>> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
>>>>> range and ignore i_size. It is much easier than trying to wrap your head
>>>>> around possible races with file operations modifying i_size.
>>>>>
>>>>> 								Honza
>>>> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
>>>> as I'm leery of making subtle behavior changes in the actual writeback
>>>> behavior. For example:
>>>>
>>>> -----------------8<----------------
>>>> static inline int __filemap_fdatawrite(struct address_space *mapping,
>>>>           int sync_mode)
>>>> {
>>>>           return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
>>>> }
>>>>
>>>> int filemap_fdatawrite(struct address_space *mapping)
>>>> {
>>>>           return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
>>>> }
>>>> EXPORT_SYMBOL(filemap_fdatawrite);
>>>> -----------------8<----------------
>>>>
>>>> ...which then sets up the wbc with the right ranges and sync mode and
>>>> kicks off writepages. But then, it does the i_size_read to figure out
>>>> what range it should wait on (with the shortcut for the size == 0 case).
>>>>
>>>> My assumption was that it was intentionally designed that way, but I'm
>>>> guessing from your comments that it wasn't? If so, then we can turn
>>>> file_write_and_wait a static inline wrapper around
>>>> file_write_and_wait_range.
>>> FWIW, I did a bit of archaeology in the linux-history tree and found
>>> this patch from Marcelo in 2004. Is this optimization still helpful? If
>>> not, then that does simplify the code a bit.
>>>
>>> -------------------8<--------------------
>>>
>>> [PATCH] small wait_on_page_writeback_range() optimization
>>>
>>> filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
>>> parameter.  This is not needed since we know the EOF from the inode.  Use
>>> that instead.
>>>
>>> Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
>>> Signed-off-by: Andrew Morton <akpm@osdl.org>
>>> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>>> ---
>>>    mm/filemap.c | 8 +++++++-
>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 78e18b7639b6..55fb7b4141e4 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
>>>     */
>>>    int filemap_fdatawait(struct address_space *mapping)
>>>    {
>>> -	return wait_on_page_writeback_range(mapping, 0, -1);
>>> +	loff_t i_size = i_size_read(mapping->host);
>>> +
>>> +	if (i_size == 0)
>>> +		return 0;
>>> +
>>> +	return wait_on_page_writeback_range(mapping, 0,
>>> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
>>>    }
>>>    EXPORT_SYMBOL(filemap_fdatawait);
>>>
>> Does this ever get called in cases where we would not hold fs locks? In
>> that case we definitely don't want to be relying on i_size,
>>
>> Steve.
>>
> Yes. We can initiate and wait on writeback from any context where you
> can sleep, really.
>
> We're just waiting on whole file writeback here, so I don't think
> there's anything wrong. As long as the i_size was valid at some point in
> time prior to waiting then you're ok.
>
> The question I have is more whether this optimization is still useful.
>
> What we do now is just walk the radix tree and wait_on_page_writeback
> for each page. Do we gain anything by avoiding ranges beyond the current
> EOF with the pagecache infrastructure of 2017?
>

If this can be called from anywhere without fs locks, then i_size is not 
known. That has been a problem in the past since i_size may have changed 
on another node. We avoid that in this case due to only changing i_size 
under an exclusive lock, and also only having dirty pages when we have 
an exclusive lock. There is another case though, if the inode is a block 
device, i_size will be zero. That is the case for the address space that 
looks after rgrps for GFS2. We do (luckily!) call 
filemap_fdatawait_range() directly in that case. For "normal" inodes 
though, the address space for metadata is backed by the block device 
inode, so that looks like it might be an issue, since 
fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the 
metamapping. It might potentially be an issue in other cases too,

Steve.
Jan Kara July 31, 2017, 12:07 p.m. UTC | #4
On Mon 31-07-17 07:44:16, Jeff Layton wrote:
> On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > On 31/07/17 12:27, Jeff Layton wrote:
> > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > +int file_write_and_wait(struct file *file)
> > > > > > +{
> > > > > > +	int err = 0, err2;
> > > > > > +	struct address_space *mapping = file->f_mapping;
> > > > > > +
> > > > > > +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > +		err = filemap_fdatawrite(mapping);
> > > > > > +		/* See comment of filemap_write_and_wait() */
> > > > > > +		if (err != -EIO) {
> > > > > > +			loff_t i_size = i_size_read(mapping->host);
> > > > > > +
> > > > > > +			if (i_size != 0)
> > > > > > +				__filemap_fdatawait_range(mapping, 0,
> > > > > > +							  i_size - 1);
> > > > > > +		}
> > > > > > +	}
> > > > > 
> > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > around possible races with file operations modifying i_size.
> > > > > 
> > > > > 								Honza
> > > > 
> > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > behavior. For example:
> > > > 
> > > > -----------------8<----------------
> > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > >          int sync_mode)
> > > > {
> > > >          return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > }
> > > > 
> > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > {
> > > >          return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > }
> > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > -----------------8<----------------
> > > > 
> > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > 
> > > > My assumption was that it was intentionally designed that way, but I'm
> > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > file_write_and_wait a static inline wrapper around
> > > > file_write_and_wait_range.
> > > 
> > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > not, then that does simplify the code a bit.
> > > 
> > > -------------------8<--------------------
> > > 
> > > [PATCH] small wait_on_page_writeback_range() optimization
> > > 
> > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > parameter.  This is not needed since we know the EOF from the inode.  Use
> > > that instead.
> > > 
> > > Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> > > Signed-off-by: Andrew Morton <akpm@osdl.org>
> > > Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > > ---
> > >   mm/filemap.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 78e18b7639b6..55fb7b4141e4 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > >    */
> > >   int filemap_fdatawait(struct address_space *mapping)
> > >   {
> > > -	return wait_on_page_writeback_range(mapping, 0, -1);
> > > +	loff_t i_size = i_size_read(mapping->host);
> > > +
> > > +	if (i_size == 0)
> > > +		return 0;
> > > +
> > > +	return wait_on_page_writeback_range(mapping, 0,
> > > +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> > >   }
> > >   EXPORT_SYMBOL(filemap_fdatawait);
> > > 
> > 
> > Does this ever get called in cases where we would not hold fs locks? In 
> > that case we definitely don't want to be relying on i_size,
> > 
> > Steve.
> > 
> 
> Yes. We can initiate and wait on writeback from any context where you
> can sleep, really.
> 
> We're just waiting on whole file writeback here, so I don't think
> there's anything wrong. As long as the i_size was valid at some point in
> time prior to waiting then you're ok.
> 
> The question I have is more whether this optimization is still useful. 
> 
> What we do now is just walk the radix tree and wait_on_page_writeback
> for each page. Do we gain anything by avoiding ranges beyond the current
> EOF with the pagecache infrastructure of 2017?

FWIW I'm not aware of any significant benefit of using i_size in
filemap_fdatawait() - we iterate to the end of the radix tree node anyway
since pagevec_lookup_tag() does not support range searches anyway (I'm
working on fixing that however even after that the benefit would be still
rather marginal).

What Marcello might have meant even back in 2004 was that if we are in the
middle of truncate, i_size is already reduced but page cache not truncated
yet, then filemap_fdatawait() does not have to wait for writeback of
truncated pages. That might be a noticeable benefit even today if such race
happens however I'm not sure it's worth optimizing for and surprises
arising from randomly snapshotting i_size (which especially for clustered
filesystems may be out of date) IMHO overweight the possible advantage.

								Honza
Jeff Layton July 31, 2017, 12:22 p.m. UTC | #5
On Mon, 2017-07-31 at 13:05 +0100, Steven Whitehouse wrote:
> Hi,
> 
> 
> On 31/07/17 12:44, Jeff Layton wrote:
> > On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > 
> > > On 31/07/17 12:27, Jeff Layton wrote:
> > > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > > +int file_write_and_wait(struct file *file)
> > > > > > > +{
> > > > > > > +	int err = 0, err2;
> > > > > > > +	struct address_space *mapping = file->f_mapping;
> > > > > > > +
> > > > > > > +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > > +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > > +		err = filemap_fdatawrite(mapping);
> > > > > > > +		/* See comment of filemap_write_and_wait() */
> > > > > > > +		if (err != -EIO) {
> > > > > > > +			loff_t i_size = i_size_read(mapping->host);
> > > > > > > +
> > > > > > > +			if (i_size != 0)
> > > > > > > +				__filemap_fdatawait_range(mapping, 0,
> > > > > > > +							  i_size - 1);
> > > > > > > +		}
> > > > > > > +	}
> > > > > > 
> > > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > > around possible races with file operations modifying i_size.
> > > > > > 
> > > > > > 								Honza
> > > > > 
> > > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > > behavior. For example:
> > > > > 
> > > > > -----------------8<----------------
> > > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > >           int sync_mode)
> > > > > {
> > > > >           return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > > }
> > > > > 
> > > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > > {
> > > > >           return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > > }
> > > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > > -----------------8<----------------
> > > > > 
> > > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > > 
> > > > > My assumption was that it was intentionally designed that way, but I'm
> > > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > > file_write_and_wait a static inline wrapper around
> > > > > file_write_and_wait_range.
> > > > 
> > > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > > not, then that does simplify the code a bit.
> > > > 
> > > > -------------------8<--------------------
> > > > 
> > > > [PATCH] small wait_on_page_writeback_range() optimization
> > > > 
> > > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > > parameter.  This is not needed since we know the EOF from the inode.  Use
> > > > that instead.
> > > > 
> > > > Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> > > > Signed-off-by: Andrew Morton <akpm@osdl.org>
> > > > Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > > > ---
> > > >    mm/filemap.c | 8 +++++++-
> > > >    1 file changed, 7 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 78e18b7639b6..55fb7b4141e4 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > >     */
> > > >    int filemap_fdatawait(struct address_space *mapping)
> > > >    {
> > > > -	return wait_on_page_writeback_range(mapping, 0, -1);
> > > > +	loff_t i_size = i_size_read(mapping->host);
> > > > +
> > > > +	if (i_size == 0)
> > > > +		return 0;
> > > > +
> > > > +	return wait_on_page_writeback_range(mapping, 0,
> > > > +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> > > >    }
> > > >    EXPORT_SYMBOL(filemap_fdatawait);
> > > > 
> > > 
> > > Does this ever get called in cases where we would not hold fs locks? In
> > > that case we definitely don't want to be relying on i_size,
> > > 
> > > Steve.
> > > 
> > 
> > Yes. We can initiate and wait on writeback from any context where you
> > can sleep, really.
> > 
> > We're just waiting on whole file writeback here, so I don't think
> > there's anything wrong. As long as the i_size was valid at some point in
> > time prior to waiting then you're ok.
> > 
> > The question I have is more whether this optimization is still useful.
> > 
> > What we do now is just walk the radix tree and wait_on_page_writeback
> > for each page. Do we gain anything by avoiding ranges beyond the current
> > EOF with the pagecache infrastructure of 2017?
> > 
> 
> If this can be called from anywhere without fs locks, then i_size is not 
> known. That has been a problem in the past since i_size may have changed 
> on another node. We avoid that in this case due to only changing i_size 
> under an exclusive lock, and also only having dirty pages when we have 
> an exclusive lock. There is another case though, if the inode is a block 
> device, i_size will be zero. That is the case for the address space that 
> looks after rgrps for GFS2. We do (luckily!) call 
> filemap_fdatawait_range() directly in that case. For "normal" inodes 
> though, the address space for metadata is backed by the block device 
> inode, so that looks like it might be an issue, since 
> fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the 
> metamapping. It might potentially be an issue in other cases too,
> 
> Steve.
> 

Some of those do sound problematic.

Again though, we're only waiting on writeback here, and I assume with
gfs2 that would only be pages that were written on the local node.

Is it possible to have pages under writeback and in still in the tree,
but that are beyond the current i_size? It seems like that's the main
worrisome case.
Steven Whitehouse July 31, 2017, 12:25 p.m. UTC | #6
Hi,


On 31/07/17 13:22, Jeff Layton wrote:
> On Mon, 2017-07-31 at 13:05 +0100, Steven Whitehouse wrote:
>> Hi,
>>
>>
>> On 31/07/17 12:44, Jeff Layton wrote:
>>> On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
>>>> Hi,
>>>>
>>>>
>>>> On 31/07/17 12:27, Jeff Layton wrote:
>>>>> On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
>>>>>> On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
>>>>>>> On Wed 26-07-17 13:55:36, Jeff Layton wrote:
>>>>>>>> +int file_write_and_wait(struct file *file)
>>>>>>>> +{
>>>>>>>> +	int err = 0, err2;
>>>>>>>> +	struct address_space *mapping = file->f_mapping;
>>>>>>>> +
>>>>>>>> +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
>>>>>>>> +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
>>>>>>>> +		err = filemap_fdatawrite(mapping);
>>>>>>>> +		/* See comment of filemap_write_and_wait() */
>>>>>>>> +		if (err != -EIO) {
>>>>>>>> +			loff_t i_size = i_size_read(mapping->host);
>>>>>>>> +
>>>>>>>> +			if (i_size != 0)
>>>>>>>> +				__filemap_fdatawait_range(mapping, 0,
>>>>>>>> +							  i_size - 1);
>>>>>>>> +		}
>>>>>>>> +	}
>>>>>>> Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
>>>>>>> range and ignore i_size. It is much easier than trying to wrap your head
>>>>>>> around possible races with file operations modifying i_size.
>>>>>>>
>>>>>>> 								Honza
>>>>>> I'm basically emulating _exactly_ what filemap_write_and_wait does here,
>>>>>> as I'm leery of making subtle behavior changes in the actual writeback
>>>>>> behavior. For example:
>>>>>>
>>>>>> -----------------8<----------------
>>>>>> static inline int __filemap_fdatawrite(struct address_space *mapping,
>>>>>>            int sync_mode)
>>>>>> {
>>>>>>            return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
>>>>>> }
>>>>>>
>>>>>> int filemap_fdatawrite(struct address_space *mapping)
>>>>>> {
>>>>>>            return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
>>>>>> }
>>>>>> EXPORT_SYMBOL(filemap_fdatawrite);
>>>>>> -----------------8<----------------
>>>>>>
>>>>>> ...which then sets up the wbc with the right ranges and sync mode and
>>>>>> kicks off writepages. But then, it does the i_size_read to figure out
>>>>>> what range it should wait on (with the shortcut for the size == 0 case).
>>>>>>
>>>>>> My assumption was that it was intentionally designed that way, but I'm
>>>>>> guessing from your comments that it wasn't? If so, then we can turn
>>>>>> file_write_and_wait a static inline wrapper around
>>>>>> file_write_and_wait_range.
>>>>> FWIW, I did a bit of archaeology in the linux-history tree and found
>>>>> this patch from Marcelo in 2004. Is this optimization still helpful? If
>>>>> not, then that does simplify the code a bit.
>>>>>
>>>>> -------------------8<--------------------
>>>>>
>>>>> [PATCH] small wait_on_page_writeback_range() optimization
>>>>>
>>>>> filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
>>>>> parameter.  This is not needed since we know the EOF from the inode.  Use
>>>>> that instead.
>>>>>
>>>>> Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
>>>>> Signed-off-by: Andrew Morton <akpm@osdl.org>
>>>>> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>>>>> ---
>>>>>     mm/filemap.c | 8 +++++++-
>>>>>     1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index 78e18b7639b6..55fb7b4141e4 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
>>>>>      */
>>>>>     int filemap_fdatawait(struct address_space *mapping)
>>>>>     {
>>>>> -	return wait_on_page_writeback_range(mapping, 0, -1);
>>>>> +	loff_t i_size = i_size_read(mapping->host);
>>>>> +
>>>>> +	if (i_size == 0)
>>>>> +		return 0;
>>>>> +
>>>>> +	return wait_on_page_writeback_range(mapping, 0,
>>>>> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
>>>>>     }
>>>>>     EXPORT_SYMBOL(filemap_fdatawait);
>>>>>
>>>> Does this ever get called in cases where we would not hold fs locks? In
>>>> that case we definitely don't want to be relying on i_size,
>>>>
>>>> Steve.
>>>>
>>> Yes. We can initiate and wait on writeback from any context where you
>>> can sleep, really.
>>>
>>> We're just waiting on whole file writeback here, so I don't think
>>> there's anything wrong. As long as the i_size was valid at some point in
>>> time prior to waiting then you're ok.
>>>
>>> The question I have is more whether this optimization is still useful.
>>>
>>> What we do now is just walk the radix tree and wait_on_page_writeback
>>> for each page. Do we gain anything by avoiding ranges beyond the current
>>> EOF with the pagecache infrastructure of 2017?
>>>
>> If this can be called from anywhere without fs locks, then i_size is not
>> known. That has been a problem in the past since i_size may have changed
>> on another node. We avoid that in this case due to only changing i_size
>> under an exclusive lock, and also only having dirty pages when we have
>> an exclusive lock. There is another case though, if the inode is a block
>> device, i_size will be zero. That is the case for the address space that
>> looks after rgrps for GFS2. We do (luckily!) call
>> filemap_fdatawait_range() directly in that case. For "normal" inodes
>> though, the address space for metadata is backed by the block device
>> inode, so that looks like it might be an issue, since
>> fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the
>> metamapping. It might potentially be an issue in other cases too,
>>
>> Steve.
>>
> Some of those do sound problematic.
>
> Again though, we're only waiting on writeback here, and I assume with
> gfs2 that would only be pages that were written on the local node.
Yes
>
> Is it possible to have pages under writeback and in still in the tree,
> but that are beyond the current i_size? It seems like that's the main
> worrisome case.
>
Thats what I was wondering too. I'm not 100% sure without some more 
detailed investigation. Either way the block device case also seems 
problematic, although not impossible to special case I suppose. The real 
question is what do we get from this optmisation? Is the pain of 
checking correctness worth it for the benefits gained,

Steve.
Bob Peterson July 31, 2017, 12:38 p.m. UTC | #7
----- Original Message -----
| > If this can be called from anywhere without fs locks, then i_size is not
| > known. That has been a problem in the past since i_size may have changed
| > on another node. We avoid that in this case due to only changing i_size
| > under an exclusive lock, and also only having dirty pages when we have
| > an exclusive lock. There is another case though, if the inode is a block
| > device, i_size will be zero. That is the case for the address space that
| > looks after rgrps for GFS2. We do (luckily!) call
| > filemap_fdatawait_range() directly in that case. For "normal" inodes
| > though, the address space for metadata is backed by the block device
| > inode, so that looks like it might be an issue, since
| > fs/gfs2/glops.c:inode_go_sync() calls filemap_fdatawait() on the
| > metamapping. It might potentially be an issue in other cases too,
| > 
| > Steve.
| > 
| 
| Some of those do sound problematic.
| 
| Again though, we're only waiting on writeback here, and I assume with
| gfs2 that would only be pages that were written on the local node.
| 
| Is it possible to have pages under writeback and in still in the tree,
| but that are beyond the current i_size? It seems like that's the main
| worrisome case.
| 
| --
| Jeff Layton <jlayton@redhat.com>

Hi Jeff,

I believe the answer is yes.

I was recently "bitten" by a case where (whether due to a bug or not)
I had blocks allocated in a GFS2 file beyond i_size. I had implemented a
delete algorithm that used i_size, but I found cases where files couldn't
be deleted because of blocks hanging out past EOF. I'm not sure if they
can be in writeback, but possibly. It's already on my "to investigate"
list, but I haven't gotten to it yet. Yes, it seems like a bug. Yes, we
need to fix it. But now there may be lots of legacy file systems out in
the field that have this problem. Not sure if they can get to writeback
until I study the situation more closely.

I believe Ben Marzinski also may have come across a case in which we
can have blocks in writeback that are beyond i_size. See the commit
message on Ben's patch here:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=fd4c5748b8d3f7420e8932ed0bde3d53cc8acc9d

Regards,

Bob Peterson
Red Hat File Systems
Jeff Layton July 31, 2017, 1 p.m. UTC | #8
On Mon, 2017-07-31 at 14:07 +0200, Jan Kara wrote:
> On Mon 31-07-17 07:44:16, Jeff Layton wrote:
> > On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > > On 31/07/17 12:27, Jeff Layton wrote:
> > > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > > +int file_write_and_wait(struct file *file)
> > > > > > > +{
> > > > > > > +	int err = 0, err2;
> > > > > > > +	struct address_space *mapping = file->f_mapping;
> > > > > > > +
> > > > > > > +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > > +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > > +		err = filemap_fdatawrite(mapping);
> > > > > > > +		/* See comment of filemap_write_and_wait() */
> > > > > > > +		if (err != -EIO) {
> > > > > > > +			loff_t i_size = i_size_read(mapping->host);
> > > > > > > +
> > > > > > > +			if (i_size != 0)
> > > > > > > +				__filemap_fdatawait_range(mapping, 0,
> > > > > > > +							  i_size - 1);
> > > > > > > +		}
> > > > > > > +	}
> > > > > > 
> > > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > > around possible races with file operations modifying i_size.
> > > > > > 
> > > > > > 								Honza
> > > > > 
> > > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > > behavior. For example:
> > > > > 
> > > > > -----------------8<----------------
> > > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > >          int sync_mode)
> > > > > {
> > > > >          return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > > }
> > > > > 
> > > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > > {
> > > > >          return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > > }
> > > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > > -----------------8<----------------
> > > > > 
> > > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > > 
> > > > > My assumption was that it was intentionally designed that way, but I'm
> > > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > > file_write_and_wait a static inline wrapper around
> > > > > file_write_and_wait_range.
> > > > 
> > > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > > not, then that does simplify the code a bit.
> > > > 
> > > > -------------------8<--------------------
> > > > 
> > > > [PATCH] small wait_on_page_writeback_range() optimization
> > > > 
> > > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > > parameter.  This is not needed since we know the EOF from the inode.  Use
> > > > that instead.
> > > > 
> > > > Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> > > > Signed-off-by: Andrew Morton <akpm@osdl.org>
> > > > Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > > > ---
> > > >   mm/filemap.c | 8 +++++++-
> > > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 78e18b7639b6..55fb7b4141e4 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > >    */
> > > >   int filemap_fdatawait(struct address_space *mapping)
> > > >   {
> > > > -	return wait_on_page_writeback_range(mapping, 0, -1);
> > > > +	loff_t i_size = i_size_read(mapping->host);
> > > > +
> > > > +	if (i_size == 0)
> > > > +		return 0;
> > > > +
> > > > +	return wait_on_page_writeback_range(mapping, 0,
> > > > +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> > > >   }
> > > >   EXPORT_SYMBOL(filemap_fdatawait);
> > > > 
> > > 
> > > Does this ever get called in cases where we would not hold fs locks? In 
> > > that case we definitely don't want to be relying on i_size,
> > > 
> > > Steve.
> > > 
> > 
> > Yes. We can initiate and wait on writeback from any context where you
> > can sleep, really.
> > 
> > We're just waiting on whole file writeback here, so I don't think
> > there's anything wrong. As long as the i_size was valid at some point in
> > time prior to waiting then you're ok.
> > 
> > The question I have is more whether this optimization is still useful. 
> > 
> > What we do now is just walk the radix tree and wait_on_page_writeback
> > for each page. Do we gain anything by avoiding ranges beyond the current
> > EOF with the pagecache infrastructure of 2017?
> 
> FWIW I'm not aware of any significant benefit of using i_size in
> filemap_fdatawait() - we iterate to the end of the radix tree node anyway
> since pagevec_lookup_tag() does not support range searches anyway (I'm
> working on fixing that however even after that the benefit would be still
> rather marginal).
> 
> What Marcello might have meant even back in 2004 was that if we are in the
> middle of truncate, i_size is already reduced but page cache not truncated
> yet, then filemap_fdatawait() does not have to wait for writeback of
> truncated pages. That might be a noticeable benefit even today if such race
> happens however I'm not sure it's worth optimizing for and surprises
> arising from randomly snapshotting i_size (which especially for clustered
> filesystems may be out of date) IMHO overweight the possible advantage.
> 
> 								Honza

Thanks for clarifying.

Given that file_write_and_wait is a new helper function anyway, I'll
just make it a wrapper around file_write_and_wait_range. Since it might
be racy, should remove this optimization from the "legacy"
filemap_fdatawait / filemap_fdatawait_keep_errors calls?
Jan Kara July 31, 2017, 1:32 p.m. UTC | #9
On Mon 31-07-17 09:00:37, Jeff Layton wrote:
> On Mon, 2017-07-31 at 14:07 +0200, Jan Kara wrote:
> > On Mon 31-07-17 07:44:16, Jeff Layton wrote:
> > > On Mon, 2017-07-31 at 12:32 +0100, Steven Whitehouse wrote:
> > > > On 31/07/17 12:27, Jeff Layton wrote:
> > > > > On Thu, 2017-07-27 at 08:48 -0400, Jeff Layton wrote:
> > > > > > On Thu, 2017-07-27 at 10:49 +0200, Jan Kara wrote:
> > > > > > > On Wed 26-07-17 13:55:36, Jeff Layton wrote:
> > > > > > > > +int file_write_and_wait(struct file *file)
> > > > > > > > +{
> > > > > > > > +	int err = 0, err2;
> > > > > > > > +	struct address_space *mapping = file->f_mapping;
> > > > > > > > +
> > > > > > > > +	if ((!dax_mapping(mapping) && mapping->nrpages) ||
> > > > > > > > +	    (dax_mapping(mapping) && mapping->nrexceptional)) {
> > > > > > > > +		err = filemap_fdatawrite(mapping);
> > > > > > > > +		/* See comment of filemap_write_and_wait() */
> > > > > > > > +		if (err != -EIO) {
> > > > > > > > +			loff_t i_size = i_size_read(mapping->host);
> > > > > > > > +
> > > > > > > > +			if (i_size != 0)
> > > > > > > > +				__filemap_fdatawait_range(mapping, 0,
> > > > > > > > +							  i_size - 1);
> > > > > > > > +		}
> > > > > > > > +	}
> > > > > > > 
> > > > > > > Err, what's the i_size check doing here? I'd just pass ~0 as the end of the
> > > > > > > range and ignore i_size. It is much easier than trying to wrap your head
> > > > > > > around possible races with file operations modifying i_size.
> > > > > > > 
> > > > > > > 								Honza
> > > > > > 
> > > > > > I'm basically emulating _exactly_ what filemap_write_and_wait does here,
> > > > > > as I'm leery of making subtle behavior changes in the actual writeback
> > > > > > behavior. For example:
> > > > > > 
> > > > > > -----------------8<----------------
> > > > > > static inline int __filemap_fdatawrite(struct address_space *mapping,
> > > > > >          int sync_mode)
> > > > > > {
> > > > > >          return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
> > > > > > }
> > > > > > 
> > > > > > int filemap_fdatawrite(struct address_space *mapping)
> > > > > > {
> > > > > >          return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
> > > > > > }
> > > > > > EXPORT_SYMBOL(filemap_fdatawrite);
> > > > > > -----------------8<----------------
> > > > > > 
> > > > > > ...which then sets up the wbc with the right ranges and sync mode and
> > > > > > kicks off writepages. But then, it does the i_size_read to figure out
> > > > > > what range it should wait on (with the shortcut for the size == 0 case).
> > > > > > 
> > > > > > My assumption was that it was intentionally designed that way, but I'm
> > > > > > guessing from your comments that it wasn't? If so, then we can turn
> > > > > > file_write_and_wait a static inline wrapper around
> > > > > > file_write_and_wait_range.
> > > > > 
> > > > > FWIW, I did a bit of archaeology in the linux-history tree and found
> > > > > this patch from Marcelo in 2004. Is this optimization still helpful? If
> > > > > not, then that does simplify the code a bit.
> > > > > 
> > > > > -------------------8<--------------------
> > > > > 
> > > > > [PATCH] small wait_on_page_writeback_range() optimization
> > > > > 
> > > > > filemap_fdatawait() calls wait_on_page_writeback_range() with -1 as "end"
> > > > > parameter.  This is not needed since we know the EOF from the inode.  Use
> > > > > that instead.
> > > > > 
> > > > > Signed-off-by: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
> > > > > Signed-off-by: Andrew Morton <akpm@osdl.org>
> > > > > Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > > > > ---
> > > > >   mm/filemap.c | 8 +++++++-
> > > > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 78e18b7639b6..55fb7b4141e4 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -287,7 +287,13 @@ EXPORT_SYMBOL(sync_page_range);
> > > > >    */
> > > > >   int filemap_fdatawait(struct address_space *mapping)
> > > > >   {
> > > > > -	return wait_on_page_writeback_range(mapping, 0, -1);
> > > > > +	loff_t i_size = i_size_read(mapping->host);
> > > > > +
> > > > > +	if (i_size == 0)
> > > > > +		return 0;
> > > > > +
> > > > > +	return wait_on_page_writeback_range(mapping, 0,
> > > > > +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> > > > >   }
> > > > >   EXPORT_SYMBOL(filemap_fdatawait);
> > > > > 
> > > > 
> > > > Does this ever get called in cases where we would not hold fs locks? In 
> > > > that case we definitely don't want to be relying on i_size,
> > > > 
> > > > Steve.
> > > > 
> > > 
> > > Yes. We can initiate and wait on writeback from any context where you
> > > can sleep, really.
> > > 
> > > We're just waiting on whole file writeback here, so I don't think
> > > there's anything wrong. As long as the i_size was valid at some point in
> > > time prior to waiting then you're ok.
> > > 
> > > The question I have is more whether this optimization is still useful. 
> > > 
> > > What we do now is just walk the radix tree and wait_on_page_writeback
> > > for each page. Do we gain anything by avoiding ranges beyond the current
> > > EOF with the pagecache infrastructure of 2017?
> > 
> > FWIW I'm not aware of any significant benefit of using i_size in
> > filemap_fdatawait() - we iterate to the end of the radix tree node anyway
> > since pagevec_lookup_tag() does not support range searches anyway (I'm
> > working on fixing that however even after that the benefit would be still
> > rather marginal).
> > 
> > What Marcello might have meant even back in 2004 was that if we are in the
> > middle of truncate, i_size is already reduced but page cache not truncated
> > yet, then filemap_fdatawait() does not have to wait for writeback of
> > truncated pages. That might be a noticeable benefit even today if such race
> > happens however I'm not sure it's worth optimizing for and surprises
> > arising from randomly snapshotting i_size (which especially for clustered
> > filesystems may be out of date) IMHO overweight the possible advantage.
> > 
> > 								Honza
> 
> Thanks for clarifying.
> 
> Given that file_write_and_wait is a new helper function anyway, I'll
> just make it a wrapper around file_write_and_wait_range. Since it might

Agreed.

> be racy, should remove this optimization from the "legacy"
> filemap_fdatawait / filemap_fdatawait_keep_errors calls?

I'm for it.

								Honza

Patch
diff mbox

diff --git a/mm/filemap.c b/mm/filemap.c
index 78e18b7639b6..55fb7b4141e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -287,7 +287,13 @@  EXPORT_SYMBOL(sync_page_range);
  */
 int filemap_fdatawait(struct address_space *mapping)
 {
-	return wait_on_page_writeback_range(mapping, 0, -1);
+	loff_t i_size = i_size_read(mapping->host);
+
+	if (i_size == 0)
+		return 0;
+
+	return wait_on_page_writeback_range(mapping, 0,
+				(i_size - 1) >> PAGE_CACHE_SHIFT);
 }
 EXPORT_SYMBOL(filemap_fdatawait);