diff mbox

[RESEND] fs: always set I_DIRTY_TIME to fsync correctly on lazytime

Message ID 20161031190245.13404-1-naota@elisp.net (mailing list archive)
State New, archived
Headers show

Commit Message

Naohiro Aota Oct. 31, 2016, 7:02 p.m. UTC
While lazytime states that "The on-disk timestamps are updated only
when: ... - the application employs fsync(2), syncfs(2), or sync(2)"
[1], it does not write a timestamp update on fsync().

[1] http://manpages.ubuntu.com/manpages/xenial/man8/mount.8.html

The following commands will reproduce the problem:

$ mount -o noatime,lazytime ext4.img /mnt/tmp
$ cd /mnt/tmp
(create an 128M file to fio, not to observe size update)
$ dd if=/dev/zero of=wxyz.0.0 bs=1M count=128
(do write/fsync)
$ fio --name wxyz --direct=1 --buffered=0 --size=128m --bs=64k --rw=write \
  --ioengine=sync --numjobs=1 --fsync=5

Since fio invokes 1 fsync per 5 writes, we should see rapid journal
commits for timestamp update by tracing jbd2:jbd2_end_commit trace
point. Only we can see are, however, some periodic (~5 sec) commits from
bdi flush like below.

$ trace jbd2:jbd2_end_commit
    jbd2/loop0-8-1617  [002] ....    96.637351: jbd2_end_commit: dev 7,0 transaction 5393 sync 0 head 5393
    jbd2/loop0-8-1617  [000] ....   101.679411: jbd2_end_commit: dev 7,0 transaction 5394 sync 0 head 5393
    jbd2/loop0-8-1617  [003] ....   106.743628: jbd2_end_commit: dev 7,0 transaction 5395 sync 0 head 5393
    jbd2/loop0-8-1617  [001] ....   111.801964: jbd2_end_commit: dev 7,0 transaction 5396 sync 0 head 5393
...

The problem is __mark_inode_dirty() does not always flag I_DIRTY_TIME.
It seems that it is no use to mark an inode I_DIRTY_TIME when the inode
is already I_DIRTY_INODE. However, by that decision, we're skipping
journal write if we invoke two fsync()s between two bdi flushes. As the
following table shows, any fsync after the first fsync do nothing (if
there's no update other than timestamp).

Event                | i_state      | journal
---------------------+--------------+------------------------
<timestamp update>   | I_DIRTY_TIME | no write (lazytime)
<fsync>              | I_DIRTY_SYNC | write timestamp update
<timestamp update>   | I_DIRTY_SYNC | no write (lazytime)
<fsync>              | I_DIRTY_SYNC | no write *BUG*
...
<bdi flush>          | 0            |
<timestamp update>   | I_DIRTY_TIME | no write (lazytime)
<fsync>              | I_DIRTY_SYNC | write timestamp update

We should set I_DIRTY_TIME on the second timestamp update to let fsync()
notice there's a timestamp update after the last inode writeout.

After this patch, we can see rapid trace of journal commit:
$ trace jbd2:jbd2_end_commit
    jbd2/loop0-8-1879  [002] ....   208.275057: jbd2_end_commit: dev 7,0 transaction 5364 sync 0 head 3343
    jbd2/loop0-8-1879  [000] ....   208.302539: jbd2_end_commit: dev 7,0 transaction 5365 sync 0 head 3343
    jbd2/loop0-8-1879  [000] ....   208.327238: jbd2_end_commit: dev 7,0 transaction 5366 sync 0 head 3343
    jbd2/loop0-8-1879  [003] ....   208.347618: jbd2_end_commit: dev 7,0 transaction 5367 sync 0 head 3343
...

Reported-by: Asraa Ali Mardan <asraaiteng@gmail.com>
Signed-off-by: Naohiro Aota <naota@elisp.net>
---

 fs/fs-writeback.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Jan Kara Oct. 31, 2016, 10:46 p.m. UTC | #1
On Tue 01-11-16 04:02:45, Naohiro Aota wrote:
> While lazytime states that "The on-disk timestamps are updated only
> when: ... - the application employs fsync(2), syncfs(2), or sync(2)"
> [1], it does not write a timestamp update on fsync().
> 
> [1] http://manpages.ubuntu.com/manpages/xenial/man8/mount.8.html
> 
> The following commands will reproduce the problem:
> 
> $ mount -o noatime,lazytime ext4.img /mnt/tmp
> $ cd /mnt/tmp
> (create an 128M file to fio, not to observe size update)
> $ dd if=/dev/zero of=wxyz.0.0 bs=1M count=128
> (do write/fsync)
> $ fio --name wxyz --direct=1 --buffered=0 --size=128m --bs=64k --rw=write \
>   --ioengine=sync --numjobs=1 --fsync=5
> 
> Since fio invokes 1 fsync per 5 writes, we should see rapid journal
> commits for timestamp update by tracing jbd2:jbd2_end_commit trace
> point. Only we can see are, however, some periodic (~5 sec) commits from
> bdi flush like below.
> 
> $ trace jbd2:jbd2_end_commit
>     jbd2/loop0-8-1617  [002] ....    96.637351: jbd2_end_commit: dev 7,0 transaction 5393 sync 0 head 5393
>     jbd2/loop0-8-1617  [000] ....   101.679411: jbd2_end_commit: dev 7,0 transaction 5394 sync 0 head 5393
>     jbd2/loop0-8-1617  [003] ....   106.743628: jbd2_end_commit: dev 7,0 transaction 5395 sync 0 head 5393
>     jbd2/loop0-8-1617  [001] ....   111.801964: jbd2_end_commit: dev 7,0 transaction 5396 sync 0 head 5393
> ...
> 
> The problem is __mark_inode_dirty() does not always flag I_DIRTY_TIME.
> It seems that it is no use to mark an inode I_DIRTY_TIME when the inode
> is already I_DIRTY_INODE. However, by that decision, we're skipping
> journal write if we invoke two fsync()s between two bdi flushes. As the
> following table shows, any fsync after the first fsync do nothing (if
> there's no update other than timestamp).
> 
> Event                | i_state      | journal
> ---------------------+--------------+------------------------
> <timestamp update>   | I_DIRTY_TIME | no write (lazytime)
> <fsync>              | I_DIRTY_SYNC | write timestamp update
> <timestamp update>   | I_DIRTY_SYNC | no write (lazytime)
> <fsync>              | I_DIRTY_SYNC | no write *BUG*
> ...
> <bdi flush>          | 0            |
> <timestamp update>   | I_DIRTY_TIME | no write (lazytime)
> <fsync>              | I_DIRTY_SYNC | write timestamp update
> 
> We should set I_DIRTY_TIME on the second timestamp update to let fsync()
> notice there's a timestamp update after the last inode writeout.
> 
> After this patch, we can see rapid trace of journal commit:
> $ trace jbd2:jbd2_end_commit
>     jbd2/loop0-8-1879  [002] ....   208.275057: jbd2_end_commit: dev 7,0 transaction 5364 sync 0 head 3343
>     jbd2/loop0-8-1879  [000] ....   208.302539: jbd2_end_commit: dev 7,0 transaction 5365 sync 0 head 3343
>     jbd2/loop0-8-1879  [000] ....   208.327238: jbd2_end_commit: dev 7,0 transaction 5366 sync 0 head 3343
>     jbd2/loop0-8-1879  [003] ....   208.347618: jbd2_end_commit: dev 7,0 transaction 5367 sync 0 head 3343
> ...
> 
> Reported-by: Asraa Ali Mardan <asraaiteng@gmail.com>
> Signed-off-by: Naohiro Aota <naota@elisp.net>

Thanks for the patch. It makes sense. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Jens, can you please merge the patch? Thanks!

								Honza
> ---
> 
>  fs/fs-writeback.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 05713a5..ace628c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2100,16 +2100,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  	 */
>  	smp_mb();
>  
> -	if (((inode->i_state & flags) == flags) ||
> -	    (dirtytime && (inode->i_state & I_DIRTY_INODE)))
> +	if ((inode->i_state & flags) == flags)
>  		return;
>  
>  	if (unlikely(block_dump))
>  		block_dump___mark_inode_dirty(inode);
>  
>  	spin_lock(&inode->i_lock);
> -	if (dirtytime && (inode->i_state & I_DIRTY_INODE))
> +	if (dirtytime && (inode->i_state & I_DIRTY_INODE)) {
> +		inode->i_state |= I_DIRTY_TIME;
>  		goto out_unlock_inode;
> +	}
>  	if ((inode->i_state & flags) != flags) {
>  		const int was_dirty = inode->i_state & I_DIRTY;
>  
> -- 
> 2.8.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Naohiro Aota March 17, 2017, 6:06 a.m. UTC | #2
Hello, all

What is the status of this patch? Can this be picked up for some tree?

Regards,
Naohiro

2016-11-01 7:46 GMT+09:00 Jan Kara <jack@suse.cz>:
> On Tue 01-11-16 04:02:45, Naohiro Aota wrote:
>> While lazytime states that "The on-disk timestamps are updated only
>> when: ... - the application employs fsync(2), syncfs(2), or sync(2)"
>> [1], it does not write a timestamp update on fsync().
>>
>> [1] http://manpages.ubuntu.com/manpages/xenial/man8/mount.8.html
>>
>> The following commands will reproduce the problem:
>>
>> $ mount -o noatime,lazytime ext4.img /mnt/tmp
>> $ cd /mnt/tmp
>> (create an 128M file to fio, not to observe size update)
>> $ dd if=/dev/zero of=wxyz.0.0 bs=1M count=128
>> (do write/fsync)
>> $ fio --name wxyz --direct=1 --buffered=0 --size=128m --bs=64k --rw=write \
>>   --ioengine=sync --numjobs=1 --fsync=5
>>
>> Since fio invokes 1 fsync per 5 writes, we should see rapid journal
>> commits for timestamp update by tracing jbd2:jbd2_end_commit trace
>> point. Only we can see are, however, some periodic (~5 sec) commits from
>> bdi flush like below.
>>
>> $ trace jbd2:jbd2_end_commit
>>     jbd2/loop0-8-1617  [002] ....    96.637351: jbd2_end_commit: dev 7,0 transaction 5393 sync 0 head 5393
>>     jbd2/loop0-8-1617  [000] ....   101.679411: jbd2_end_commit: dev 7,0 transaction 5394 sync 0 head 5393
>>     jbd2/loop0-8-1617  [003] ....   106.743628: jbd2_end_commit: dev 7,0 transaction 5395 sync 0 head 5393
>>     jbd2/loop0-8-1617  [001] ....   111.801964: jbd2_end_commit: dev 7,0 transaction 5396 sync 0 head 5393
>> ...
>>
>> The problem is __mark_inode_dirty() does not always flag I_DIRTY_TIME.
>> It seems that it is no use to mark an inode I_DIRTY_TIME when the inode
>> is already I_DIRTY_INODE. However, by that decision, we're skipping
>> journal write if we invoke two fsync()s between two bdi flushes. As the
>> following table shows, any fsync after the first fsync do nothing (if
>> there's no update other than timestamp).
>>
>> Event                | i_state      | journal
>> ---------------------+--------------+------------------------
>> <timestamp update>   | I_DIRTY_TIME | no write (lazytime)
>> <fsync>              | I_DIRTY_SYNC | write timestamp update
>> <timestamp update>   | I_DIRTY_SYNC | no write (lazytime)
>> <fsync>              | I_DIRTY_SYNC | no write *BUG*
>> ...
>> <bdi flush>          | 0            |
>> <timestamp update>   | I_DIRTY_TIME | no write (lazytime)
>> <fsync>              | I_DIRTY_SYNC | write timestamp update
>>
>> We should set I_DIRTY_TIME on the second timestamp update to let fsync()
>> notice there's a timestamp update after the last inode writeout.
>>
>> After this patch, we can see rapid trace of journal commit:
>> $ trace jbd2:jbd2_end_commit
>>     jbd2/loop0-8-1879  [002] ....   208.275057: jbd2_end_commit: dev 7,0 transaction 5364 sync 0 head 3343
>>     jbd2/loop0-8-1879  [000] ....   208.302539: jbd2_end_commit: dev 7,0 transaction 5365 sync 0 head 3343
>>     jbd2/loop0-8-1879  [000] ....   208.327238: jbd2_end_commit: dev 7,0 transaction 5366 sync 0 head 3343
>>     jbd2/loop0-8-1879  [003] ....   208.347618: jbd2_end_commit: dev 7,0 transaction 5367 sync 0 head 3343
>> ...
>>
>> Reported-by: Asraa Ali Mardan <asraaiteng@gmail.com>
>> Signed-off-by: Naohiro Aota <naota@elisp.net>
>
> Thanks for the patch. It makes sense. You can add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> Jens, can you please merge the patch? Thanks!
>
>                                                                 Honza
>> ---
>>
>>  fs/fs-writeback.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 05713a5..ace628c 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -2100,16 +2100,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>>        */
>>       smp_mb();
>>
>> -     if (((inode->i_state & flags) == flags) ||
>> -         (dirtytime && (inode->i_state & I_DIRTY_INODE)))
>> +     if ((inode->i_state & flags) == flags)
>>               return;
>>
>>       if (unlikely(block_dump))
>>               block_dump___mark_inode_dirty(inode);
>>
>>       spin_lock(&inode->i_lock);
>> -     if (dirtytime && (inode->i_state & I_DIRTY_INODE))
>> +     if (dirtytime && (inode->i_state & I_DIRTY_INODE)) {
>> +             inode->i_state |= I_DIRTY_TIME;
>>               goto out_unlock_inode;
>> +     }
>>       if ((inode->i_state & flags) != flags) {
>>               const int was_dirty = inode->i_state & I_DIRTY;
>>
>> --
>> 2.8.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
diff mbox

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5..ace628c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2100,16 +2100,17 @@  void __mark_inode_dirty(struct inode *inode, int flags)
 	 */
 	smp_mb();
 
-	if (((inode->i_state & flags) == flags) ||
-	    (dirtytime && (inode->i_state & I_DIRTY_INODE)))
+	if ((inode->i_state & flags) == flags)
 		return;
 
 	if (unlikely(block_dump))
 		block_dump___mark_inode_dirty(inode);
 
 	spin_lock(&inode->i_lock);
-	if (dirtytime && (inode->i_state & I_DIRTY_INODE))
+	if (dirtytime && (inode->i_state & I_DIRTY_INODE)) {
+		inode->i_state |= I_DIRTY_TIME;
 		goto out_unlock_inode;
+	}
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;