diff mbox

Remove filesize checks for sync I/O journal commit

Message ID 20141104155222.GA5746@shrek.lan (mailing list archive)
State New, archived
Headers show

Commit Message

Goldwyn Rodrigues Nov. 4, 2014, 3:52 p.m. UTC
Filesize is not a good indication that the file needs to be synced.
An example where this breaks is:
 1. Open the file in O_SYNC|O_RDWR
 2. Read a small portion of the file (say 64 bytes)
 3. Lseek to starting of the file
 4. Write 64 bytes

If the node crashes, it is not written out to disk because this
was not committed in the journal and the other node which reads
the file after recovery reads stale data (even if the write on
the other node was successful)

--- 
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 324dc93..69fb9f7 100644

Comments

Andrew Morton Nov. 4, 2014, 10:45 p.m. UTC | #1
On Tue, 4 Nov 2014 09:52:22 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:

> Filesize is not a good indication that the file needs to be synced.
> An example where this breaks is:
>  1. Open the file in O_SYNC|O_RDWR
>  2. Read a small portion of the file (say 64 bytes)
>  3. Lseek to starting of the file
>  4. Write 64 bytes
> 
> If the node crashes, it is not written out to disk because this
> was not committed in the journal and the other node which reads
> the file after recovery reads stale data (even if the write on
> the other node was successful)
> 
> ...
>
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2381,9 +2381,7 @@ out_dio:
>  		if (ret < 0)
>  			written = ret;
>  
> -		if (!ret && ((old_size != i_size_read(inode)) ||
> -			     (old_clusters != OCFS2_I(inode)->ip_clusters) ||
> -			     has_refcount)) {
> +		if (!ret) {
>  			ret = jbd2_journal_force_commit(osb->journal->j_journal);
>  			if (ret < 0)
>  				written = ret;

Can we have a signed-off-by for this, please?
Goldwyn Rodrigues Nov. 5, 2014, 3:26 p.m. UTC | #2
On 11/04/2014 04:45 PM, Andrew Morton wrote:
> On Tue, 4 Nov 2014 09:52:22 -0600 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
>> Filesize is not a good indication that the file needs to be synced.
>> An example where this breaks is:
>>   1. Open the file in O_SYNC|O_RDWR
>>   2. Read a small portion of the file (say 64 bytes)
>>   3. Lseek to starting of the file
>>   4. Write 64 bytes
>>
>> If the node crashes, it is not written out to disk because this
>> was not committed in the journal and the other node which reads
>> the file after recovery reads stale data (even if the write on
>> the other node was successful)
>>
>> ...
>>
>> --- a/fs/ocfs2/file.c
>> +++ b/fs/ocfs2/file.c
>> @@ -2381,9 +2381,7 @@ out_dio:
>>   		if (ret < 0)
>>   			written = ret;
>>
>> -		if (!ret && ((old_size != i_size_read(inode)) ||
>> -			     (old_clusters != OCFS2_I(inode)->ip_clusters) ||
>> -			     has_refcount)) {
>> +		if (!ret) {
>>   			ret = jbd2_journal_force_commit(osb->journal->j_journal);
>>   			if (ret < 0)
>>   				written = ret;
>
> Can we have a signed-off-by for this, please?
>

Oops. Missed that.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Mark Fasheh Nov. 5, 2014, 9:18 p.m. UTC | #3
On Tue, Nov 04, 2014 at 09:52:22AM -0600, Goldwyn Rodrigues wrote:
> Filesize is not a good indication that the file needs to be synced.
> An example where this breaks is:
>  1. Open the file in O_SYNC|O_RDWR
>  2. Read a small portion of the file (say 64 bytes)
>  3. Lseek to starting of the file
>  4. Write 64 bytes
> 
> If the node crashes, it is not written out to disk because this
> was not committed in the journal and the other node which reads
> the file after recovery reads stale data (even if the write on
> the other node was successful)

This patch looks good, thanks for sending it over.

Reviewed-by: Mark Fasheh <mfasheh@suse.de>
	--Mark

--
Mark Fasheh
Xue jiufei Dec. 3, 2014, 8:20 a.m. UTC | #4
Hi Goldwyn,
On 2014/11/4 23:52, Goldwyn Rodrigues wrote:
> Filesize is not a good indication that the file needs to be synced.
> An example where this breaks is:
>  1. Open the file in O_SYNC|O_RDWR
>  2. Read a small portion of the file (say 64 bytes)
>  3. Lseek to starting of the file
>  4. Write 64 bytes
> 
> If the node crashes, it is not written out to disk because this
> was not committed in the journal and the other node which reads
> the file after recovery reads stale data (even if the write on
> the other node was successful)
> 
I have a question that if user open the file with O_SYNC,
it will call filemap_fdatawait_range() after generic_perform_write()
to wait until data reaches the disk, why another node may read the
stale data?

Thanks,
Xuejiufei

> --- 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 324dc93..69fb9f7 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2381,9 +2381,7 @@ out_dio:
>  		if (ret < 0)
>  			written = ret;
>  
> -		if (!ret && ((old_size != i_size_read(inode)) ||
> -			     (old_clusters != OCFS2_I(inode)->ip_clusters) ||
> -			     has_refcount)) {
> +		if (!ret) {
>  			ret = jbd2_journal_force_commit(osb->journal->j_journal);
>  			if (ret < 0)
>  				written = ret;
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
diff mbox

Patch

--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2381,9 +2381,7 @@  out_dio:
 		if (ret < 0)
 			written = ret;
 
-		if (!ret && ((old_size != i_size_read(inode)) ||
-			     (old_clusters != OCFS2_I(inode)->ip_clusters) ||
-			     has_refcount)) {
+		if (!ret) {
 			ret = jbd2_journal_force_commit(osb->journal->j_journal);
 			if (ret < 0)
 				written = ret;