ocfs2: return error when we attempt to access a dirty bh in jbd2
diff mbox

Message ID 5A6943BC.5010406@huawei.com
State New
Headers show

Commit Message

piaojun Jan. 25, 2018, 2:41 a.m. UTC
We should not reuse the dirty bh in jbd2 directly due to the following
situation:

1. When removing extent rec, we will dirty the bhs of extent rec and
   truncate log at the same time, and hand them over to jbd2.
2. The bhs are not flushed to disk due to abnormal storage link.
3. After a while the storage link become normal. Truncate log flush
   worker triggered by the next space reclaiming found the dirty bh of
   truncate log and clear its 'BH_Write_EIO' and then set it uptodate
   in __ocfs2_journal_access():

ocfs2_truncate_log_worker
  ocfs2_flush_truncate_log
    __ocfs2_flush_truncate_log
      ocfs2_replay_truncate_records
        ocfs2_journal_access_di
          __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.

4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
   extent rec is still in error state, and unfortunately nobody will
   take care of it.
5. At last the space of extent rec was not reduced, but truncate log
   flush worker have given it back to globalalloc. That will cause
   duplicate cluster problem which could be identified by fsck.ocfs2.

So we should return -EIO in case of ruining atomicity and consistency
of space reclaim.

Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")

Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
---
 fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

Comments

Gang He Jan. 25, 2018, 8:40 a.m. UTC | #1
Hi Jun,

If we return -EIO here, what is the consequence?
make the journal aborted and file system will become read-only?

Thanks
Gang


>>> 
> We should not reuse the dirty bh in jbd2 directly due to the following
> situation:
> 
> 1. When removing extent rec, we will dirty the bhs of extent rec and
>    truncate log at the same time, and hand them over to jbd2.
> 2. The bhs are not flushed to disk due to abnormal storage link.
> 3. After a while the storage link become normal. Truncate log flush
>    worker triggered by the next space reclaiming found the dirty bh of
>    truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>    in __ocfs2_journal_access():
> 
> ocfs2_truncate_log_worker
>   ocfs2_flush_truncate_log
>     __ocfs2_flush_truncate_log
>       ocfs2_replay_truncate_records
>         ocfs2_journal_access_di
>           __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
> uptodata.
> 
> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>    extent rec is still in error state, and unfortunately nobody will
>    take care of it.
> 5. At last the space of extent rec was not reduced, but truncate log
>    flush worker have given it back to globalalloc. That will cause
>    duplicate cluster problem which could be identified by fsck.ocfs2.
> 
> So we should return -EIO in case of ruining atomicity and consistency
> of space reclaim.
> 
> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
> __ocfs2_journal_access")
> 
> Signed-off-by: Jun Piao <piaojun@huawei.com>
> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
> ---
>  fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 3630443..d769ca2 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>  	/* we can safely remove this assertion after testing. */
>  	if (!buffer_uptodate(bh)) {
>  		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
> -		     (unsigned long long)bh->b_blocknr);
> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
> 
>  		lock_buffer(bh);
>  		/*
> -		 * A previous attempt to write this buffer head failed.
> -		 * Nothing we can do but to retry the write and hope for
> -		 * the best.
> +		 * We should not reuse the dirty bh directly due to the
> +		 * following situation:
> +		 *
> +		 * 1. When removing extent rec, we will dirty the bhs of
> +		 *    extent rec and truncate log at the same time, and
> +		 *    hand them over to jbd2.
> +		 * 2. The bhs are not flushed to disk due to abnormal
> +		 *    storage link.
> +		 * 3. After a while the storage link become normal.
> +		 *    Truncate log flush worker triggered by the next
> +		 *    space reclaiming found the dirty bh of truncate log
> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
> +		 *    in __ocfs2_journal_access():
> +		 *
> +		 *    ocfs2_truncate_log_worker
> +		 *      ocfs2_flush_truncate_log
> +		 *        __ocfs2_flush_truncate_log
> +		 *          ocfs2_replay_truncate_records
> +		 *            ocfs2_journal_access_di
> +		 *              __ocfs2_journal_access
> +		 *
> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
> +		 *    but the bh of extent rec is still in error state, and
> +		 *    unfortunately nobody will take care of it.
> +		 * 5. At last the space of extent rec was not reduced,
> +		 *    but truncate log flush worker have given it back to
> +		 *    globalalloc. That will cause duplicate cluster problem
> +		 *    which could be identified by fsck.ocfs2.
> +		 *
> +		 * So we should return -EIO in case of ruining atomicity
> +		 * and consistency of space reclaim.
>  		 */
>  		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
> -			clear_buffer_write_io_error(bh);
> -			set_buffer_uptodate(bh);
> -		}
> -
> -		if (!buffer_uptodate(bh)) {
> +			mlog(ML_ERROR, "A previous attempt to write this "
> +				"buffer head failed\n");
>  			unlock_buffer(bh);
>  			return -EIO;
>  		}
> -- 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com 
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Changwei Ge Jan. 25, 2018, 11:59 a.m. UTC | #2
Hi Jun,

On 2018/1/25 10:41, piaojun wrote:
> We should not reuse the dirty bh in jbd2 directly due to the following
> situation:
> 
> 1. When removing extent rec, we will dirty the bhs of extent rec and
Quick questions:
Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?

Afterwards, I can understand the issue your change log is describing better.

Thanks,
Changwei

>     truncate log at the same time, and hand them over to jbd2.
> 2. The bhs are not flushed to disk due to abnormal storage link.
> 3. After a while the storage link become normal. Truncate log flush
>     worker triggered by the next space reclaiming found the dirty bh of
>     truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>     in __ocfs2_journal_access():
> 
> ocfs2_truncate_log_worker
>    ocfs2_flush_truncate_log
>      __ocfs2_flush_truncate_log
>        ocfs2_replay_truncate_records
>          ocfs2_journal_access_di
>            __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
> 
> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>     extent rec is still in error state, and unfortunately nobody will
>     take care of it.
> 5. At last the space of extent rec was not reduced, but truncate log
>     flush worker have given it back to globalalloc. That will cause
>     duplicate cluster problem which could be identified by fsck.ocfs2.
> 
> So we should return -EIO in case of ruining atomicity and consistency
> of space reclaim.
> 
> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
> 
> Signed-off-by: Jun Piao <piaojun@huawei.com>
> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
> ---
>   fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>   1 file changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 3630443..d769ca2 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>   	/* we can safely remove this assertion after testing. */
>   	if (!buffer_uptodate(bh)) {
>   		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
> -		     (unsigned long long)bh->b_blocknr);
> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
> 
>   		lock_buffer(bh);
>   		/*
> -		 * A previous attempt to write this buffer head failed.
> -		 * Nothing we can do but to retry the write and hope for
> -		 * the best.
> +		 * We should not reuse the dirty bh directly due to the
> +		 * following situation:
> +		 *
> +		 * 1. When removing extent rec, we will dirty the bhs of
> +		 *    extent rec and truncate log at the same time, and
> +		 *    hand them over to jbd2.
> +		 * 2. The bhs are not flushed to disk due to abnormal
> +		 *    storage link.
> +		 * 3. After a while the storage link become normal.
> +		 *    Truncate log flush worker triggered by the next
> +		 *    space reclaiming found the dirty bh of truncate log
> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
> +		 *    in __ocfs2_journal_access():
> +		 *
> +		 *    ocfs2_truncate_log_worker
> +		 *      ocfs2_flush_truncate_log
> +		 *        __ocfs2_flush_truncate_log
> +		 *          ocfs2_replay_truncate_records
> +		 *            ocfs2_journal_access_di
> +		 *              __ocfs2_journal_access
> +		 *
> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
> +		 *    but the bh of extent rec is still in error state, and
> +		 *    unfortunately nobody will take care of it.
> +		 * 5. At last the space of extent rec was not reduced,
> +		 *    but truncate log flush worker have given it back to
> +		 *    globalalloc. That will cause duplicate cluster problem
> +		 *    which could be identified by fsck.ocfs2.
> +		 *
> +		 * So we should return -EIO in case of ruining atomicity
> +		 * and consistency of space reclaim.
>   		 */
>   		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
> -			clear_buffer_write_io_error(bh);
> -			set_buffer_uptodate(bh);
> -		}
> -
> -		if (!buffer_uptodate(bh)) {
> +			mlog(ML_ERROR, "A previous attempt to write this "
> +				"buffer head failed\n");
>   			unlock_buffer(bh);
>   			return -EIO;
>   		}
>
piaojun Jan. 25, 2018, 12:10 p.m. UTC | #3
Hi Gang,

Filesystem won't become readonly and journal remains normal, so this
problem needs user umount and mount again to recover. And I'm thinking
about if we should set readonly to notice user.

thanks,
Jun

On 2018/1/25 16:40, Gang He wrote:
> Hi Jun,
> 
> If we return -EIO here, what is the consequence?
> make the journal aborted and file system will become read-only?
> 
> Thanks
> Gang
> 
> 
>>>>
>> We should not reuse the dirty bh in jbd2 directly due to the following
>> situation:
>>
>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>    truncate log at the same time, and hand them over to jbd2.
>> 2. The bhs are not flushed to disk due to abnormal storage link.
>> 3. After a while the storage link become normal. Truncate log flush
>>    worker triggered by the next space reclaiming found the dirty bh of
>>    truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>    in __ocfs2_journal_access():
>>
>> ocfs2_truncate_log_worker
>>   ocfs2_flush_truncate_log
>>     __ocfs2_flush_truncate_log
>>       ocfs2_replay_truncate_records
>>         ocfs2_journal_access_di
>>           __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
>> uptodata.
>>
>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>    extent rec is still in error state, and unfortunately nobody will
>>    take care of it.
>> 5. At last the space of extent rec was not reduced, but truncate log
>>    flush worker have given it back to globalalloc. That will cause
>>    duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> So we should return -EIO in case of ruining atomicity and consistency
>> of space reclaim.
>>
>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>> __ocfs2_journal_access")
>>
>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> ---
>>  fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index 3630443..d769ca2 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>  	/* we can safely remove this assertion after testing. */
>>  	if (!buffer_uptodate(bh)) {
>>  		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>> -		     (unsigned long long)bh->b_blocknr);
>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>
>>  		lock_buffer(bh);
>>  		/*
>> -		 * A previous attempt to write this buffer head failed.
>> -		 * Nothing we can do but to retry the write and hope for
>> -		 * the best.
>> +		 * We should not reuse the dirty bh directly due to the
>> +		 * following situation:
>> +		 *
>> +		 * 1. When removing extent rec, we will dirty the bhs of
>> +		 *    extent rec and truncate log at the same time, and
>> +		 *    hand them over to jbd2.
>> +		 * 2. The bhs are not flushed to disk due to abnormal
>> +		 *    storage link.
>> +		 * 3. After a while the storage link become normal.
>> +		 *    Truncate log flush worker triggered by the next
>> +		 *    space reclaiming found the dirty bh of truncate log
>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>> +		 *    in __ocfs2_journal_access():
>> +		 *
>> +		 *    ocfs2_truncate_log_worker
>> +		 *      ocfs2_flush_truncate_log
>> +		 *        __ocfs2_flush_truncate_log
>> +		 *          ocfs2_replay_truncate_records
>> +		 *            ocfs2_journal_access_di
>> +		 *              __ocfs2_journal_access
>> +		 *
>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>> +		 *    but the bh of extent rec is still in error state, and
>> +		 *    unfortunately nobody will take care of it.
>> +		 * 5. At last the space of extent rec was not reduced,
>> +		 *    but truncate log flush worker have given it back to
>> +		 *    globalalloc. That will cause duplicate cluster problem
>> +		 *    which could be identified by fsck.ocfs2.
>> +		 *
>> +		 * So we should return -EIO in case of ruining atomicity
>> +		 * and consistency of space reclaim.
>>  		 */
>>  		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>> -			clear_buffer_write_io_error(bh);
>> -			set_buffer_uptodate(bh);
>> -		}
>> -
>> -		if (!buffer_uptodate(bh)) {
>> +			mlog(ML_ERROR, "A previous attempt to write this "
>> +				"buffer head failed\n");
>>  			unlock_buffer(bh);
>>  			return -EIO;
>>  		}
>> -- 
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com 
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> .
>
piaojun Jan. 25, 2018, 12:17 p.m. UTC | #4
Hi Changwei,

On 2018/1/25 19:59, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/25 10:41, piaojun wrote:
>> We should not reuse the dirty bh in jbd2 directly due to the following
>> situation:
>>
>> 1. When removing extent rec, we will dirty the bhs of extent rec and
> Quick questions:
> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
> 
> Afterwards, I can understand the issue your change log is describing better.
> 
> Thanks,
> Changwei
> 
Yes, I mean they are in the same transaction as below:

ocfs2_remove_btree_range
  ocfs2_remove_extent // modify extent records
    ocfs2_truncate_log_append // modify truncate log

thanks,
Jun

>>     truncate log at the same time, and hand them over to jbd2.
>> 2. The bhs are not flushed to disk due to abnormal storage link.
>> 3. After a while the storage link become normal. Truncate log flush
>>     worker triggered by the next space reclaiming found the dirty bh of
>>     truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>     in __ocfs2_journal_access():
>>
>> ocfs2_truncate_log_worker
>>    ocfs2_flush_truncate_log
>>      __ocfs2_flush_truncate_log
>>        ocfs2_replay_truncate_records
>>          ocfs2_journal_access_di
>>            __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>
>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>     extent rec is still in error state, and unfortunately nobody will
>>     take care of it.
>> 5. At last the space of extent rec was not reduced, but truncate log
>>     flush worker have given it back to globalalloc. That will cause
>>     duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> So we should return -EIO in case of ruining atomicity and consistency
>> of space reclaim.
>>
>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>
>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> ---
>>   fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index 3630443..d769ca2 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>   	/* we can safely remove this assertion after testing. */
>>   	if (!buffer_uptodate(bh)) {
>>   		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>> -		     (unsigned long long)bh->b_blocknr);
>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>
>>   		lock_buffer(bh);
>>   		/*
>> -		 * A previous attempt to write this buffer head failed.
>> -		 * Nothing we can do but to retry the write and hope for
>> -		 * the best.
>> +		 * We should not reuse the dirty bh directly due to the
>> +		 * following situation:
>> +		 *
>> +		 * 1. When removing extent rec, we will dirty the bhs of
>> +		 *    extent rec and truncate log at the same time, and
>> +		 *    hand them over to jbd2.
>> +		 * 2. The bhs are not flushed to disk due to abnormal
>> +		 *    storage link.
>> +		 * 3. After a while the storage link become normal.
>> +		 *    Truncate log flush worker triggered by the next
>> +		 *    space reclaiming found the dirty bh of truncate log
>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>> +		 *    in __ocfs2_journal_access():
>> +		 *
>> +		 *    ocfs2_truncate_log_worker
>> +		 *      ocfs2_flush_truncate_log
>> +		 *        __ocfs2_flush_truncate_log
>> +		 *          ocfs2_replay_truncate_records
>> +		 *            ocfs2_journal_access_di
>> +		 *              __ocfs2_journal_access
>> +		 *
>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>> +		 *    but the bh of extent rec is still in error state, and
>> +		 *    unfortunately nobody will take care of it.
>> +		 * 5. At last the space of extent rec was not reduced,
>> +		 *    but truncate log flush worker have given it back to
>> +		 *    globalalloc. That will cause duplicate cluster problem
>> +		 *    which could be identified by fsck.ocfs2.
>> +		 *
>> +		 * So we should return -EIO in case of ruining atomicity
>> +		 * and consistency of space reclaim.
>>   		 */
>>   		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>> -			clear_buffer_write_io_error(bh);
>> -			set_buffer_uptodate(bh);
>> -		}
>> -
>> -		if (!buffer_uptodate(bh)) {
>> +			mlog(ML_ERROR, "A previous attempt to write this "
>> +				"buffer head failed\n");
>>   			unlock_buffer(bh);
>>   			return -EIO;
>>   		}
>>
> .
>
Changwei Ge Jan. 25, 2018, 12:30 p.m. UTC | #5
Hi Jun,

On 2018/1/25 20:18, piaojun wrote:
> Hi Changwei,
> 
> On 2018/1/25 19:59, Changwei Ge wrote:
>> Hi Jun,
>>
>> On 2018/1/25 10:41, piaojun wrote:
>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>> situation:
>>>
>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>> Quick questions:
>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>
>> Afterwards, I can understand the issue your change log is describing better.
>>
>> Thanks,
>> Changwei
>>
> Yes, I mean they are in the same transaction as below:
> 
> ocfs2_remove_btree_range
>    ocfs2_remove_extent // modify extent records
>      ocfs2_truncate_log_append // modify truncate log

If so I think the transaction including operations on extent and truncate log won't be committed.
And journal should already be aborted if interval transaction commit thread has been woken.
So no metadata will be changed.
And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
Are you sure this is the root cause of your problem?
I feel a little strange for it.

Thanks,
Changwei

> 
> thanks,
> Jun
> 
>>>      truncate log at the same time, and hand them over to jbd2.
>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>> 3. After a while the storage link become normal. Truncate log flush
>>>      worker triggered by the next space reclaiming found the dirty bh of
>>>      truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>      in __ocfs2_journal_access():
>>>
>>> ocfs2_truncate_log_worker
>>>     ocfs2_flush_truncate_log
>>>       __ocfs2_flush_truncate_log
>>>         ocfs2_replay_truncate_records
>>>           ocfs2_journal_access_di
>>>             __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>
>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>      extent rec is still in error state, and unfortunately nobody will
>>>      take care of it.
>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>      flush worker have given it back to globalalloc. That will cause
>>>      duplicate cluster problem which could be identified by fsck.ocfs2.
>>>
>>> So we should return -EIO in case of ruining atomicity and consistency
>>> of space reclaim.
>>>
>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>
>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>> ---
>>>    fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>    1 file changed, 35 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>> index 3630443..d769ca2 100644
>>> --- a/fs/ocfs2/journal.c
>>> +++ b/fs/ocfs2/journal.c
>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>    	/* we can safely remove this assertion after testing. */
>>>    	if (!buffer_uptodate(bh)) {
>>>    		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>> -		     (unsigned long long)bh->b_blocknr);
>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>
>>>    		lock_buffer(bh);
>>>    		/*
>>> -		 * A previous attempt to write this buffer head failed.
>>> -		 * Nothing we can do but to retry the write and hope for
>>> -		 * the best.
>>> +		 * We should not reuse the dirty bh directly due to the
>>> +		 * following situation:
>>> +		 *
>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>> +		 *    extent rec and truncate log at the same time, and
>>> +		 *    hand them over to jbd2.
>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>> +		 *    storage link.
>>> +		 * 3. After a while the storage link become normal.
>>> +		 *    Truncate log flush worker triggered by the next
>>> +		 *    space reclaiming found the dirty bh of truncate log
>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>> +		 *    in __ocfs2_journal_access():
>>> +		 *
>>> +		 *    ocfs2_truncate_log_worker
>>> +		 *      ocfs2_flush_truncate_log
>>> +		 *        __ocfs2_flush_truncate_log
>>> +		 *          ocfs2_replay_truncate_records
>>> +		 *            ocfs2_journal_access_di
>>> +		 *              __ocfs2_journal_access
>>> +		 *
>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>> +		 *    but the bh of extent rec is still in error state, and
>>> +		 *    unfortunately nobody will take care of it.
>>> +		 * 5. At last the space of extent rec was not reduced,
>>> +		 *    but truncate log flush worker have given it back to
>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>> +		 *    which could be identified by fsck.ocfs2.
>>> +		 *
>>> +		 * So we should return -EIO in case of ruining atomicity
>>> +		 * and consistency of space reclaim.
>>>    		 */
>>>    		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>> -			clear_buffer_write_io_error(bh);
>>> -			set_buffer_uptodate(bh);
>>> -		}
>>> -
>>> -		if (!buffer_uptodate(bh)) {
>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>> +				"buffer head failed\n");
>>>    			unlock_buffer(bh);
>>>    			return -EIO;
>>>    		}
>>>
>> .
>>
>
piaojun Jan. 25, 2018, 12:44 p.m. UTC | #6
Hi Changwei,

On 2018/1/25 20:30, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/25 20:18, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/25 19:59, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/1/25 10:41, piaojun wrote:
>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>> situation:
>>>>
>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>> Quick questions:
>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>
>>> Afterwards, I can understand the issue your change log is describing better.
>>>
>>> Thanks,
>>> Changwei
>>>
>> Yes, I mean they are in the same transaction as below:
>>
>> ocfs2_remove_btree_range
>>    ocfs2_remove_extent // modify extent records
>>      ocfs2_truncate_log_append // modify truncate log
> 
> If so I think the transaction including operations on extent and truncate log won't be committed.
> And journal should already be aborted if interval transaction commit thread has been woken.
> So no metadata will be changed.
> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
> Are you sure this is the root cause of your problem?
> I feel a little strange for it.
> 
> Thanks,
> Changwei
> 
As you said, the transaction was not committed, but after a while the
bh of truncate log was committed in another transaction. I'm sure for
the cause and after applying this patch, the duplicate cluster problem
is gone. I have tested it a few month.

thanks,
Jun

>>
>> thanks,
>> Jun
>>
>>>>      truncate log at the same time, and hand them over to jbd2.
>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>      worker triggered by the next space reclaiming found the dirty bh of
>>>>      truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>      in __ocfs2_journal_access():
>>>>
>>>> ocfs2_truncate_log_worker
>>>>     ocfs2_flush_truncate_log
>>>>       __ocfs2_flush_truncate_log
>>>>         ocfs2_replay_truncate_records
>>>>           ocfs2_journal_access_di
>>>>             __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>
>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>      extent rec is still in error state, and unfortunately nobody will
>>>>      take care of it.
>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>      flush worker have given it back to globalalloc. That will cause
>>>>      duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>
>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>> of space reclaim.
>>>>
>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>
>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>> ---
>>>>    fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>    1 file changed, 35 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>> index 3630443..d769ca2 100644
>>>> --- a/fs/ocfs2/journal.c
>>>> +++ b/fs/ocfs2/journal.c
>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>    	/* we can safely remove this assertion after testing. */
>>>>    	if (!buffer_uptodate(bh)) {
>>>>    		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>> -		     (unsigned long long)bh->b_blocknr);
>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>
>>>>    		lock_buffer(bh);
>>>>    		/*
>>>> -		 * A previous attempt to write this buffer head failed.
>>>> -		 * Nothing we can do but to retry the write and hope for
>>>> -		 * the best.
>>>> +		 * We should not reuse the dirty bh directly due to the
>>>> +		 * following situation:
>>>> +		 *
>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>> +		 *    extent rec and truncate log at the same time, and
>>>> +		 *    hand them over to jbd2.
>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>> +		 *    storage link.
>>>> +		 * 3. After a while the storage link become normal.
>>>> +		 *    Truncate log flush worker triggered by the next
>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>> +		 *    in __ocfs2_journal_access():
>>>> +		 *
>>>> +		 *    ocfs2_truncate_log_worker
>>>> +		 *      ocfs2_flush_truncate_log
>>>> +		 *        __ocfs2_flush_truncate_log
>>>> +		 *          ocfs2_replay_truncate_records
>>>> +		 *            ocfs2_journal_access_di
>>>> +		 *              __ocfs2_journal_access
>>>> +		 *
>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>> +		 *    but the bh of extent rec is still in error state, and
>>>> +		 *    unfortunately nobody will take care of it.
>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>> +		 *    but truncate log flush worker have given it back to
>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>> +		 *    which could be identified by fsck.ocfs2.
>>>> +		 *
>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>> +		 * and consistency of space reclaim.
>>>>    		 */
>>>>    		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>> -			clear_buffer_write_io_error(bh);
>>>> -			set_buffer_uptodate(bh);
>>>> -		}
>>>> -
>>>> -		if (!buffer_uptodate(bh)) {
>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>> +				"buffer head failed\n");
>>>>    			unlock_buffer(bh);
>>>>    			return -EIO;
>>>>    		}
>>>>
>>> .
>>>
>>
> .
>
Changwei Ge Jan. 26, 2018, 1 a.m. UTC | #7
Hi Jun,
Good morning:-)

On 2018/1/25 20:45, piaojun wrote:
> Hi Changwei,
> 
> On 2018/1/25 20:30, Changwei Ge wrote:
>> Hi Jun,
>>
>> On 2018/1/25 20:18, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>> situation:
>>>>>
>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>> Quick questions:
>>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>>
>>>> Afterwards, I can understand the issue your change log is describing better.
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>> Yes, I mean they are in the same transaction as below:
>>>
>>> ocfs2_remove_btree_range
>>>     ocfs2_remove_extent // modify extent records
>>>       ocfs2_truncate_log_append // modify truncate log
>>
>> If so I think the transaction including operations on extent and truncate log won't be committed.
>> And journal should already be aborted if interval transaction commit thread has been woken.
>> So no metadata will be changed.
>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
>> Are you sure this is the root cause of your problem?
>> I feel a little strange for it.
>>
>> Thanks,
>> Changwei
>>
> As you said, the transaction was not committed, but after a while the
> bh of truncate log was committed in another transaction. I'm sure for
> the cause and after applying this patch, the duplicate cluster problem
> is gone. I have tested it a few month.

I think we are talking about two jbd2/transactions. right?
One is for moving clusters from extent to truncate log. Let's name it T1.
Anther is for declaiming clusters from truncate log and returning them back to global bitmap. Let's name it T2.

If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be aborted which means it can't work any more.
All following starting transaction and commit transaction will fail.

So, how can the T2 be committed while T1 fails?

Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.

Thanks,
Changwei    

> 
> thanks,
> Jun
> 
>>>
>>> thanks,
>>> Jun
>>>
>>>>>       truncate log at the same time, and hand them over to jbd2.
>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>       worker triggered by the next space reclaiming found the dirty bh of
>>>>>       truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>       in __ocfs2_journal_access():
>>>>>
>>>>> ocfs2_truncate_log_worker
>>>>>      ocfs2_flush_truncate_log
>>>>>        __ocfs2_flush_truncate_log
>>>>>          ocfs2_replay_truncate_records
>>>>>            ocfs2_journal_access_di
>>>>>              __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>>
>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>       extent rec is still in error state, and unfortunately nobody will
>>>>>       take care of it.
>>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>>       flush worker have given it back to globalalloc. That will cause
>>>>>       duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>>
>>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>>> of space reclaim.
>>>>>
>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>
>>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>> ---
>>>>>     fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>     1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>> index 3630443..d769ca2 100644
>>>>> --- a/fs/ocfs2/journal.c
>>>>> +++ b/fs/ocfs2/journal.c
>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>     	/* we can safely remove this assertion after testing. */
>>>>>     	if (!buffer_uptodate(bh)) {
>>>>>     		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>
>>>>>     		lock_buffer(bh);
>>>>>     		/*
>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>> -		 * the best.
>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>> +		 * following situation:
>>>>> +		 *
>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>> +		 *    hand them over to jbd2.
>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>> +		 *    storage link.
>>>>> +		 * 3. After a while the storage link become normal.
>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>>> +		 *    in __ocfs2_journal_access():
>>>>> +		 *
>>>>> +		 *    ocfs2_truncate_log_worker
>>>>> +		 *      ocfs2_flush_truncate_log
>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>> +		 *          ocfs2_replay_truncate_records
>>>>> +		 *            ocfs2_journal_access_di
>>>>> +		 *              __ocfs2_journal_access
>>>>> +		 *
>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>> +		 *    but the bh of extent rec is still in error state, and
>>>>> +		 *    unfortunately nobody will take care of it.
>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>> +		 *    but truncate log flush worker have given it back to
>>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>>> +		 *    which could be identified by fsck.ocfs2.
>>>>> +		 *
>>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>>> +		 * and consistency of space reclaim.
>>>>>     		 */
>>>>>     		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>> -			clear_buffer_write_io_error(bh);
>>>>> -			set_buffer_uptodate(bh);
>>>>> -		}
>>>>> -
>>>>> -		if (!buffer_uptodate(bh)) {
>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>> +				"buffer head failed\n");
>>>>>     			unlock_buffer(bh);
>>>>>     			return -EIO;
>>>>>     		}
>>>>>
>>>> .
>>>>
>>>
>> .
>>
>
piaojun Jan. 26, 2018, 1:45 a.m. UTC | #8
Hi Changwei,

On 2018/1/26 9:00, Changwei Ge wrote:
> Hi Jun,
> Good morning:-)
> 
> On 2018/1/25 20:45, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/25 20:30, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2018/1/25 20:18, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>> situation:
>>>>>>
>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>> Quick questions:
>>>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>>>
>>>>> Afterwards, I can understand the issue your change log is describing better.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>> Yes, I mean they are in the same transaction as below:
>>>>
>>>> ocfs2_remove_btree_range
>>>>     ocfs2_remove_extent // modify extent records
>>>>       ocfs2_truncate_log_append // modify truncate log
>>>
>>> If so I think the transaction including operations on extent and truncate log won't be committed.
>>> And journal should already be aborted if interval transaction commit thread has been woken.
>>> So no metadata will be changed.
>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
>>> Are you sure this is the root cause of your problem?
>>> I feel a little strange for it.
>>>
>>> Thanks,
>>> Changwei
>>>
>> As you said, the transaction was not committed, but after a while the
>> bh of truncate log was committed in another transaction. I'm sure for
>> the cause and after applying this patch, the duplicate cluster problem
>> is gone. I have tested it a few month.
> 
> I think we are talking about two jbd2/transactions. right?
yes, two transactions involved.

> One is for moving clusters from extent to truncate log. Let's name it T1.
> Anther is for declaiming clusters from truncate log and returning them back to global bitmap. Let's name it T2.
> 
> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be aborted which means it can't work any more.
> All following starting transaction and commit transaction will fail.
> 
> So, how can the T2 be committed while T1 fails?
>From my testing jbd2 won't be aborted when encounter IO error, and I
print the bh->b_state = 0x44828 = 1000100100000101000. That means the
bh has been submitted but write IO, and still in jbd2 according to
'bh_state_bits' and 'jbd_state_bits'.

>
> Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.
> 
Currently this problem needs user umount and mount again to recover,
and I'm glad to hear your advice.

thanks,
Jun

> Thanks,
> Changwei    
> 

>>
>> thanks,
>> Jun
>>
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>>>>       truncate log at the same time, and hand them over to jbd2.
>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>>       worker triggered by the next space reclaiming found the dirty bh of
>>>>>>       truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>       in __ocfs2_journal_access():
>>>>>>
>>>>>> ocfs2_truncate_log_worker
>>>>>>      ocfs2_flush_truncate_log
>>>>>>        __ocfs2_flush_truncate_log
>>>>>>          ocfs2_replay_truncate_records
>>>>>>            ocfs2_journal_access_di
>>>>>>              __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>>>
>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>>       extent rec is still in error state, and unfortunately nobody will
>>>>>>       take care of it.
>>>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>>>       flush worker have given it back to globalalloc. That will cause
>>>>>>       duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>>>
>>>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>>>> of space reclaim.
>>>>>>
>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>
>>>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>> ---
>>>>>>     fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>     1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>> index 3630443..d769ca2 100644
>>>>>> --- a/fs/ocfs2/journal.c
>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>     	/* we can safely remove this assertion after testing. */
>>>>>>     	if (!buffer_uptodate(bh)) {
>>>>>>     		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>
>>>>>>     		lock_buffer(bh);
>>>>>>     		/*
>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>> -		 * the best.
>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>> +		 * following situation:
>>>>>> +		 *
>>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>>> +		 *    hand them over to jbd2.
>>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>>> +		 *    storage link.
>>>>>> +		 * 3. After a while the storage link become normal.
>>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>> +		 *    in __ocfs2_journal_access():
>>>>>> +		 *
>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>> +		 *            ocfs2_journal_access_di
>>>>>> +		 *              __ocfs2_journal_access
>>>>>> +		 *
>>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>>> +		 *    but the bh of extent rec is still in error state, and
>>>>>> +		 *    unfortunately nobody will take care of it.
>>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>>> +		 *    but truncate log flush worker have given it back to
>>>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>>>> +		 *    which could be identified by fsck.ocfs2.
>>>>>> +		 *
>>>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>>>> +		 * and consistency of space reclaim.
>>>>>>     		 */
>>>>>>     		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>> -			set_buffer_uptodate(bh);
>>>>>> -		}
>>>>>> -
>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>>> +				"buffer head failed\n");
>>>>>>     			unlock_buffer(bh);
>>>>>>     			return -EIO;
>>>>>>     		}
>>>>>>
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
> .
>
Changwei Ge Jan. 26, 2018, 2:03 a.m. UTC | #9
On 2018/1/26 9:45, piaojun wrote:
> Hi Changwei,
> 
> On 2018/1/26 9:00, Changwei Ge wrote:
>> Hi Jun,
>> Good morning:-)
>>
>> On 2018/1/25 20:45, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>> Hi Changwei,
>>>>>
>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>> Hi Jun,
>>>>>>
>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>>> situation:
>>>>>>>
>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>> Quick questions:
>>>>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>>>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>>>>
>>>>>> Afterwards, I can understand the issue your change log is describing better.
>>>>>>
>>>>>> Thanks,
>>>>>> Changwei
>>>>>>
>>>>> Yes, I mean they are in the same transaction as below:
>>>>>
>>>>> ocfs2_remove_btree_range
>>>>>      ocfs2_remove_extent // modify extent records
>>>>>        ocfs2_truncate_log_append // modify truncate log
>>>>
>>>> If so I think the transaction including operations on extent and truncate log won't be committed.
>>>> And journal should already be aborted if interval transaction commit thread has been woken.
>>>> So no metadata will be changed.
>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
>>>> Are you sure this is the root cause of your problem?
>>>> I feel a little strange for it.
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>> As you said, the transaction was not committed, but after a while the
>>> bh of truncate log was committed in another transaction. I'm sure for
>>> the cause and after applying this patch, the duplicate cluster problem
>>> is gone. I have tested it a few month.
>>
>> I think we are talking about two jbd2/transactions. right?
> yes, two transactions involved.
> 
>> One is for moving clusters from extent to truncate log. Let's name it T1.
>> Anther is for declaiming clusters from truncate log and returning them back to global bitmap. Let's name it T2.
>>
>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be aborted which means it can't work any more.
>> All following starting transaction and commit transaction will fail.
>>
>> So, how can the T2 be committed while T1 fails?
>  From my testing jbd2 won't be aborted when encounter IO error, and I
> print the bh->b_state = 0x44828 = 1000100100000101000. That means the
> bh has been submitted but write IO, and still in jbd2 according to
> 'bh_state_bits' and 'jbd_state_bits'.

Um... Strange :(
T1 fails to be committed but journal is still normal?
The T2 is even committed successfully?

I add Jan to our discussion.
Hopefully, he can help clear our doubts. :)

> 
>>
>> Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.
>>
> Currently this problem needs user umount and mount again to recover,
> and I'm glad to hear your advice.
> 

I think it's better to do so for now.
Moreover, ocfs2 will fence the problematic node out.

Thanks,
Changwei

> thanks,
> Jun
> 
>> Thanks,
>> Changwei
>>
> 
>>>
>>> thanks,
>>> Jun
>>>
>>>>>
>>>>> thanks,
>>>>> Jun
>>>>>
>>>>>>>        truncate log at the same time, and hand them over to jbd2.
>>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>>>        worker triggered by the next space reclaiming found the dirty bh of
>>>>>>>        truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>        in __ocfs2_journal_access():
>>>>>>>
>>>>>>> ocfs2_truncate_log_worker
>>>>>>>       ocfs2_flush_truncate_log
>>>>>>>         __ocfs2_flush_truncate_log
>>>>>>>           ocfs2_replay_truncate_records
>>>>>>>             ocfs2_journal_access_di
>>>>>>>               __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>>>>
>>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>>>        extent rec is still in error state, and unfortunately nobody will
>>>>>>>        take care of it.
>>>>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>>>>        flush worker have given it back to globalalloc. That will cause
>>>>>>>        duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>>>>
>>>>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>>>>> of space reclaim.
>>>>>>>
>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>
>>>>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>> ---
>>>>>>>      fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>>      1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>> index 3630443..d769ca2 100644
>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>>      	/* we can safely remove this assertion after testing. */
>>>>>>>      	if (!buffer_uptodate(bh)) {
>>>>>>>      		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>
>>>>>>>      		lock_buffer(bh);
>>>>>>>      		/*
>>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>>> -		 * the best.
>>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>>> +		 * following situation:
>>>>>>> +		 *
>>>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>>>> +		 *    hand them over to jbd2.
>>>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>>>> +		 *    storage link.
>>>>>>> +		 * 3. After a while the storage link become normal.
>>>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>> +		 *    in __ocfs2_journal_access():
>>>>>>> +		 *
>>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>>> +		 *            ocfs2_journal_access_di
>>>>>>> +		 *              __ocfs2_journal_access
>>>>>>> +		 *
>>>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>>>> +		 *    but the bh of extent rec is still in error state, and
>>>>>>> +		 *    unfortunately nobody will take care of it.
>>>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>>>> +		 *    but truncate log flush worker have given it back to
>>>>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>>>>> +		 *    which could be identified by fsck.ocfs2.
>>>>>>> +		 *
>>>>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>>>>> +		 * and consistency of space reclaim.
>>>>>>>      		 */
>>>>>>>      		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>>> -			set_buffer_uptodate(bh);
>>>>>>> -		}
>>>>>>> -
>>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>>>> +				"buffer head failed\n");
>>>>>>>      			unlock_buffer(bh);
>>>>>>>      			return -EIO;
>>>>>>>      		}
>>>>>>>
>>>>>> .
>>>>>>
>>>>>
>>>> .
>>>>
>>>
>> .
>>
>
Gang He Jan. 26, 2018, 2:16 a.m. UTC | #10
Hi Jun,


>>> 
> Hi Gang,
> 
> Filesystem won't become readonly and journal remains normal, 
How does the user aware this kind of issue happen?  watch the kernel message? but some users do not like watch the kernel message except 
there is serious problem (e.g. crash).  
so this problem needs user umount and mount again to recover. 
If the user skip this error prints in the kernel message, there will be further damage to the file system?
If yes, we should let the user know this problem more obviously.
And I'm thinking
> about if we should set readonly to notice user.
> 
> thanks,
> Jun
> 
> On 2018/1/25 16:40, Gang He wrote:
>> Hi Jun,
>> 
>> If we return -EIO here, what is the consequence?
>> make the journal aborted and file system will become read-only?
>> 
>> Thanks
>> Gang
>> 
>> 
>>>>>
>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>> situation:
>>>
>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>    truncate log at the same time, and hand them over to jbd2.
>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>> 3. After a while the storage link become normal. Truncate log flush
>>>    worker triggered by the next space reclaiming found the dirty bh of
>>>    truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>    in __ocfs2_journal_access():
>>>
>>> ocfs2_truncate_log_worker
>>>   ocfs2_flush_truncate_log
>>>     __ocfs2_flush_truncate_log
>>>       ocfs2_replay_truncate_records
>>>         ocfs2_journal_access_di
>>>           __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
>>> uptodata.
>>>
>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>    extent rec is still in error state, and unfortunately nobody will
>>>    take care of it.
>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>    flush worker have given it back to globalalloc. That will cause
>>>    duplicate cluster problem which could be identified by fsck.ocfs2.
>>>
>>> So we should return -EIO in case of ruining atomicity and consistency
>>> of space reclaim.
>>>
>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>>> __ocfs2_journal_access")
>>>
>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>> ---
>>>  fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>> index 3630443..d769ca2 100644
>>> --- a/fs/ocfs2/journal.c
>>> +++ b/fs/ocfs2/journal.c
>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>  	/* we can safely remove this assertion after testing. */
>>>  	if (!buffer_uptodate(bh)) {
>>>  		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>> -		     (unsigned long long)bh->b_blocknr);
>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>
>>>  		lock_buffer(bh);
>>>  		/*
>>> -		 * A previous attempt to write this buffer head failed.
>>> -		 * Nothing we can do but to retry the write and hope for
>>> -		 * the best.
>>> +		 * We should not reuse the dirty bh directly due to the
>>> +		 * following situation:
>>> +		 *
>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>> +		 *    extent rec and truncate log at the same time, and
>>> +		 *    hand them over to jbd2.
>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>> +		 *    storage link.
>>> +		 * 3. After a while the storage link become normal.
>>> +		 *    Truncate log flush worker triggered by the next
>>> +		 *    space reclaiming found the dirty bh of truncate log
>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>> +		 *    in __ocfs2_journal_access():
>>> +		 *
>>> +		 *    ocfs2_truncate_log_worker
>>> +		 *      ocfs2_flush_truncate_log
>>> +		 *        __ocfs2_flush_truncate_log
>>> +		 *          ocfs2_replay_truncate_records
>>> +		 *            ocfs2_journal_access_di
>>> +		 *              __ocfs2_journal_access
>>> +		 *
>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>> +		 *    but the bh of extent rec is still in error state, and
>>> +		 *    unfortunately nobody will take care of it.
>>> +		 * 5. At last the space of extent rec was not reduced,
>>> +		 *    but truncate log flush worker have given it back to
>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>> +		 *    which could be identified by fsck.ocfs2.
>>> +		 *
>>> +		 * So we should return -EIO in case of ruining atomicity
>>> +		 * and consistency of space reclaim.
>>>  		 */
>>>  		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>> -			clear_buffer_write_io_error(bh);
>>> -			set_buffer_uptodate(bh);
>>> -		}
>>> -
>>> -		if (!buffer_uptodate(bh)) {
>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>> +				"buffer head failed\n");
>>>  			unlock_buffer(bh);
>>>  			return -EIO;
>>>  		}
>>> -- 
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel@oss.oracle.com 
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel 
>> .
>>
piaojun Jan. 26, 2018, 2:31 a.m. UTC | #11
Hi Gang,

On 2018/1/26 10:16, Gang He wrote:
> Hi Jun,
> 
> 
>>>>
>> Hi Gang,
>>
>> Filesystem won't become readonly and journal remains normal, 
> How does the user aware this kind of issue happen?  watch the kernel message? but some users do not like watch the kernel message except 
> there is serious problem (e.g. crash).  
Sadlly, user could only identify this problem by kernel message.

> so this problem needs user umount and mount again to recover. 
> If the user skip this error prints in the kernel message, there will be further damage to the file system?
> If yes, we should let the user know this problem more obviously.
> And I'm thinking
This error won't cause further damage, but will make the operations
accessing truncate log bh failed, such as 'rm', 'truncate'. Before the
patch(acf8fdbe6afb), we set BUG here, but I think it's a little rude.
Perhaps we could set read-only or set jbd2 aborted.

thanks,
Jun

>> about if we should set readonly to notice user.
>>
>> thanks,
>> Jun
>>
>> On 2018/1/25 16:40, Gang He wrote:
>>> Hi Jun,
>>>
>>> If we return -EIO here, what is the consequence?
>>> make the journal aborted and file system will become read-only?
>>>
>>> Thanks
>>> Gang
>>>
>>>
>>>>>>
>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>> situation:
>>>>
>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>    truncate log at the same time, and hand them over to jbd2.
>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>    worker triggered by the next space reclaiming found the dirty bh of
>>>>    truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>    in __ocfs2_journal_access():
>>>>
>>>> ocfs2_truncate_log_worker
>>>>   ocfs2_flush_truncate_log
>>>>     __ocfs2_flush_truncate_log
>>>>       ocfs2_replay_truncate_records
>>>>         ocfs2_journal_access_di
>>>>           __ocfs2_journal_access // here we clear io_error and set 'tl_bh' 
>>>> uptodata.
>>>>
>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>    extent rec is still in error state, and unfortunately nobody will
>>>>    take care of it.
>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>    flush worker have given it back to globalalloc. That will cause
>>>>    duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>
>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>> of space reclaim.
>>>>
>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in 
>>>> __ocfs2_journal_access")
>>>>
>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>> ---
>>>>  fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>> index 3630443..d769ca2 100644
>>>> --- a/fs/ocfs2/journal.c
>>>> +++ b/fs/ocfs2/journal.c
>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>  	/* we can safely remove this assertion after testing. */
>>>>  	if (!buffer_uptodate(bh)) {
>>>>  		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>> -		     (unsigned long long)bh->b_blocknr);
>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>
>>>>  		lock_buffer(bh);
>>>>  		/*
>>>> -		 * A previous attempt to write this buffer head failed.
>>>> -		 * Nothing we can do but to retry the write and hope for
>>>> -		 * the best.
>>>> +		 * We should not reuse the dirty bh directly due to the
>>>> +		 * following situation:
>>>> +		 *
>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>> +		 *    extent rec and truncate log at the same time, and
>>>> +		 *    hand them over to jbd2.
>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>> +		 *    storage link.
>>>> +		 * 3. After a while the storage link become normal.
>>>> +		 *    Truncate log flush worker triggered by the next
>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>> +		 *    in __ocfs2_journal_access():
>>>> +		 *
>>>> +		 *    ocfs2_truncate_log_worker
>>>> +		 *      ocfs2_flush_truncate_log
>>>> +		 *        __ocfs2_flush_truncate_log
>>>> +		 *          ocfs2_replay_truncate_records
>>>> +		 *            ocfs2_journal_access_di
>>>> +		 *              __ocfs2_journal_access
>>>> +		 *
>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>> +		 *    but the bh of extent rec is still in error state, and
>>>> +		 *    unfortunately nobody will take care of it.
>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>> +		 *    but truncate log flush worker have given it back to
>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>> +		 *    which could be identified by fsck.ocfs2.
>>>> +		 *
>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>> +		 * and consistency of space reclaim.
>>>>  		 */
>>>>  		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>> -			clear_buffer_write_io_error(bh);
>>>> -			set_buffer_uptodate(bh);
>>>> -		}
>>>> -
>>>> -		if (!buffer_uptodate(bh)) {
>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>> +				"buffer head failed\n");
>>>>  			unlock_buffer(bh);
>>>>  			return -EIO;
>>>>  		}
>>>> -- 
>>>>
>>>> _______________________________________________
>>>> Ocfs2-devel mailing list
>>>> Ocfs2-devel@oss.oracle.com 
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel 
>>> .
>>>
> .
>
piaojun Jan. 27, 2018, 3:51 a.m. UTC | #12
Hi Jan and Changwei,

I will describle the scenario again as below:

1. The bhs of truncate log and extent rec are submitted to jbd2 area
   successfully.
2. Then write-back thread of device help flush the bhs to disk but
   encounter write error due to abnormal storage link.
3. After a while the storage link become normal. Truncate log flush
   worker triggered by the next space reclaiming found the dirty bh of
   truncate log and clear its 'BH_Write_EIO' and then set it uptodate
   in __ocfs2_journal_access().
4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
   extent rec is still in error state, and unfortunately nobody will
   take care of it.
5. At last the space of extent rec was not reduced, but truncate log
   flush worker have given it back to globalalloc. That will cause
   duplicate cluster problem which could be identified by fsck.ocfs2.

I suggest ocfs2 should handle this problem.

thanks,
Jun

On 2018/1/26 10:03, Changwei Ge wrote:
> On 2018/1/26 9:45, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/1/26 9:00, Changwei Ge wrote:
>>> Hi Jun,
>>> Good morning:-)
>>>
>>> On 2018/1/25 20:45, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>> Hi Jun,
>>>>>>>
>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>>>> situation:
>>>>>>>>
>>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>>> Quick questions:
>>>>>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>>>>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>>>>>
>>>>>>> Afterwards, I can understand the issue your change log is describing better.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Changwei
>>>>>>>
>>>>>> Yes, I mean they are in the same transaction as below:
>>>>>>
>>>>>> ocfs2_remove_btree_range
>>>>>>      ocfs2_remove_extent // modify extent records
>>>>>>        ocfs2_truncate_log_append // modify truncate log
>>>>>
>>>>> If so I think the transaction including operations on extent and truncate log won't be committed.
>>>>> And journal should already be aborted if interval transaction commit thread has been woken.
>>>>> So no metadata will be changed.
>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
>>>>> Are you sure this is the root cause of your problem?
>>>>> I feel a little strange for it.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>> As you said, the transaction was not committed, but after a while the
>>>> bh of truncate log was committed in another transaction. I'm sure for
>>>> the cause and after applying this patch, the duplicate cluster problem
>>>> is gone. I have tested it a few month.
>>>
>>> I think we are talking about two jbd2/transactions. right?
>> yes, two transactions involved.
>>
>>> One is for moving clusters from extent to truncate log. Let's name it T1.
>>> Anther is for declaiming clusters from truncate log and returning them back to global bitmap. Let's name it T2.
>>>
>>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be aborted which means it can't work any more.
>>> All following starting transaction and commit transaction will fail.
>>>
>>> So, how can the T2 be committed while T1 fails?
>>  From my testing jbd2 won't be aborted when encounter IO error, and I
>> print the bh->b_state = 0x44828 = 1000100100000101000. That means the
>> bh has been submitted but write IO, and still in jbd2 according to
>> 'bh_state_bits' and 'jbd_state_bits'.
> 
> Um... Strange :(
> T1 fails to be committed but journal is still normal?
> The T2 is even committed successfully?
> 
> I add Jan to our discussion.
> Hopefully, he can help clear our doubts. :)
> 
>>
>>>
>>> Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.
>>>
>> Currently this problem needs user umount and mount again to recover,
>> and I'm glad to hear your advice.
>>
> 
> I think it's better to do so for now.
> Moreover, ocfs2 will fence the problematic node out.
> 
> Thanks,
> Changwei
> 
>> thanks,
>> Jun
>>
>>> Thanks,
>>> Changwei
>>>
>>
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>>>>
>>>>>> thanks,
>>>>>> Jun
>>>>>>
>>>>>>>>        truncate log at the same time, and hand them over to jbd2.
>>>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>>>>        worker triggered by the next space reclaiming found the dirty bh of
>>>>>>>>        truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>>        in __ocfs2_journal_access():
>>>>>>>>
>>>>>>>> ocfs2_truncate_log_worker
>>>>>>>>       ocfs2_flush_truncate_log
>>>>>>>>         __ocfs2_flush_truncate_log
>>>>>>>>           ocfs2_replay_truncate_records
>>>>>>>>             ocfs2_journal_access_di
>>>>>>>>               __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>>>>>
>>>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>>>>        extent rec is still in error state, and unfortunately nobody will
>>>>>>>>        take care of it.
>>>>>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>>>>>        flush worker have given it back to globalalloc. That will cause
>>>>>>>>        duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>>>>>
>>>>>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>>>>>> of space reclaim.
>>>>>>>>
>>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>>
>>>>>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>>> ---
>>>>>>>>      fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>>>      1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>>> index 3630443..d769ca2 100644
>>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>>>      	/* we can safely remove this assertion after testing. */
>>>>>>>>      	if (!buffer_uptodate(bh)) {
>>>>>>>>      		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>>
>>>>>>>>      		lock_buffer(bh);
>>>>>>>>      		/*
>>>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>>>> -		 * the best.
>>>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>>>> +		 * following situation:
>>>>>>>> +		 *
>>>>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>>>>> +		 *    hand them over to jbd2.
>>>>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>>>>> +		 *    storage link.
>>>>>>>> +		 * 3. After a while the storage link become normal.
>>>>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>>>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>> +		 *    in __ocfs2_journal_access():
>>>>>>>> +		 *
>>>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>>>> +		 *            ocfs2_journal_access_di
>>>>>>>> +		 *              __ocfs2_journal_access
>>>>>>>> +		 *
>>>>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>>>>> +		 *    but the bh of extent rec is still in error state, and
>>>>>>>> +		 *    unfortunately nobody will take care of it.
>>>>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>>>>> +		 *    but truncate log flush worker have given it back to
>>>>>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>>>>>> +		 *    which could be identified by fsck.ocfs2.
>>>>>>>> +		 *
>>>>>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>>>>>> +		 * and consistency of space reclaim.
>>>>>>>>      		 */
>>>>>>>>      		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>>>> -			set_buffer_uptodate(bh);
>>>>>>>> -		}
>>>>>>>> -
>>>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>>>>> +				"buffer head failed\n");
>>>>>>>>      			unlock_buffer(bh);
>>>>>>>>      			return -EIO;
>>>>>>>>      		}
>>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
> .
>
Changwei Ge Jan. 27, 2018, 5:17 a.m. UTC | #13
Hi Jun,

On 2018/1/27 11:52, piaojun wrote:
> Hi Jan and Changwei,
> 
> I will describle the scenario again as below:
> 
> 1. The bhs of truncate log and extent rec are submitted to jbd2 area
>     successfully.
> 2. Then write-back thread of device help flush the bhs to disk but
>     encounter write error due to abnormal storage link.

Now, your problem description makes sense.
It seems you have withdrawn your last version of patch from -mm tree.
I will look at your next version.

Thanks,
Changwei

> 3. After a while the storage link become normal. Truncate log flush
>     worker triggered by the next space reclaiming found the dirty bh of
>     truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>     in __ocfs2_journal_access().
> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>     extent rec is still in error state, and unfortunately nobody will
>     take care of it.
> 5. At last the space of extent rec was not reduced, but truncate log
>     flush worker have given it back to globalalloc. That will cause
>     duplicate cluster problem which could be identified by fsck.ocfs2.
> 
> I suggest ocfs2 should handle this problem.
> 
> thanks,
> Jun
> 
> On 2018/1/26 10:03, Changwei Ge wrote:
>> On 2018/1/26 9:45, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2018/1/26 9:00, Changwei Ge wrote:
>>>> Hi Jun,
>>>> Good morning:-)
>>>>
>>>> On 2018/1/25 20:45, piaojun wrote:
>>>>> Hi Changwei,
>>>>>
>>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>>> Hi Jun,
>>>>>>
>>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>>> Hi Changwei,
>>>>>>>
>>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>>> Hi Jun,
>>>>>>>>
>>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>>>>> situation:
>>>>>>>>>
>>>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>>>> Quick questions:
>>>>>>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>>>>>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>>>>>>
>>>>>>>> Afterwards, I can understand the issue your change log is describing better.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Changwei
>>>>>>>>
>>>>>>> Yes, I mean they are in the same transaction as below:
>>>>>>>
>>>>>>> ocfs2_remove_btree_range
>>>>>>>       ocfs2_remove_extent // modify extent records
>>>>>>>         ocfs2_truncate_log_append // modify truncate log
>>>>>>
>>>>>> If so I think the transaction including operations on extent and truncate log won't be committed.
>>>>>> And journal should already be aborted if interval transaction commit thread has been woken.
>>>>>> So no metadata will be changed.
>>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
>>>>>> Are you sure this is the root cause of your problem?
>>>>>> I feel a little strange for it.
>>>>>>
>>>>>> Thanks,
>>>>>> Changwei
>>>>>>
>>>>> As you said, the transaction was not committed, but after a while the
>>>>> bh of truncate log was committed in another transaction. I'm sure for
>>>>> the cause and after applying this patch, the duplicate cluster problem
>>>>> is gone. I have tested it a few month.
>>>>
>>>> I think we are talking about two jbd2/transactions. right?
>>> yes, two transactions involved.
>>>
>>>> One is for moving clusters from extent to truncate log. Let's name it T1.
>>>> Anther is for declaiming clusters from truncate log and returning them back to global bitmap. Let's name it T2.
>>>>
>>>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be aborted which means it can't work any more.
>>>> All following starting transaction and commit transaction will fail.
>>>>
>>>> So, how can the T2 be committed while T1 fails?
>>>   From my testing jbd2 won't be aborted when encounter IO error, and I
>>> print the bh->b_state = 0x44828 = 1000100100000101000. That means the
>>> bh has been submitted but write IO, and still in jbd2 according to
>>> 'bh_state_bits' and 'jbd_state_bits'.
>>
>> Um... Strange :(
>> T1 fails to be committed but journal is still normal?
>> The T2 is even committed successfully?
>>
>> I add Jan to our discussion.
>> Hopefully, he can help clear our doubts. :)
>>
>>>
>>>>
>>>> Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.
>>>>
>>> Currently this problem needs user umount and mount again to recover,
>>> and I'm glad to hear your advice.
>>>
>>
>> I think it's better to do so for now.
>> Moreover, ocfs2 will fence the problematic node out.
>>
>> Thanks,
>> Changwei
>>
>>> thanks,
>>> Jun
>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>
>>>>>
>>>>> thanks,
>>>>> Jun
>>>>>
>>>>>>>
>>>>>>> thanks,
>>>>>>> Jun
>>>>>>>
>>>>>>>>>         truncate log at the same time, and hand them over to jbd2.
>>>>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>>>>>         worker triggered by the next space reclaiming found the dirty bh of
>>>>>>>>>         truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>>>         in __ocfs2_journal_access():
>>>>>>>>>
>>>>>>>>> ocfs2_truncate_log_worker
>>>>>>>>>        ocfs2_flush_truncate_log
>>>>>>>>>          __ocfs2_flush_truncate_log
>>>>>>>>>            ocfs2_replay_truncate_records
>>>>>>>>>              ocfs2_journal_access_di
>>>>>>>>>                __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>>>>>>
>>>>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>>>>>         extent rec is still in error state, and unfortunately nobody will
>>>>>>>>>         take care of it.
>>>>>>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>>>>>>         flush worker have given it back to globalalloc. That will cause
>>>>>>>>>         duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>>>>>>
>>>>>>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>>>>>>> of space reclaim.
>>>>>>>>>
>>>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>>>
>>>>>>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>>>> ---
>>>>>>>>>       fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>>>>       1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>>>> index 3630443..d769ca2 100644
>>>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>>>>       	/* we can safely remove this assertion after testing. */
>>>>>>>>>       	if (!buffer_uptodate(bh)) {
>>>>>>>>>       		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>>>
>>>>>>>>>       		lock_buffer(bh);
>>>>>>>>>       		/*
>>>>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>>>>> -		 * the best.
>>>>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>>>>> +		 * following situation:
>>>>>>>>> +		 *
>>>>>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>>>>>> +		 *    hand them over to jbd2.
>>>>>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>>>>>> +		 *    storage link.
>>>>>>>>> +		 * 3. After a while the storage link become normal.
>>>>>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>>>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>>>>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>>> +		 *    in __ocfs2_journal_access():
>>>>>>>>> +		 *
>>>>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>>>>> +		 *            ocfs2_journal_access_di
>>>>>>>>> +		 *              __ocfs2_journal_access
>>>>>>>>> +		 *
>>>>>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>>>>>> +		 *    but the bh of extent rec is still in error state, and
>>>>>>>>> +		 *    unfortunately nobody will take care of it.
>>>>>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>>>>>> +		 *    but truncate log flush worker have given it back to
>>>>>>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>>>>>>> +		 *    which could be identified by fsck.ocfs2.
>>>>>>>>> +		 *
>>>>>>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>>>>>>> +		 * and consistency of space reclaim.
>>>>>>>>>       		 */
>>>>>>>>>       		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>>>>> -			set_buffer_uptodate(bh);
>>>>>>>>> -		}
>>>>>>>>> -
>>>>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>>>>>> +				"buffer head failed\n");
>>>>>>>>>       			unlock_buffer(bh);
>>>>>>>>>       			return -EIO;
>>>>>>>>>       		}
>>>>>>>>>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>> .
>>>>>>
>>>>>
>>>> .
>>>>
>>>
>> .
>>
>
piaojun Jan. 27, 2018, 6:36 a.m. UTC | #14
Hi Changwei,

Thanks for quick repling, Gang and I are thinking about how to notice
user to recover this problem, and later I will post patch v2.

thanks
Jun

On 2018/1/27 13:17, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/1/27 11:52, piaojun wrote:
>> Hi Jan and Changwei,
>>
>> I will describle the scenario again as below:
>>
>> 1. The bhs of truncate log and extent rec are submitted to jbd2 area
>>     successfully.
>> 2. Then write-back thread of device help flush the bhs to disk but
>>     encounter write error due to abnormal storage link.
> 
> Now, your problem description makes sense.
> It seems you have withdrawn your last version of patch from -mm tree.
> I will look at your next version.
> 
> Thanks,
> Changwei
> 
>> 3. After a while the storage link become normal. Truncate log flush
>>     worker triggered by the next space reclaiming found the dirty bh of
>>     truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>     in __ocfs2_journal_access().
>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>     extent rec is still in error state, and unfortunately nobody will
>>     take care of it.
>> 5. At last the space of extent rec was not reduced, but truncate log
>>     flush worker have given it back to globalalloc. That will cause
>>     duplicate cluster problem which could be identified by fsck.ocfs2.
>>
>> I suggest ocfs2 should handle this problem.
>>
>> thanks,
>> Jun
>>
>> On 2018/1/26 10:03, Changwei Ge wrote:
>>> On 2018/1/26 9:45, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2018/1/26 9:00, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>> Good morning:-)
>>>>>
>>>>> On 2018/1/25 20:45, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 2018/1/25 20:30, Changwei Ge wrote:
>>>>>>> Hi Jun,
>>>>>>>
>>>>>>> On 2018/1/25 20:18, piaojun wrote:
>>>>>>>> Hi Changwei,
>>>>>>>>
>>>>>>>> On 2018/1/25 19:59, Changwei Ge wrote:
>>>>>>>>> Hi Jun,
>>>>>>>>>
>>>>>>>>> On 2018/1/25 10:41, piaojun wrote:
>>>>>>>>>> We should not reuse the dirty bh in jbd2 directly due to the following
>>>>>>>>>> situation:
>>>>>>>>>>
>>>>>>>>>> 1. When removing extent rec, we will dirty the bhs of extent rec and
>>>>>>>>> Quick questions:
>>>>>>>>> Do you mean current code puts modifying extent records belonging to a certain file and modifying truncate log into the same transaction?
>>>>>>>>> If so, forgive me since I didn't figure it out. Could you point out in your following sequence diagram?
>>>>>>>>>
>>>>>>>>> Afterwards, I can understand the issue your change log is describing better.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Changwei
>>>>>>>>>
>>>>>>>> Yes, I mean they are in the same transaction as below:
>>>>>>>>
>>>>>>>> ocfs2_remove_btree_range
>>>>>>>>       ocfs2_remove_extent // modify extent records
>>>>>>>>         ocfs2_truncate_log_append // modify truncate log
>>>>>>>
>>>>>>> If so I think the transaction including operations on extent and truncate log won't be committed.
>>>>>>> And journal should already be aborted if interval transaction commit thread has been woken.
>>>>>>> So no metadata will be changed.
>>>>>>> And later, ocfs2_truncate_log_worker shouldn't see any inode on truncate log.
>>>>>>> Are you sure this is the root cause of your problem?
>>>>>>> I feel a little strange for it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Changwei
>>>>>>>
>>>>>> As you said, the transaction was not committed, but after a while the
>>>>>> bh of truncate log was committed in another transaction. I'm sure for
>>>>>> the cause and after applying this patch, the duplicate cluster problem
>>>>>> is gone. I have tested it a few month.
>>>>>
>>>>> I think we are talking about two jbd2/transactions. right?
>>>> yes, two transactions involved.
>>>>
>>>>> One is for moving clusters from extent to truncate log. Let's name it T1.
>>>>> Anther is for declaiming clusters from truncate log and returning them back to global bitmap. Let's name it T2.
>>>>>
>>>>> If jbd2 fails to commit T1 due to an IO error, the whole jbd2/journal will be aborted which means it can't work any more.
>>>>> All following starting transaction and commit transaction will fail.
>>>>>
>>>>> So, how can the T2 be committed while T1 fails?
>>>>   From my testing jbd2 won't be aborted when encounter IO error, and I
>>>> print the bh->b_state = 0x44828 = 1000100100000101000. That means the
>>>> bh has been submitted but write IO, and still in jbd2 according to
>>>> 'bh_state_bits' and 'jbd_state_bits'.
>>>
>>> Um... Strange :(
>>> T1 fails to be committed but journal is still normal?
>>> The T2 is even committed successfully?
>>>
>>> I add Jan to our discussion.
>>> Hopefully, he can help clear our doubts. :)
>>>
>>>>
>>>>>
>>>>> Otherwise, did you ever try to recover jbd2/journal? If so, I think your patch here is not fit for mainline yet.
>>>>>
>>>> Currently this problem needs user umount and mount again to recover,
>>>> and I'm glad to hear your advice.
>>>>
>>>
>>> I think it's better to do so for now.
>>> Moreover, ocfs2 will fence the problematic node out.
>>>
>>> Thanks,
>>> Changwei
>>>
>>>> thanks,
>>>> Jun
>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>>
>>>>>>
>>>>>> thanks,
>>>>>> Jun
>>>>>>
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> Jun
>>>>>>>>
>>>>>>>>>>         truncate log at the same time, and hand them over to jbd2.
>>>>>>>>>> 2. The bhs are not flushed to disk due to abnormal storage link.
>>>>>>>>>> 3. After a while the storage link become normal. Truncate log flush
>>>>>>>>>>         worker triggered by the next space reclaiming found the dirty bh of
>>>>>>>>>>         truncate log and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>>>>         in __ocfs2_journal_access():
>>>>>>>>>>
>>>>>>>>>> ocfs2_truncate_log_worker
>>>>>>>>>>        ocfs2_flush_truncate_log
>>>>>>>>>>          __ocfs2_flush_truncate_log
>>>>>>>>>>            ocfs2_replay_truncate_records
>>>>>>>>>>              ocfs2_journal_access_di
>>>>>>>>>>                __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
>>>>>>>>>>
>>>>>>>>>> 4. Then jbd2 will flush the bh of truncate log to disk, but the bh of
>>>>>>>>>>         extent rec is still in error state, and unfortunately nobody will
>>>>>>>>>>         take care of it.
>>>>>>>>>> 5. At last the space of extent rec was not reduced, but truncate log
>>>>>>>>>>         flush worker have given it back to globalalloc. That will cause
>>>>>>>>>>         duplicate cluster problem which could be identified by fsck.ocfs2.
>>>>>>>>>>
>>>>>>>>>> So we should return -EIO in case of ruining atomicity and consistency
>>>>>>>>>> of space reclaim.
>>>>>>>>>>
>>>>>>>>>> Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jun Piao <piaojun@huawei.com>
>>>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>>>>> ---
>>>>>>>>>>       fs/ocfs2/journal.c | 45 +++++++++++++++++++++++++++++++++++----------
>>>>>>>>>>       1 file changed, 35 insertions(+), 10 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>>>>>>>> index 3630443..d769ca2 100644
>>>>>>>>>> --- a/fs/ocfs2/journal.c
>>>>>>>>>> +++ b/fs/ocfs2/journal.c
>>>>>>>>>> @@ -666,21 +666,46 @@ static int __ocfs2_journal_access(handle_t *handle,
>>>>>>>>>>       	/* we can safely remove this assertion after testing. */
>>>>>>>>>>       	if (!buffer_uptodate(bh)) {
>>>>>>>>>>       		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
>>>>>>>>>> -		mlog(ML_ERROR, "b_blocknr=%llu\n",
>>>>>>>>>> -		     (unsigned long long)bh->b_blocknr);
>>>>>>>>>> +		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
>>>>>>>>>> +		     (unsigned long long)bh->b_blocknr, bh->b_state);
>>>>>>>>>>
>>>>>>>>>>       		lock_buffer(bh);
>>>>>>>>>>       		/*
>>>>>>>>>> -		 * A previous attempt to write this buffer head failed.
>>>>>>>>>> -		 * Nothing we can do but to retry the write and hope for
>>>>>>>>>> -		 * the best.
>>>>>>>>>> +		 * We should not reuse the dirty bh directly due to the
>>>>>>>>>> +		 * following situation:
>>>>>>>>>> +		 *
>>>>>>>>>> +		 * 1. When removing extent rec, we will dirty the bhs of
>>>>>>>>>> +		 *    extent rec and truncate log at the same time, and
>>>>>>>>>> +		 *    hand them over to jbd2.
>>>>>>>>>> +		 * 2. The bhs are not flushed to disk due to abnormal
>>>>>>>>>> +		 *    storage link.
>>>>>>>>>> +		 * 3. After a while the storage link become normal.
>>>>>>>>>> +		 *    Truncate log flush worker triggered by the next
>>>>>>>>>> +		 *    space reclaiming found the dirty bh of truncate log
>>>>>>>>>> +		 *    and clear its 'BH_Write_EIO' and then set it uptodate
>>>>>>>>>> +		 *    in __ocfs2_journal_access():
>>>>>>>>>> +		 *
>>>>>>>>>> +		 *    ocfs2_truncate_log_worker
>>>>>>>>>> +		 *      ocfs2_flush_truncate_log
>>>>>>>>>> +		 *        __ocfs2_flush_truncate_log
>>>>>>>>>> +		 *          ocfs2_replay_truncate_records
>>>>>>>>>> +		 *            ocfs2_journal_access_di
>>>>>>>>>> +		 *              __ocfs2_journal_access
>>>>>>>>>> +		 *
>>>>>>>>>> +		 * 4. Then jbd2 will flush the bh of truncate log to disk,
>>>>>>>>>> +		 *    but the bh of extent rec is still in error state, and
>>>>>>>>>> +		 *    unfortunately nobody will take care of it.
>>>>>>>>>> +		 * 5. At last the space of extent rec was not reduced,
>>>>>>>>>> +		 *    but truncate log flush worker have given it back to
>>>>>>>>>> +		 *    globalalloc. That will cause duplicate cluster problem
>>>>>>>>>> +		 *    which could be identified by fsck.ocfs2.
>>>>>>>>>> +		 *
>>>>>>>>>> +		 * So we should return -EIO in case of ruining atomicity
>>>>>>>>>> +		 * and consistency of space reclaim.
>>>>>>>>>>       		 */
>>>>>>>>>>       		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
>>>>>>>>>> -			clear_buffer_write_io_error(bh);
>>>>>>>>>> -			set_buffer_uptodate(bh);
>>>>>>>>>> -		}
>>>>>>>>>> -
>>>>>>>>>> -		if (!buffer_uptodate(bh)) {
>>>>>>>>>> +			mlog(ML_ERROR, "A previous attempt to write this "
>>>>>>>>>> +				"buffer head failed\n");
>>>>>>>>>>       			unlock_buffer(bh);
>>>>>>>>>>       			return -EIO;
>>>>>>>>>>       		}
>>>>>>>>>>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
> .
>

Patch
diff mbox

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 3630443..d769ca2 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -666,21 +666,46 @@  static int __ocfs2_journal_access(handle_t *handle,
 	/* we can safely remove this assertion after testing. */
 	if (!buffer_uptodate(bh)) {
 		mlog(ML_ERROR, "giving me a buffer that's not uptodate!\n");
-		mlog(ML_ERROR, "b_blocknr=%llu\n",
-		     (unsigned long long)bh->b_blocknr);
+		mlog(ML_ERROR, "b_blocknr=%llu, b_state=0x%lx\n",
+		     (unsigned long long)bh->b_blocknr, bh->b_state);

 		lock_buffer(bh);
 		/*
-		 * A previous attempt to write this buffer head failed.
-		 * Nothing we can do but to retry the write and hope for
-		 * the best.
+		 * We should not reuse the dirty bh directly due to the
+		 * following situation:
+		 *
+		 * 1. When removing extent rec, we will dirty the bhs of
+		 *    extent rec and truncate log at the same time, and
+		 *    hand them over to jbd2.
+		 * 2. The bhs are not flushed to disk due to abnormal
+		 *    storage link.
+		 * 3. After a while the storage link become normal.
+		 *    Truncate log flush worker triggered by the next
+		 *    space reclaiming found the dirty bh of truncate log
+		 *    and clear its 'BH_Write_EIO' and then set it uptodate
+		 *    in __ocfs2_journal_access():
+		 *
+		 *    ocfs2_truncate_log_worker
+		 *      ocfs2_flush_truncate_log
+		 *        __ocfs2_flush_truncate_log
+		 *          ocfs2_replay_truncate_records
+		 *            ocfs2_journal_access_di
+		 *              __ocfs2_journal_access
+		 *
+		 * 4. Then jbd2 will flush the bh of truncate log to disk,
+		 *    but the bh of extent rec is still in error state, and
+		 *    unfortunately nobody will take care of it.
+		 * 5. At last the space of extent rec was not reduced,
+		 *    but truncate log flush worker have given it back to
+		 *    globalalloc. That will cause duplicate cluster problem
+		 *    which could be identified by fsck.ocfs2.
+		 *
+		 * So we should return -EIO in case of ruining atomicity
+		 * and consistency of space reclaim.
 		 */
 		if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
-			clear_buffer_write_io_error(bh);
-			set_buffer_uptodate(bh);
-		}
-
-		if (!buffer_uptodate(bh)) {
+			mlog(ML_ERROR, "A previous attempt to write this "
+				"buffer head failed\n");
 			unlock_buffer(bh);
 			return -EIO;
 		}