diff mbox series

[v2] ocfs2: Fix handle refcount leak in two exception handling paths

Message ID 20210908102055.10168-1-cymi20@fudan.edu.cn (mailing list archive)
State New, archived
Headers show
Series [v2] ocfs2: Fix handle refcount leak in two exception handling paths | expand

Commit Message

Chenyuan Mi Sept. 8, 2021, 10:20 a.m. UTC
The reference counting issue happens in two exception handling paths
of ocfs2_replay_truncate_records(). When executing these two exception
handling paths, the function forgets to decrease the refcount of handle
increased by ocfs2_start_trans(), causing a refcount leak.

Fix this issue by using ocfs2_commit_trans() to decrease the refcount
of handle in two handling paths.

Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
---
 fs/ocfs2/alloc.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Joseph Qi Sept. 14, 2021, 2:12 a.m. UTC | #1
On 9/11/21 1:00 AM, Wengang Wang wrote:
> 
> 
>> On Sep 9, 2021, at 6:53 PM, Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
>>
>>
>>
>> On 9/10/21 1:48 AM, Wengang Wang wrote:
>>>
>>>
>>> On Sep 9, 2021, at 4:07 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote:
>>>
>>> Hi Wengang,
>>>
>>> On 9/9/21 1:12 AM, Wengang Wang wrote:
>>> Hi,
>>>
>>> Sorry for late involving, but this doesn’t look right to me.
>>>
>>> On Sep 8, 2021, at 3:51 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote:
>>>
>>>
>>>
>>> On 9/8/21 6:20 PM, Chenyuan Mi wrote:
>>> The reference counting issue happens in two exception handling paths
>>> of ocfs2_replay_truncate_records(). When executing these two exception
>>> handling paths, the function forgets to decrease the refcount of handle
>>> increased by ocfs2_start_trans(), causing a refcount leak.
>>>
>>> Fix this issue by using ocfs2_commit_trans() to decrease the refcount
>>> of handle in two handling paths.
>>>
>>> Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn<mailto:cymi20@fudan.edu.cn>>
>>> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn<mailto:xiyuyang19@fudan.edu.cn>>
>>> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com<mailto:tanxin.ctf@gmail.com>>
>>>
>>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>>
>>> ---
>>> fs/ocfs2/alloc.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>> index f1cc8258d34a..b05fde7edc3a 100644
>>> --- a/fs/ocfs2/alloc.c
>>> +++ b/fs/ocfs2/alloc.c
>>> @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
>>> status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh,
>>> OCFS2_JOURNAL_ACCESS_WRITE);
>>> if (status < 0) {
>>> + ocfs2_commit_trans(osb, handle);
>>> mlog_errno(status);
>>> goto bail;
>>> }
>>> @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
>>>     data_alloc_bh, start_blk,
>>>     num_clusters);
>>> if (status < 0) {
>>> + ocfs2_commit_trans(osb, handle);
>>>
>>> As a transaction, stuff expected to be in the same handle should be treated as atomic.
>>> Here the stuff includes the tl_bh and other metadata block which will be modified in ocfs2_free_clusters().
>>> Coming here, some of related meta blocks may be in the handle but others are not due to the error happened.
>>> If you do a commit, partial meta blocks are committed to log. — that breaks the atomic idea, it will cause FS inconsistency.
>>> So what’s reason you want to commit the meta block changes, which is not all of expected, in this handle to journal log?
>>>
>>> Do you really see a hit on the failure? or just you detected the refcount leak by code review?
>>>
>>> You may want to look at ocfs2_journal_dirty() for the error handling part.
>>>
>>>
>>> For the first error handling, since we don't call ocfs2_journal_dirty()
>>> yet, so won't be a problem.
>>> For the second error handling, I think we don't have a better way. Look
>>> at other callers of ocfs2_free_clusters(), we simply ignore the error
>>> code.
>>> Anyway, we should commit transaction if starts, otherwise journal will
>>> be abnormal.
>>>
>>> I don't think so. If error happened, we should fail ocfs2, rather than do a partial committing.
>>>
>>
>> Umm... not exactly...
>> Take ocfs2_free_clusters() for example, when it fails in case of EIO or
>> ENOMEM, we can't just abort journal in such cases, because it is not so
>> serious, only a bit blocks still occupied and they will recovery during
>> the next mount. 
> 
> So are you sure a partial journal commitment won’t cause FS inconsistency? any proof for that?
> Problem is not just if we can try to free the blocks again or not. The problem is FS in inconsistent state.
> 
> I didn’t look into ocfs2_free_clusters() further, but can image this case:
> 
> 1) We are going to free some clusters/blocks. 
> 2) We may need a new (not merging into existing) record to remember the new free extent. 
> 3) The new record needs to be inserted into free extent tree.
> 4) the block (block A) where the new record to be inserted could be already full thus no space for the new record.
> 5) then we need at least a new block (block B) to store the new record. (to maintain the free block btree, maybe another block, block C is needed too).
> 6) So we need to save the pointer (block number) of block B in block A and store the new record in block B.
> 7) In this case we need to make sure block A and block B either both in journal log, or none of them in journal log.  We don’t allow block A is in journal bot block B is not, right?
> 
> go back to question,   Error could after block B is added to journal handle but before block A is added. In case we do a journal commit when hitting that error, we are committing block B to journal but leaving block A not in.
> If panic happened, block A could never has the pointer pointing to block B. The result is block B is permanently lost (we can never reuse this block again). 
> Or if we add block A to journal first before adding block B and error happens After block A is added and before block B is added.  Then we have the pointer pointing to block B in block A after panic, but block B doesn’t contain valid contents. The result is that we will hit problem when visiting block B as a valid btree block.
> 
> 
>> That's why we have "errors=continue" in most filesystems, we should always
>> consider the business continuity first.
>> Also you can look at ext4_free_blocks() for reference.
> 
> OCFS2 doesn’t support ERRORS_CONT, or to say it just ignore that option.  No matter ERRORS_CONT is supported or not by OCFS2,
> The key is not to leave the FS in inconsistent state.
> 

I didn't say it won't cause inconsistency, but "don't have a better way".
IIUC, ocfs2_free_cluster() mainly clears the bitmap and mark them free again.
So the lost space is some what a cost for "please don't abort business if
error happens but not be so serious". I think that's why other callers will
also commit transaction even ocfs2_free_cluster() fails.

Thanks,
Joseph
Wengang Wang Sept. 14, 2021, 3:07 a.m. UTC | #2
> On Sep 13, 2021, at 7:12 PM, Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
> 
> 
> 
> On 9/11/21 1:00 AM, Wengang Wang wrote:
>> 
>> 
>>> On Sep 9, 2021, at 6:53 PM, Joseph Qi <joseph.qi@linux.alibaba.com> wrote:
>>> 
>>> 
>>> 
>>> On 9/10/21 1:48 AM, Wengang Wang wrote:
>>>> 
>>>> 
>>>> On Sep 9, 2021, at 4:07 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote:
>>>> 
>>>> Hi Wengang,
>>>> 
>>>> On 9/9/21 1:12 AM, Wengang Wang wrote:
>>>> Hi,
>>>> 
>>>> Sorry for late involving, but this doesn’t look right to me.
>>>> 
>>>> On Sep 8, 2021, at 3:51 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On 9/8/21 6:20 PM, Chenyuan Mi wrote:
>>>> The reference counting issue happens in two exception handling paths
>>>> of ocfs2_replay_truncate_records(). When executing these two exception
>>>> handling paths, the function forgets to decrease the refcount of handle
>>>> increased by ocfs2_start_trans(), causing a refcount leak.
>>>> 
>>>> Fix this issue by using ocfs2_commit_trans() to decrease the refcount
>>>> of handle in two handling paths.
>>>> 
>>>> Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn<mailto:cymi20@fudan.edu.cn>>
>>>> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn<mailto:xiyuyang19@fudan.edu.cn>>
>>>> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com<mailto:tanxin.ctf@gmail.com>>
>>>> 
>>>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>>
>>>> ---
>>>> fs/ocfs2/alloc.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>> 
>>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>>> index f1cc8258d34a..b05fde7edc3a 100644
>>>> --- a/fs/ocfs2/alloc.c
>>>> +++ b/fs/ocfs2/alloc.c
>>>> @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
>>>> status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh,
>>>> OCFS2_JOURNAL_ACCESS_WRITE);
>>>> if (status < 0) {
>>>> + ocfs2_commit_trans(osb, handle);
>>>> mlog_errno(status);
>>>> goto bail;
>>>> }
>>>> @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
>>>>    data_alloc_bh, start_blk,
>>>>    num_clusters);
>>>> if (status < 0) {
>>>> + ocfs2_commit_trans(osb, handle);
>>>> 
>>>> As a transaction, stuff expected to be in the same handle should be treated as atomic.
>>>> Here the stuff includes the tl_bh and other metadata block which will be modified in ocfs2_free_clusters().
>>>> Coming here, some of related meta blocks may be in the handle but others are not due to the error happened.
>>>> If you do a commit, partial meta blocks are committed to log. — that breaks the atomic idea, it will cause FS inconsistency.
>>>> So what’s reason you want to commit the meta block changes, which is not all of expected, in this handle to journal log?
>>>> 
>>>> Do you really see a hit on the failure? or just you detected the refcount leak by code review?
>>>> 
>>>> You may want to look at ocfs2_journal_dirty() for the error handling part.
>>>> 
>>>> 
>>>> For the first error handling, since we don't call ocfs2_journal_dirty()
>>>> yet, so won't be a problem.
>>>> For the second error handling, I think we don't have a better way. Look
>>>> at other callers of ocfs2_free_clusters(), we simply ignore the error
>>>> code.
>>>> Anyway, we should commit transaction if starts, otherwise journal will
>>>> be abnormal.
>>>> 
>>>> I don't think so. If error happened, we should fail ocfs2, rather than do a partial committing.
>>>> 
>>> 
>>> Umm... not exactly...
>>> Take ocfs2_free_clusters() for example, when it fails in case of EIO or
>>> ENOMEM, we can't just abort journal in such cases, because it is not so
>>> serious, only a bit blocks still occupied and they will recovery during
>>> the next mount. 
>> 
>> So are you sure a partial journal commitment won’t cause FS inconsistency? any proof for that?
>> Problem is not just if we can try to free the blocks again or not. The problem is FS in inconsistent state.
>> 
>> I didn’t look into ocfs2_free_clusters() further, but can image this case:
>> 
>> 1) We are going to free some clusters/blocks. 
>> 2) We may need a new (not merging into existing) record to remember the new free extent. 
>> 3) The new record needs to be inserted into free extent tree.
>> 4) the block (block A) where the new record to be inserted could be already full thus no space for the new record.
>> 5) then we need at least a new block (block B) to store the new record. (to maintain the free block btree, maybe another block, block C is needed too).
>> 6) So we need to save the pointer (block number) of block B in block A and store the new record in block B.
>> 7) In this case we need to make sure block A and block B either both in journal log, or none of them in journal log.  We don’t allow block A is in journal bot block B is not, right?
>> 
>> go back to question,   Error could after block B is added to journal handle but before block A is added. In case we do a journal commit when hitting that error, we are committing block B to journal but leaving block A not in.
>> If panic happened, block A could never has the pointer pointing to block B. The result is block B is permanently lost (we can never reuse this block again). 
>> Or if we add block A to journal first before adding block B and error happens After block A is added and before block B is added.  Then we have the pointer pointing to block B in block A after panic, but block B doesn’t contain valid contents. The result is that we will hit problem when visiting block B as a valid btree block.
>> 
>> 
>>> That's why we have "errors=continue" in most filesystems, we should always
>>> consider the business continuity first.
>>> Also you can look at ext4_free_blocks() for reference.
>> 
>> OCFS2 doesn’t support ERRORS_CONT, or to say it just ignore that option.  No matter ERRORS_CONT is supported or not by OCFS2,
>> The key is not to leave the FS in inconsistent state.
>> 
> 
> I didn't say it won't cause inconsistency, but "don't have a better way".
> IIUC, ocfs2_free_cluster() mainly clears the bitmap and mark them free again.
> So the lost space is some what a cost for "please don't abort business if
> error happens but not be so serious". I think that's why other callers will
> also commit transaction even ocfs2_free_cluster() fails.
> 

Hm.. OK, seems it needs a big effort to “have a better way”.
And Maybe freeing blocks is not “so serious”, I let it go.

thanks,
wengang
Joseph Qi Sept. 15, 2021, 2:36 a.m. UTC | #3
Hi Andrew,
Now there is no objection on this patch.
Would you please pick it into your -mm tree? Thanks.

Joseph

On 9/8/21 6:51 PM, Joseph Qi wrote:
> 
> 
> On 9/8/21 6:20 PM, Chenyuan Mi wrote:
>> The reference counting issue happens in two exception handling paths
>> of ocfs2_replay_truncate_records(). When executing these two exception
>> handling paths, the function forgets to decrease the refcount of handle
>> increased by ocfs2_start_trans(), causing a refcount leak.
>>
>> Fix this issue by using ocfs2_commit_trans() to decrease the refcount
>> of handle in two handling paths.
>>
>> Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn>
>> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
>> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
> 
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
>> ---
>>  fs/ocfs2/alloc.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>> index f1cc8258d34a..b05fde7edc3a 100644
>> --- a/fs/ocfs2/alloc.c
>> +++ b/fs/ocfs2/alloc.c
>> @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
>>  		status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh,
>>  						 OCFS2_JOURNAL_ACCESS_WRITE);
>>  		if (status < 0) {
>> +			ocfs2_commit_trans(osb, handle);
>>  			mlog_errno(status);
>>  			goto bail;
>>  		}
>> @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
>>  						     data_alloc_bh, start_blk,
>>  						     num_clusters);
>>  			if (status < 0) {
>> +				ocfs2_commit_trans(osb, handle);
>>  				mlog_errno(status);
>>  				goto bail;
>>  			}
>>
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
diff mbox series

Patch

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index f1cc8258d34a..b05fde7edc3a 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -5940,6 +5940,7 @@  static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
 		status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh,
 						 OCFS2_JOURNAL_ACCESS_WRITE);
 		if (status < 0) {
+			ocfs2_commit_trans(osb, handle);
 			mlog_errno(status);
 			goto bail;
 		}
@@ -5964,6 +5965,7 @@  static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
 						     data_alloc_bh, start_blk,
 						     num_clusters);
 			if (status < 0) {
+				ocfs2_commit_trans(osb, handle);
 				mlog_errno(status);
 				goto bail;
 			}