Message ID | 20210908102055.10168-1-cymi20@fudan.edu.cn (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] ocfs2: Fix handle refcount leak in two exception handling paths | expand |
On 9/11/21 1:00 AM, Wengang Wang wrote: > > >> On Sep 9, 2021, at 6:53 PM, Joseph Qi <joseph.qi@linux.alibaba.com> wrote: >> >> >> >> On 9/10/21 1:48 AM, Wengang Wang wrote: >>> >>> >>> On Sep 9, 2021, at 4:07 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote: >>> >>> Hi Wengang, >>> >>> On 9/9/21 1:12 AM, Wengang Wang wrote: >>> Hi, >>> >>> Sorry for late involving, but this doesn’t look right to me. >>> >>> On Sep 8, 2021, at 3:51 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote: >>> >>> >>> >>> On 9/8/21 6:20 PM, Chenyuan Mi wrote: >>> The reference counting issue happens in two exception handling paths >>> of ocfs2_replay_truncate_records(). When executing these two exception >>> handling paths, the function forgets to decrease the refcount of handle >>> increased by ocfs2_start_trans(), causing a refcount leak. >>> >>> Fix this issue by using ocfs2_commit_trans() to decrease the refcount >>> of handle in two handling paths. >>> >>> Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn<mailto:cymi20@fudan.edu.cn>> >>> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn<mailto:xiyuyang19@fudan.edu.cn>> >>> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com<mailto:tanxin.ctf@gmail.com>> >>> >>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> >>> --- >>> fs/ocfs2/alloc.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c >>> index f1cc8258d34a..b05fde7edc3a 100644 >>> --- a/fs/ocfs2/alloc.c >>> +++ b/fs/ocfs2/alloc.c >>> @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, >>> status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh, >>> OCFS2_JOURNAL_ACCESS_WRITE); >>> if (status < 0) { >>> + ocfs2_commit_trans(osb, handle); >>> mlog_errno(status); >>> goto bail; >>> } >>> @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, >>> data_alloc_bh, start_blk, >>> num_clusters); >>> if (status < 0) { >>> + ocfs2_commit_trans(osb, handle); >>> >>> As a transaction, stuff expected to be in the same handle should be treated as atomic. >>> Here the stuff includes the tl_bh and other metadata block which will be modified in ocfs2_free_clusters(). >>> Coming here, some of related meta blocks may be in the handle but others are not due to the error happened. >>> If you do a commit, partial meta blocks are committed to log. — that breaks the atomic idea, it will cause FS inconsistency. >>> So what’s reason you want to commit the meta block changes, which is not all of expected, in this handle to journal log? >>> >>> Do you really see a hit on the failure? or just you detected the refcount leak by code review? >>> >>> You may want to look at ocfs2_journal_dirty() for the error handling part. >>> >>> >>> For the first error handling, since we don't call ocfs2_journal_dirty() >>> yet, so won't be a problem. >>> For the second error handling, I think we don't have a better way. Look >>> at other callers of ocfs2_free_clusters(), we simply ignore the error >>> code. >>> Anyway, we should commit transaction if starts, otherwise journal will >>> be abnormal. >>> >>> I don't think so. If error happened, we should fail ocfs2, rather than do a partial committing. >>> >> >> Umm... not exactly... >> Take ocfs2_free_clusters() for example, when it fails in case of EIO or >> ENOMEM, we can't just abort journal in such cases, because it is not so >> serious, only a bit blocks still occupied and they will recovery during >> the next mount. > > So are you sure a partial journal commitment won’t cause FS inconsistency? any proof for that? > Problem is not just if we can try to free the blocks again or not. The problem is FS in inconsistent state. > > I didn’t look into ocfs2_free_clusters() further, but can image this case: > > 1) We are going to free some clusters/blocks. > 2) We may need a new (not merging into existing) record to remember the new free extent. > 3) The new record needs to be inserted into free extent tree. > 4) the block (block A) where the new record to be inserted could be already full thus no space for the new record. > 5) then we need at least a new block (block B) to store the new record. (to maintain the free block btree, maybe another block, block C is needed too). > 6) So we need to save the pointer (block number) of block B in block A and store the new record in block B. > 7) In this case we need to make sure block A and block B either both in journal log, or none of them in journal log. We don’t allow block A is in journal bot block B is not, right? > > go back to question, Error could after block B is added to journal handle but before block A is added. In case we do a journal commit when hitting that error, we are committing block B to journal but leaving block A not in. > If panic happened, block A could never has the pointer pointing to block B. The result is block B is permanently lost (we can never reuse this block again). > Or if we add block A to journal first before adding block B and error happens After block A is added and before block B is added. Then we have the pointer pointing to block B in block A after panic, but block B doesn’t contain valid contents. The result is that we will hit problem when visiting block B as a valid btree block. > > >> That's why we have "errors=continue" in most filesystems, we should always >> consider the business continuity first. >> Also you can look at ext4_free_blocks() for reference. > > OCFS2 doesn’t support ERRORS_CONT, or to say it just ignore that option. No matter ERRORS_CONT is supported or not by OCFS2, > The key is not to leave the FS in inconsistent state. > I didn't say it won't cause inconsistency, but "don't have a better way". IIUC, ocfs2_free_cluster() mainly clears the bitmap and mark them free again. So the lost space is some what a cost for "please don't abort business if error happens but not be so serious". I think that's why other callers will also commit transaction even ocfs2_free_cluster() fails. Thanks, Joseph
> On Sep 13, 2021, at 7:12 PM, Joseph Qi <joseph.qi@linux.alibaba.com> wrote: > > > > On 9/11/21 1:00 AM, Wengang Wang wrote: >> >> >>> On Sep 9, 2021, at 6:53 PM, Joseph Qi <joseph.qi@linux.alibaba.com> wrote: >>> >>> >>> >>> On 9/10/21 1:48 AM, Wengang Wang wrote: >>>> >>>> >>>> On Sep 9, 2021, at 4:07 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote: >>>> >>>> Hi Wengang, >>>> >>>> On 9/9/21 1:12 AM, Wengang Wang wrote: >>>> Hi, >>>> >>>> Sorry for late involving, but this doesn’t look right to me. >>>> >>>> On Sep 8, 2021, at 3:51 AM, Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> wrote: >>>> >>>> >>>> >>>> On 9/8/21 6:20 PM, Chenyuan Mi wrote: >>>> The reference counting issue happens in two exception handling paths >>>> of ocfs2_replay_truncate_records(). When executing these two exception >>>> handling paths, the function forgets to decrease the refcount of handle >>>> increased by ocfs2_start_trans(), causing a refcount leak. >>>> >>>> Fix this issue by using ocfs2_commit_trans() to decrease the refcount >>>> of handle in two handling paths. >>>> >>>> Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn<mailto:cymi20@fudan.edu.cn>> >>>> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn<mailto:xiyuyang19@fudan.edu.cn>> >>>> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com<mailto:tanxin.ctf@gmail.com>> >>>> >>>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com<mailto:joseph.qi@linux.alibaba.com>> >>>> --- >>>> fs/ocfs2/alloc.c | 2 ++ >>>> 1 file changed, 2 insertions(+) >>>> >>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c >>>> index f1cc8258d34a..b05fde7edc3a 100644 >>>> --- a/fs/ocfs2/alloc.c >>>> +++ b/fs/ocfs2/alloc.c >>>> @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, >>>> status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh, >>>> OCFS2_JOURNAL_ACCESS_WRITE); >>>> if (status < 0) { >>>> + ocfs2_commit_trans(osb, handle); >>>> mlog_errno(status); >>>> goto bail; >>>> } >>>> @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, >>>> data_alloc_bh, start_blk, >>>> num_clusters); >>>> if (status < 0) { >>>> + ocfs2_commit_trans(osb, handle); >>>> >>>> As a transaction, stuff expected to be in the same handle should be treated as atomic. >>>> Here the stuff includes the tl_bh and other metadata block which will be modified in ocfs2_free_clusters(). >>>> Coming here, some of related meta blocks may be in the handle but others are not due to the error happened. >>>> If you do a commit, partial meta blocks are committed to log. — that breaks the atomic idea, it will cause FS inconsistency. >>>> So what’s reason you want to commit the meta block changes, which is not all of expected, in this handle to journal log? >>>> >>>> Do you really see a hit on the failure? or just you detected the refcount leak by code review? >>>> >>>> You may want to look at ocfs2_journal_dirty() for the error handling part. >>>> >>>> >>>> For the first error handling, since we don't call ocfs2_journal_dirty() >>>> yet, so won't be a problem. >>>> For the second error handling, I think we don't have a better way. Look >>>> at other callers of ocfs2_free_clusters(), we simply ignore the error >>>> code. >>>> Anyway, we should commit transaction if starts, otherwise journal will >>>> be abnormal. >>>> >>>> I don't think so. If error happened, we should fail ocfs2, rather than do a partial committing. >>>> >>> >>> Umm... not exactly... >>> Take ocfs2_free_clusters() for example, when it fails in case of EIO or >>> ENOMEM, we can't just abort journal in such cases, because it is not so >>> serious, only a bit blocks still occupied and they will recovery during >>> the next mount. >> >> So are you sure a partial journal commitment won’t cause FS inconsistency? any proof for that? >> Problem is not just if we can try to free the blocks again or not. The problem is FS in inconsistent state. >> >> I didn’t look into ocfs2_free_clusters() further, but can image this case: >> >> 1) We are going to free some clusters/blocks. >> 2) We may need a new (not merging into existing) record to remember the new free extent. >> 3) The new record needs to be inserted into free extent tree. >> 4) the block (block A) where the new record to be inserted could be already full thus no space for the new record. >> 5) then we need at least a new block (block B) to store the new record. (to maintain the free block btree, maybe another block, block C is needed too). >> 6) So we need to save the pointer (block number) of block B in block A and store the new record in block B. >> 7) In this case we need to make sure block A and block B either both in journal log, or none of them in journal log. We don’t allow block A is in journal bot block B is not, right? >> >> go back to question, Error could after block B is added to journal handle but before block A is added. In case we do a journal commit when hitting that error, we are committing block B to journal but leaving block A not in. >> If panic happened, block A could never has the pointer pointing to block B. The result is block B is permanently lost (we can never reuse this block again). >> Or if we add block A to journal first before adding block B and error happens After block A is added and before block B is added. Then we have the pointer pointing to block B in block A after panic, but block B doesn’t contain valid contents. The result is that we will hit problem when visiting block B as a valid btree block. >> >> >>> That's why we have "errors=continue" in most filesystems, we should always >>> consider the business continuity first. >>> Also you can look at ext4_free_blocks() for reference. >> >> OCFS2 doesn’t support ERRORS_CONT, or to say it just ignore that option. No matter ERRORS_CONT is supported or not by OCFS2, >> The key is not to leave the FS in inconsistent state. >> > > I didn't say it won't cause inconsistency, but "don't have a better way". > IIUC, ocfs2_free_cluster() mainly clears the bitmap and mark them free again. > So the lost space is some what a cost for "please don't abort business if > error happens but not be so serious". I think that's why other callers will > also commit transaction even ocfs2_free_cluster() fails. > Hm.. OK, seems it needs a big effort to “have a better way”. And Maybe freeing blocks is not “so serious”, I let it go. thanks, wengang
Hi Andrew, Now there is no objection on this patch. Would you please pick it into your -mm tree? Thanks. Joseph On 9/8/21 6:51 PM, Joseph Qi wrote: > > > On 9/8/21 6:20 PM, Chenyuan Mi wrote: >> The reference counting issue happens in two exception handling paths >> of ocfs2_replay_truncate_records(). When executing these two exception >> handling paths, the function forgets to decrease the refcount of handle >> increased by ocfs2_start_trans(), causing a refcount leak. >> >> Fix this issue by using ocfs2_commit_trans() to decrease the refcount >> of handle in two handling paths. >> >> Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn> >> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> >> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> > > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> >> --- >> fs/ocfs2/alloc.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c >> index f1cc8258d34a..b05fde7edc3a 100644 >> --- a/fs/ocfs2/alloc.c >> +++ b/fs/ocfs2/alloc.c >> @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, >> status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh, >> OCFS2_JOURNAL_ACCESS_WRITE); >> if (status < 0) { >> + ocfs2_commit_trans(osb, handle); >> mlog_errno(status); >> goto bail; >> } >> @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, >> data_alloc_bh, start_blk, >> num_clusters); >> if (status < 0) { >> + ocfs2_commit_trans(osb, handle); >> mlog_errno(status); >> goto bail; >> } >> > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel >
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index f1cc8258d34a..b05fde7edc3a 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh, OCFS2_JOURNAL_ACCESS_WRITE); if (status < 0) { + ocfs2_commit_trans(osb, handle); mlog_errno(status); goto bail; } @@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records(struct ocfs2_super *osb, data_alloc_bh, start_blk, num_clusters); if (status < 0) { + ocfs2_commit_trans(osb, handle); mlog_errno(status); goto bail; }