diff mbox series

ocfs2: call ocfs2_abort when journal abort

Message ID 20231030120057.928280-1-srivathsa.d.dara@oracle.com (mailing list archive)
State New
Headers show
Series ocfs2: call ocfs2_abort when journal abort | expand

Commit Message

Srivathsa Dara Oct. 30, 2023, noon UTC
From: Ryan Ding <ryan.ding@oracle.com>


journal can not recover from abort state, so we should take following 
action to prevent file system from corruption:

1. change to readonly filesystem when local mount. We can not afford 
   further write, so change to RO state is reasonable.

2. panic when cluster mount. Because we can not release lock resource in
   this state, other node will hung when it require a lock owned by this
   node. So panic and remaster is a reasonable choise.

ocfs2_abort() will do all the above work.

Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
---
 fs/ocfs2/journal.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

Comments

Anthony Iliopoulos Oct. 31, 2023, 11:23 p.m. UTC | #1
On Mon, Oct 30, 2023 at 12:00:57PM +0000, Srivathsa Dara wrote:
> From: Ryan Ding <ryan.ding@oracle.com>
> 
> 
> journal can not recover from abort state, so we should take following 
> action to prevent file system from corruption:

Could you elaborate on how corruption can occur when the filesystem is
in this particular state? How could this be reproduced?

> 1. change to readonly filesystem when local mount. We can not afford 
>    further write, so change to RO state is reasonable.

This is generally the strategy, but unrelated to the code that this
commit is touching. The ocfs2_commit_thread() only applies to non-local
mounts.

> 2. panic when cluster mount. Because we can not release lock resource in
>    this state, other node will hung when it require a lock owned by this
>    node. So panic and remaster is a reasonable choise.

Panicking is never a reasonable choice :-) Why couldn't a remastering be
initiated without going through a reboot instead?

> ocfs2_abort() will do all the above work.
> 
> Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
> Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
> ---
>  fs/ocfs2/journal.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index ce215565d061..6dace475f019 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -14,7 +14,6 @@
>  #include <linux/kthread.h>
>  #include <linux/time.h>
>  #include <linux/random.h>
> -#include <linux/delay.h>
>  #include <linux/writeback.h>
>  
>  #include <cluster/masklog.h>
> @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct ocfs2_super *osb, int quota)
>  
>  static int ocfs2_commit_thread(void *arg)
>  {
> -	int status;
> +	int status = 0;
>  	struct ocfs2_super *osb = arg;
>  	struct ocfs2_journal *journal = osb->journal;
>  
> @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
>  		wait_event_interruptible(osb->checkpoint_event,
>  					 atomic_read(&journal->j_num_trans)
>  					 || kthread_should_stop());
> +		if (status < 0) {
> +			/* As we can not terminate by ourself, just enter an
> +			 * empty loop to wait for stop.
> +			 */
> +			continue;
> +		}

the above is unnecessary, given that below will induce panic upon
failure.

>  		status = ocfs2_commit_cache(osb);
>  		if (status < 0) {
> -			static unsigned long abort_warn_time;
> -
> -			/* Warn about this once per minute */
> -			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> -				mlog(ML_ERROR, "status = %d, journal is "
> -						"already aborted.\n", status);
>  			/*
> -			 * After ocfs2_commit_cache() fails, j_num_trans has a
> -			 * non-zero value.  Sleep here to avoid a busy-wait
> -			 * loop.
> +			 * journal can not recover from abort state, there is
> +			 * no need to keep commit cache. So we should either
> +			 * change to readonly(local mount) or just panic
> +			 * (cluster mount).

commit cache doesn't apply to local mounts.

> +			 * We should also clear j_num_trans to prevent further
> +			 * commit.

that's redundant given that we're about to panic.

>  			 */
> -			msleep_interruptible(1000);
> +			atomic_set(&journal->j_num_trans, 0);

same here.

> +			ocfs2_abort(osb->sb, "Detected aborted journal");

ocfs2_commit_cache returning with an error does not necessarily mean
that the journal has been aborted; jbd2_journal_flush will return an
error for other conditions too, some of which could be recoverable.

Therefore unconditionally aborting here is not necessarily warranted,
and presumably this is why the current code is a retry loop.

But even if the journal has been aborted at that point, there's no
reason why the node cannot self-fence and effectively shutdown the
local fs, instead of aborting with a panic.

Regards,
Anthony
Joseph Qi Nov. 1, 2023, 1:13 a.m. UTC | #2
On 10/30/23 8:00 PM, Srivathsa Dara wrote:
> From: Ryan Ding <ryan.ding@oracle.com>
> 
> 
> journal can not recover from abort state, so we should take following 
> action to prevent file system from corruption:
> 
> 1. change to readonly filesystem when local mount. We can not afford 
>    further write, so change to RO state is reasonable.
> 
> 2. panic when cluster mount. Because we can not release lock resource in
>    this state, other node will hung when it require a lock owned by this
>    node. So panic and remaster is a reasonable choise.
> 
> ocfs2_abort() will do all the above work.
> 
> Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
> Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
> ---
>  fs/ocfs2/journal.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index ce215565d061..6dace475f019 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -14,7 +14,6 @@
>  #include <linux/kthread.h>
>  #include <linux/time.h>
>  #include <linux/random.h>
> -#include <linux/delay.h>
>  #include <linux/writeback.h>
>  
>  #include <cluster/masklog.h>
> @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct ocfs2_super *osb, int quota)
>  
>  static int ocfs2_commit_thread(void *arg)
>  {
> -	int status;
> +	int status = 0;
>  	struct ocfs2_super *osb = arg;
>  	struct ocfs2_journal *journal = osb->journal;
>  
> @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
>  		wait_event_interruptible(osb->checkpoint_event,
>  					 atomic_read(&journal->j_num_trans)
>  					 || kthread_should_stop());
> +		if (status < 0) {
> +			/* As we can not terminate by ourself, just enter an
> +			 * empty loop to wait for stop.
> +			 */
> +			continue;
> +		}
>  
>  		status = ocfs2_commit_cache(osb);
>  		if (status < 0) {
> -			static unsigned long abort_warn_time;
> -
> -			/* Warn about this once per minute */
> -			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> -				mlog(ML_ERROR, "status = %d, journal is "
> -						"already aborted.\n", status);
>  			/*
> -			 * After ocfs2_commit_cache() fails, j_num_trans has a
> -			 * non-zero value.  Sleep here to avoid a busy-wait
> -			 * loop.
> +			 * journal can not recover from abort state, there is
> +			 * no need to keep commit cache. So we should either
> +			 * change to readonly(local mount) or just panic
> +			 * (cluster mount).
> +			 * We should also clear j_num_trans to prevent further
> +			 * commit.
>  			 */
> -			msleep_interruptible(1000);
> +			atomic_set(&journal->j_num_trans, 0);

Unconditionally clear 'j_num_trans' here seems buggy.
This may cause other nodes corrupt filesystem.

Thanks,
Joseph

> +			ocfs2_abort(osb->sb, "Detected aborted journal");
>  		}
>  
>  		if (kthread_should_stop() && atomic_read(&journal->j_num_trans)){
Heming Zhao Nov. 1, 2023, 2:42 a.m. UTC | #3
On 11/1/23 09:13, Joseph Qi wrote:
> 
> 
> On 10/30/23 8:00 PM, Srivathsa Dara wrote:
>> From: Ryan Ding <ryan.ding@oracle.com>
>>
>>
>> journal can not recover from abort state, so we should take following
>> action to prevent file system from corruption:
>>
>> 1. change to readonly filesystem when local mount. We can not afford
>>     further write, so change to RO state is reasonable.
>>
>> 2. panic when cluster mount. Because we can not release lock resource in
>>     this state, other node will hung when it require a lock owned by this
>>     node. So panic and remaster is a reasonable choise.
>>
>> ocfs2_abort() will do all the above work.
>>
>> Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
>> Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
>> ---
>>   fs/ocfs2/journal.c | 27 +++++++++++++++------------
>>   1 file changed, 15 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index ce215565d061..6dace475f019 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -14,7 +14,6 @@
>>   #include <linux/kthread.h>
>>   #include <linux/time.h>
>>   #include <linux/random.h>
>> -#include <linux/delay.h>
>>   #include <linux/writeback.h>
>>   
>>   #include <cluster/masklog.h>
>> @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct ocfs2_super *osb, int quota)
>>   
>>   static int ocfs2_commit_thread(void *arg)
>>   {
>> -	int status;
>> +	int status = 0;
>>   	struct ocfs2_super *osb = arg;
>>   	struct ocfs2_journal *journal = osb->journal;
>>   
>> @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
>>   		wait_event_interruptible(osb->checkpoint_event,
>>   					 atomic_read(&journal->j_num_trans)
>>   					 || kthread_should_stop());
>> +		if (status < 0) {
>> +			/* As we can not terminate by ourself, just enter an
>> +			 * empty loop to wait for stop.
>> +			 */
>> +			continue;
>> +		}
>>   
>>   		status = ocfs2_commit_cache(osb);
>>   		if (status < 0) {
>> -			static unsigned long abort_warn_time;
>> -
>> -			/* Warn about this once per minute */
>> -			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
>> -				mlog(ML_ERROR, "status = %d, journal is "
>> -						"already aborted.\n", status);
>>   			/*
>> -			 * After ocfs2_commit_cache() fails, j_num_trans has a
>> -			 * non-zero value.  Sleep here to avoid a busy-wait
>> -			 * loop.
>> +			 * journal can not recover from abort state, there is
>> +			 * no need to keep commit cache. So we should either
>> +			 * change to readonly(local mount) or just panic
>> +			 * (cluster mount).
>> +			 * We should also clear j_num_trans to prevent further
>> +			 * commit.
>>   			 */
>> -			msleep_interruptible(1000);
>> +			atomic_set(&journal->j_num_trans, 0);
> 
> Unconditionally clear 'j_num_trans' here seems buggy.
> This may cause other nodes corrupt filesystem.

Yes, you described this buggy condition during review my patch:
https://lore.kernel.org/ocfs2-devel/2b90546e-d23e-7d45-22dd-a887a73f3e61@linux.alibaba.com/

btw, I ever send a v2 patch [1] about handling journal abort:
[1] https://lore.kernel.org/ocfs2-devel/20230626112453.22571-1-heming.zhao@suse.com/
This patch never get any feedback.

And I also send another mail [2] related with above patch:
[2] https://lore.kernel.org/ocfs2-devel/20230626150916.je3egonzb3crvbl5@p15/
please note, [2] is the problems from fsck.ocfs2, which is completely unrelated to [1].

Heming
> 
> Thanks,
> Joseph
> 
>> +			ocfs2_abort(osb->sb, "Detected aborted journal");
>>   		}
>>   
>>   		if (kthread_should_stop() && atomic_read(&journal->j_num_trans)){
>
Srivathsa Dara Nov. 9, 2023, 7:26 p.m. UTC | #4
Hi Anthony,
thanks for the review

-----Original Message-----
From: Anthony Iliopoulos <ailiop@suse.com> 
Sent: Wednesday, November 1, 2023 4:53 AM
To: Srivathsa Dara <srivathsa.d.dara@oracle.com>
Cc: mark@fasheh.com; jlbec@evilplan.org; joseph.qi@linux.alibaba.com; 
Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>; 
Junxiao Bi <junxiao.bi@oracle.com>; ocfs2-devel@lists.linux.dev
Subject: Re: [PATCH] ocfs2: call ocfs2_abort when journal abort

On Mon, Oct 30, 2023 at 12:00:57PM +0000, Srivathsa Dara wrote:
> From: Ryan Ding <ryan.ding@oracle.com>
> 
> 
> journal can not recover from abort state, so we should take following 
> action to prevent file system from corruption:

Could you elaborate on how corruption can occur when the filesystem is
in this particular state? How could this be reproduced?

[Srivathsa]: Corruption can occur due to storage failure

> 1. change to readonly filesystem when local mount. We can not afford 
>    further write, so change to RO state is reasonable.

This is generally the strategy, but unrelated to the code that this commit 
is touching. The ocfs2_commit_thread() only applies to non-local mounts.

[Srivathsa]: Yes, you are right, this is only for cluster mode. I will remove 
all local mount related comments.

> 2. panic when cluster mount. Because we can not release lock resource in
>    this state, other node will hung when it require a lock owned by this
>    node. So panic and remaster is a reasonable choise.

Panicking is never a reasonable choice :-) Why couldn't a remastering be
initiated without going through a reboot instead?

[Srivathsa]: Without reboot, we run into a journal failure, the only option
is to have fs in read-only mode. Could you suggest any other way to make
this node release the locks held by it?

> ocfs2_abort() will do all the above work.
> 
> Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
> Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
> ---
>  fs/ocfs2/journal.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 
> ce215565d061..6dace475f019 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -14,7 +14,6 @@
>  #include <linux/kthread.h>
>  #include <linux/time.h>
>  #include <linux/random.h>
> -#include <linux/delay.h>
>  #include <linux/writeback.h>
>  
>  #include <cluster/masklog.h>
> @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct 
> ocfs2_super *osb, int quota)
>  
>  static int ocfs2_commit_thread(void *arg)  {
> -	int status;
> +	int status = 0;
>  	struct ocfs2_super *osb = arg;
>  	struct ocfs2_journal *journal = osb->journal;
>  
> @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
>  		wait_event_interruptible(osb->checkpoint_event,
>  					 atomic_read(&journal->j_num_trans)
>  					 || kthread_should_stop());
> +		if (status < 0) {
> +			/* As we can not terminate by ourself, just enter an
> +			 * empty loop to wait for stop.
> +			 */
> +			continue;
> +		}

the above is unnecessary, given that below will induce panic upon failure.

>  		status = ocfs2_commit_cache(osb);
>  		if (status < 0) {
> -			static unsigned long abort_warn_time;
> -
> -			/* Warn about this once per minute */
> -			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> -				mlog(ML_ERROR, "status = %d, journal is "
> -						"already aborted.\n", status);
>  			/*
> -			 * After ocfs2_commit_cache() fails, j_num_trans has a
> -			 * non-zero value.  Sleep here to avoid a busy-wait
> -			 * loop.
> +			 * journal can not recover from abort state, there is
> +			 * no need to keep commit cache. So we should either
> +			 * change to readonly(local mount) or just panic
> +			 * (cluster mount).

commit cache doesn't apply to local mounts.

> +			 * We should also clear j_num_trans to prevent further
> +			 * commit.

that's redundant given that we're about to panic.

>  			 */
> -			msleep_interruptible(1000);
> +			atomic_set(&journal->j_num_trans, 0);

same here.

> +			ocfs2_abort(osb->sb, "Detected aborted journal");

ocfs2_commit_cache returning with an error does not necessarily mean that
the journal has been aborted; jbd2_journal_flush will return an error for other 
conditions too, some of which could be recoverable.

Therefore unconditionally aborting here is not necessarily warranted, and
presumably this is why the current code is a retry loop.

[Srivathsa]: Even the journal is not aborted, there is some serious 
error with the journal, keeping it running at this stage could lead
to corruption.

But even if the journal has been aborted at that point, there's no reason why
the node cannot self-fence and effectively shutdown the local fs, instead of 
aborting with a panic.

[Srivathsa]: By self-fence, did you mean system panic/reboot due 
to heartbeat loss? If so, that can't be guaranteed that it will happen
because the storage failure can be intermittent. Even if that happens,
it will panic or reboot, how effectively it will shutdown the local fs,
considering the journal is failing?

[Srivathsa]: will send a v2 addressing the redundant code.

Regards,
Anthony

Thanks,
Srivathsa
Anthony Iliopoulos Nov. 9, 2023, 11:58 p.m. UTC | #5
On Thu, Nov 09, 2023 at 07:26:16PM +0000, Srivathsa Dara wrote:
> Hi Anthony,
> thanks for the review
> 
> -----Original Message-----
> From: Anthony Iliopoulos <ailiop@suse.com> 
> Sent: Wednesday, November 1, 2023 4:53 AM
> To: Srivathsa Dara <srivathsa.d.dara@oracle.com>
> Cc: mark@fasheh.com; jlbec@evilplan.org; joseph.qi@linux.alibaba.com; 
> Rajesh Sivaramasubramaniom <rajesh.sivaramasubramaniom@oracle.com>; 
> Junxiao Bi <junxiao.bi@oracle.com>; ocfs2-devel@lists.linux.dev
> Subject: Re: [PATCH] ocfs2: call ocfs2_abort when journal abort
> 
> On Mon, Oct 30, 2023 at 12:00:57PM +0000, Srivathsa Dara wrote:
> > From: Ryan Ding <ryan.ding@oracle.com>
> > 
> > 
> > journal can not recover from abort state, so we should take following 
> > action to prevent file system from corruption:
> 
> Could you elaborate on how corruption can occur when the filesystem is
> in this particular state? How could this be reproduced?
> 
> [Srivathsa]: Corruption can occur due to storage failure

How does that failure lead from aborting the journal to causing fs
corruption? What exactly gets corrupted after journal abort? Can you
provide more details and possibly a reproducer on this?

> > 1. change to readonly filesystem when local mount. We can not afford 
> >    further write, so change to RO state is reasonable.
> 
> This is generally the strategy, but unrelated to the code that this commit 
> is touching. The ocfs2_commit_thread() only applies to non-local mounts.
> 
> [Srivathsa]: Yes, you are right, this is only for cluster mode. I will remove 
> all local mount related comments.
> 
> > 2. panic when cluster mount. Because we can not release lock resource in
> >    this state, other node will hung when it require a lock owned by this
> >    node. So panic and remaster is a reasonable choise.
> 
> Panicking is never a reasonable choice :-) Why couldn't a remastering be
> initiated without going through a reboot instead?
> 
> [Srivathsa]: Without reboot, we run into a journal failure, the only option
> is to have fs in read-only mode. Could you suggest any other way to make
> this node release the locks held by it?

Switching to read-only, and self-fencing so that another node will take
over the recovery.

> > ocfs2_abort() will do all the above work.
> > 
> > Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
> > Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
> > ---
> >  fs/ocfs2/journal.c | 27 +++++++++++++++------------
> >  1 file changed, 15 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 
> > ce215565d061..6dace475f019 100644
> > --- a/fs/ocfs2/journal.c
> > +++ b/fs/ocfs2/journal.c
> > @@ -14,7 +14,6 @@
> >  #include <linux/kthread.h>
> >  #include <linux/time.h>
> >  #include <linux/random.h>
> > -#include <linux/delay.h>
> >  #include <linux/writeback.h>
> >  
> >  #include <cluster/masklog.h>
> > @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct 
> > ocfs2_super *osb, int quota)
> >  
> >  static int ocfs2_commit_thread(void *arg)  {
> > -	int status;
> > +	int status = 0;
> >  	struct ocfs2_super *osb = arg;
> >  	struct ocfs2_journal *journal = osb->journal;
> >  
> > @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
> >  		wait_event_interruptible(osb->checkpoint_event,
> >  					 atomic_read(&journal->j_num_trans)
> >  					 || kthread_should_stop());
> > +		if (status < 0) {
> > +			/* As we can not terminate by ourself, just enter an
> > +			 * empty loop to wait for stop.
> > +			 */
> > +			continue;
> > +		}
> 
> the above is unnecessary, given that below will induce panic upon failure.
> 
> >  		status = ocfs2_commit_cache(osb);
> >  		if (status < 0) {
> > -			static unsigned long abort_warn_time;
> > -
> > -			/* Warn about this once per minute */
> > -			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> > -				mlog(ML_ERROR, "status = %d, journal is "
> > -						"already aborted.\n", status);
> >  			/*
> > -			 * After ocfs2_commit_cache() fails, j_num_trans has a
> > -			 * non-zero value.  Sleep here to avoid a busy-wait
> > -			 * loop.
> > +			 * journal can not recover from abort state, there is
> > +			 * no need to keep commit cache. So we should either
> > +			 * change to readonly(local mount) or just panic
> > +			 * (cluster mount).
> 
> commit cache doesn't apply to local mounts.
> 
> > +			 * We should also clear j_num_trans to prevent further
> > +			 * commit.
> 
> that's redundant given that we're about to panic.
> 
> >  			 */
> > -			msleep_interruptible(1000);
> > +			atomic_set(&journal->j_num_trans, 0);
> 
> same here.
> 
> > +			ocfs2_abort(osb->sb, "Detected aborted journal");
> 
> ocfs2_commit_cache returning with an error does not necessarily mean that
> the journal has been aborted; jbd2_journal_flush will return an error for other 
> conditions too, some of which could be recoverable.
> 
> Therefore unconditionally aborting here is not necessarily warranted, and
> presumably this is why the current code is a retry loop.
> 
> [Srivathsa]: Even the journal is not aborted, there is some serious 
> error with the journal, keeping it running at this stage could lead
> to corruption.

How, precisely? Irrespective of the journal state, this should not be
leading to any corruptions, the ocfs2_commit_thread will just keep
unsuccessfully attempting a journal flush. Otherwise there's a bug
somewhere that needs to be directly addressed.

> But even if the journal has been aborted at that point, there's no reason why
> the node cannot self-fence and effectively shutdown the local fs, instead of 
> aborting with a panic.
> 
> [Srivathsa]: By self-fence, did you mean system panic/reboot due 
> to heartbeat loss? If so, that can't be guaranteed that it will happen
> because the storage failure can be intermittent. Even if that happens,
> it will panic or reboot, how effectively it will shutdown the local fs,
> considering the journal is failing?

I mean intentionally stopping the heartbeat and switching to read-only.

> [Srivathsa]: will send a v2 addressing the redundant code.

Apart from the redundant code, I see two separate issues:

- the claim that there is a potential fs corruption bug after a journal
  abort in ocfs2_commit_thread, which needs to be clarified and
  addressed properly.

- the method of handling irrecoverable journal errors.

Regards,
Anthony
Srivathsa Dara Nov. 21, 2023, 12:19 p.m. UTC | #6
From: Anthony Iliopoulos ailiop@suse.com
Sent: Friday, November 10, 2023 5:28 AM
To: Srivathsa Dara srivathsa.d.dara@oracle.com
Cc: mark@fasheh.com mark@fasheh.com; jlbec@evilplan.org jlbec@evilplan.org; joseph.qi@linux.alibaba.com joseph.qi@linux.alibaba.com; Rajesh Sivaramasubramaniom rajesh.sivaramasubramaniom@oracle.com; Junxiao Bi junxiao.bi@oracle.com; ocfs2-devel@lists.linux.dev ocfs2-devel@lists.linux.dev; Gautham Ananthakrishna gautham.ananthakrishna@oracle.com
Subject: [External] : Re: [PATCH] ocfs2: call ocfs2_abort when journal abort
 
On Thu, Nov 09, 2023 at 07:26:16PM +0000, Srivathsa Dara wrote:
> Hi Anthony,
> thanks for the review
>
> -----Original Message-----
> From: Anthony Iliopoulos ailiop@suse.com
> Sent: Wednesday, November 1, 2023 4:53 AM
> To: Srivathsa Dara srivathsa.d.dara@oracle.com
> Cc: mark@fasheh.com; jlbec@evilplan.org; joseph.qi@linux.alibaba.com;
> Rajesh Sivaramasubramaniom rajesh.sivaramasubramaniom@oracle.com;
> Junxiao Bi junxiao.bi@oracle.com; ocfs2-devel@lists.linux.dev
> Subject: Re: [PATCH] ocfs2: call ocfs2_abort when journal abort
>
> On Mon, Oct 30, 2023 at 12:00:57PM +0000, Srivathsa Dara wrote:
> > From: Ryan Ding ryan.ding@oracle.com
> >
> >
> > journal can not recover from abort state, so we should take following
> > action to prevent file system from corruption:
>
> Could you elaborate on how corruption can occur when the filesystem is
> in this particular state? How could this be reproduced?
>
> [Srivathsa]: Corruption can occur due to storage failure

How does that failure lead from aborting the journal to causing fs
corruption? What exactly gets corrupted after journal abort? Can you
provide more details and possibly a reproducer on this?

[Srivathsa]: Isn't it a bad idea to leave the filesystem like a way it is
with a journal error. Because journal is not working as expected, we
might run into some trouble in future.

> > 1. change to readonly filesystem when local mount. We can not afford
> >    further write, so change to RO state is reasonable.
>
> This is generally the strategy, but unrelated to the code that this commit
> is touching. The ocfs2_commit_thread() only applies to non-local mounts.
>
> [Srivathsa]: Yes, you are right, this is only for cluster mode. I will remove
> all local mount related comments.
>
> > 2. panic when cluster mount. Because we can not release lock resource in
> >    this state, other node will hung when it require a lock owned by this
> >    node. So panic and remaster is a reasonable choise.
>
> Panicking is never a reasonable choice :-) Why couldn't a remastering be
> initiated without going through a reboot instead?
>
> [Srivathsa]: Without reboot, we run into a journal failure, the only option
> is to have fs in read-only mode. Could you suggest any other way to make
> this node release the locks held by it?

Switching to read-only, and self-fencing so that another node will take
over the recovery.
[Srivathsa]: ok got it.

> > ocfs2_abort() will do all the above work.
> >
> > Signed-off-by: Ryan Ding ryan.ding@oracle.com
> > Signed-off-by: Srivathsa Dara srivathsa.d.dara@oracle.com
> > ---
> >  fs/ocfs2/journal.c | 27 +++++++++++++++------------
> >  1 file changed, 15 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index
> > ce215565d061..6dace475f019 100644
> > --- a/fs/ocfs2/journal.c
> > +++ b/fs/ocfs2/journal.c
> > @@ -14,7 +14,6 @@
> >  #include <linux/kthread.h>
> >  #include <linux/time.h>
> >  #include <linux/random.h>
> > -#include <linux/delay.h>
> >  #include <linux/writeback.h>
> >
> >  #include <cluster/masklog.h>
> > @@ -2326,7 +2325,7 @@ static int __ocfs2_wait_on_mount(struct
> > ocfs2_super *osb, int quota)
> >
> >  static int ocfs2_commit_thread(void *arg)  {
> > -   int status;
> > +   int status = 0;
> >      struct ocfs2_super *osb = arg;
> >      struct ocfs2_journal *journal = osb->journal;
> >
> > @@ -2340,21 +2339,25 @@ static int ocfs2_commit_thread(void *arg)
> >              wait_event_interruptible(osb->checkpoint_event,
> >                                       atomic_read(&journal->j_num_trans)
> >                                       || kthread_should_stop());
> > +           if (status < 0) {
> > +                   /* As we can not terminate by ourself, just enter an
> > +                    * empty loop to wait for stop.
> > +                    */
> > +                   continue;
> > +           }
>
> the above is unnecessary, given that below will induce panic upon failure.
>
> >              status = ocfs2_commit_cache(osb);
> >              if (status < 0) {
> > -                   static unsigned long abort_warn_time;
> > -
> > -                   /* Warn about this once per minute */
> > -                   if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> > -                           mlog(ML_ERROR, "status = %d, journal is "
> > -                                           "already aborted.\n", status);
> >                      /*
> > -                    * After ocfs2_commit_cache() fails, j_num_trans has a
> > -                    * non-zero value.  Sleep here to avoid a busy-wait
> > -                    * loop.
> > +                    * journal can not recover from abort state, there is
> > +                    * no need to keep commit cache. So we should either
> > +                    * change to readonly(local mount) or just panic
> > +                    * (cluster mount).
>
> commit cache doesn't apply to local mounts.
>
> > +                    * We should also clear j_num_trans to prevent further
> > +                    * commit.
>
> that's redundant given that we're about to panic.
>
> >                       */
> > -                   msleep_interruptible(1000);
> > +                   atomic_set(&journal->j_num_trans, 0);
>
> same here.
>
> > +                   ocfs2_abort(osb->sb, "Detected aborted journal");
>
> ocfs2_commit_cache returning with an error does not necessarily mean that
> the journal has been aborted; jbd2_journal_flush will return an error for other
> conditions too, some of which could be recoverable.
>
> Therefore unconditionally aborting here is not necessarily warranted, and
> presumably this is why the current code is a retry loop.
>
> [Srivathsa]: Even the journal is not aborted, there is some serious
> error with the journal, keeping it running at this stage could lead
> to corruption.

How, precisely? Irrespective of the journal state, this should not be
leading to any corruptions, the ocfs2_commit_thread will just keep
unsuccessfully attempting a journal flush. Otherwise there's a bug
somewhere that needs to be directly addressed.

> But even if the journal has been aborted at that point, there's no reason why
> the node cannot self-fence and effectively shutdown the local fs, instead of
> aborting with a panic.
>
> [Srivathsa]: By self-fence, did you mean system panic/reboot due
> to heartbeat loss? If so, that can't be guaranteed that it will happen
> because the storage failure can be intermittent. Even if that happens,
> it will panic or reboot, how effectively it will shutdown the local fs,
> considering the journal is failing?

I mean intentionally stopping the heartbeat and switching to read-only.

> [Srivathsa]: will send a v2 addressing the redundant code.

Apart from the redundant code, I see two separate issues:

- the claim that there is a potential fs corruption bug after a journal
  abort in ocfs2_commit_thread, which needs to be clarified and
  addressed properly.

- the method of handling irrecoverable journal errors.
[Srivathsa]:  Switching to read-only and self fencing affect
performance in production systems.

Regards,
Anthony
Anthony Iliopoulos Nov. 21, 2023, 12:39 p.m. UTC | #7
On Tue, Nov 21, 2023 at 12:19:03PM +0000, Srivathsa Dara wrote:
> [Srivathsa]: Isn't it a bad idea to leave the filesystem like a way it is
> with a journal error. Because journal is not working as expected, we
> might run into some trouble in future.

The journal being in an aborted state is part of the expected design. It
implies that the filesystem accordingly should be in read-only state.

Anything else is a bug, and you'd have to be much more specific which
exact bug you're trying to address. Resorting to panic is simply
obscuring any potential issues that may actually need fixing.

> [Srivathsa]:  Switching to read-only and self fencing affect
> performance in production systems.

How so? Again, please elaborate with specific technical details.

Regards,
Anthony
diff mbox series

Patch

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index ce215565d061..6dace475f019 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -14,7 +14,6 @@ 
 #include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/random.h>
-#include <linux/delay.h>
 #include <linux/writeback.h>
 
 #include <cluster/masklog.h>
@@ -2326,7 +2325,7 @@  static int __ocfs2_wait_on_mount(struct ocfs2_super *osb, int quota)
 
 static int ocfs2_commit_thread(void *arg)
 {
-	int status;
+	int status = 0;
 	struct ocfs2_super *osb = arg;
 	struct ocfs2_journal *journal = osb->journal;
 
@@ -2340,21 +2339,25 @@  static int ocfs2_commit_thread(void *arg)
 		wait_event_interruptible(osb->checkpoint_event,
 					 atomic_read(&journal->j_num_trans)
 					 || kthread_should_stop());
+		if (status < 0) {
+			/* As we can not terminate by ourself, just enter an
+			 * empty loop to wait for stop.
+			 */
+			continue;
+		}
 
 		status = ocfs2_commit_cache(osb);
 		if (status < 0) {
-			static unsigned long abort_warn_time;
-
-			/* Warn about this once per minute */
-			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
-				mlog(ML_ERROR, "status = %d, journal is "
-						"already aborted.\n", status);
 			/*
-			 * After ocfs2_commit_cache() fails, j_num_trans has a
-			 * non-zero value.  Sleep here to avoid a busy-wait
-			 * loop.
+			 * journal can not recover from abort state, there is
+			 * no need to keep commit cache. So we should either
+			 * change to readonly(local mount) or just panic
+			 * (cluster mount).
+			 * We should also clear j_num_trans to prevent further
+			 * commit.
 			 */
-			msleep_interruptible(1000);
+			atomic_set(&journal->j_num_trans, 0);
+			ocfs2_abort(osb->sb, "Detected aborted journal");
 		}
 
 		if (kthread_should_stop() && atomic_read(&journal->j_num_trans)){