diff mbox

ocfs2: limit printk when journal is aborted

Message ID 534FB63A.9090402@huawei.com (mailing list archive)
State New, archived
Headers show

Commit Message

Joseph Qi April 17, 2014, 11:08 a.m. UTC
Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
ocfs2_commit_thread. Then it will get into a loop with mass logs. This
will meaninglessly consume a larger number of resource and may lead to
system hung at last.
So limit printk in this case.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
---
 fs/ocfs2/journal.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Mark Fasheh April 17, 2014, 9:01 p.m. UTC | #1
On Thu, Apr 17, 2014 at 07:08:42PM +0800, Joseph Qi wrote:
> 
> Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
> ocfs2_commit_thread. Then it will get into a loop with mass logs. This
> will meaninglessly consume a larger number of resource and may lead to
> system hung at last.
> So limit printk in this case.
> 
> Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
> ---
>  fs/ocfs2/journal.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 44fc3e5..cfefbd1 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -30,6 +30,7 @@
>  #include <linux/kthread.h>
>  #include <linux/time.h>
>  #include <linux/random.h>
> +#include <linux/delay.h>
>  
>  #include <cluster/masklog.h>
>  
> @@ -2191,8 +2192,15 @@ static int ocfs2_commit_thread(void *arg)
>  					 || kthread_should_stop());
>  
>  		status = ocfs2_commit_cache(osb);
> -		if (status < 0)
> -			mlog_errno(status);
> +		if (status < 0) {
> +			static unsigned long abort_warn_time;
> +
> +			/* Warn about this once per minute */
> +			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> +				mlog(ML_ERROR, "status = %d, journal is "
> +						"already aborted.\n", status);
> +			msleep_interruptible(1000);
> +		}

Why the msleep? ocfs2_commit_thread will wait on the checkpoint_event queue
right after this anyway - is there a problem with it waiting on that?

Generally I really don't like peppering msleep() into the code where we
might need to sleep - there is often a more elegant solution available.

Thanks,
	--Mark

--
Mark Fasheh
Joseph Qi April 18, 2014, 1:02 a.m. UTC | #2
On 2014/4/18 5:01, Mark Fasheh wrote:
> On Thu, Apr 17, 2014 at 07:08:42PM +0800, Joseph Qi wrote:
>>
>> Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
>> ocfs2_commit_thread. Then it will get into a loop with mass logs. This
>> will meaninglessly consume a larger number of resource and may lead to
>> system hung at last.
>> So limit printk in this case.
>>
>> Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
>> ---
>>  fs/ocfs2/journal.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index 44fc3e5..cfefbd1 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -30,6 +30,7 @@
>>  #include <linux/kthread.h>
>>  #include <linux/time.h>
>>  #include <linux/random.h>
>> +#include <linux/delay.h>
>>  
>>  #include <cluster/masklog.h>
>>  
>> @@ -2191,8 +2192,15 @@ static int ocfs2_commit_thread(void *arg)
>>  					 || kthread_should_stop());
>>  
>>  		status = ocfs2_commit_cache(osb);
>> -		if (status < 0)
>> -			mlog_errno(status);
>> +		if (status < 0) {
>> +			static unsigned long abort_warn_time;
>> +
>> +			/* Warn about this once per minute */
>> +			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
>> +				mlog(ML_ERROR, "status = %d, journal is "
>> +						"already aborted.\n", status);
>> +			msleep_interruptible(1000);
>> +		}
> 
> Why the msleep? ocfs2_commit_thread will wait on the checkpoint_event queue
> right after this anyway - is there a problem with it waiting on that?
> 
Since jbd2 is already aborted, commit cache is meaningless.

> Generally I really don't like peppering msleep() into the code where we
> might need to sleep - there is often a more elegant solution available.
> 
> Thanks,
> 	--Mark
> 
> --
> Mark Fasheh
> 
>
Mark Fasheh April 18, 2014, 2:45 a.m. UTC | #3
On Fri, Apr 18, 2014 at 09:02:33AM +0800, Joseph Qi wrote:
> On 2014/4/18 5:01, Mark Fasheh wrote:
> > On Thu, Apr 17, 2014 at 07:08:42PM +0800, Joseph Qi wrote:
> >>
> >> Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
> >> ocfs2_commit_thread. Then it will get into a loop with mass logs. This
> >> will meaninglessly consume a larger number of resource and may lead to
> >> system hung at last.
> >> So limit printk in this case.
> >>
> >> Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
> >> ---
> >>  fs/ocfs2/journal.c | 12 ++++++++++--
> >>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> >> index 44fc3e5..cfefbd1 100644
> >> --- a/fs/ocfs2/journal.c
> >> +++ b/fs/ocfs2/journal.c
> >> @@ -30,6 +30,7 @@
> >>  #include <linux/kthread.h>
> >>  #include <linux/time.h>
> >>  #include <linux/random.h>
> >> +#include <linux/delay.h>
> >>  
> >>  #include <cluster/masklog.h>
> >>  
> >> @@ -2191,8 +2192,15 @@ static int ocfs2_commit_thread(void *arg)
> >>  					 || kthread_should_stop());
> >>  
> >>  		status = ocfs2_commit_cache(osb);
> >> -		if (status < 0)
> >> -			mlog_errno(status);
> >> +		if (status < 0) {
> >> +			static unsigned long abort_warn_time;
> >> +
> >> +			/* Warn about this once per minute */
> >> +			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
> >> +				mlog(ML_ERROR, "status = %d, journal is "
> >> +						"already aborted.\n", status);
> >> +			msleep_interruptible(1000);
> >> +		}
> > 
> > Why the msleep? ocfs2_commit_thread will wait on the checkpoint_event queue
> > right after this anyway - is there a problem with it waiting on that?
> > 
> Since jbd2 is already aborted, commit cache is meaningless.

I understand that, but I'm asking why the msleep and whether we can avoid
that. To go back to my question:

"ocfs2_commit_thread will wait on the checkpoint_event queue right after
this anyway - is there a problem with it waiting on that?"

Thanks,
	--Mark

--
Mark Fasheh
Joseph Qi April 18, 2014, 9:18 a.m. UTC | #4
On 2014/4/18 10:45, Mark Fasheh wrote:
> On Fri, Apr 18, 2014 at 09:02:33AM +0800, Joseph Qi wrote:
>> On 2014/4/18 5:01, Mark Fasheh wrote:
>>> On Thu, Apr 17, 2014 at 07:08:42PM +0800, Joseph Qi wrote:
>>>>
>>>> Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
>>>> ocfs2_commit_thread. Then it will get into a loop with mass logs. This
>>>> will meaninglessly consume a larger number of resource and may lead to
>>>> system hung at last.
>>>> So limit printk in this case.
>>>>
>>>> Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
>>>> ---
>>>>  fs/ocfs2/journal.c | 12 ++++++++++--
>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>> index 44fc3e5..cfefbd1 100644
>>>> --- a/fs/ocfs2/journal.c
>>>> +++ b/fs/ocfs2/journal.c
>>>> @@ -30,6 +30,7 @@
>>>>  #include <linux/kthread.h>
>>>>  #include <linux/time.h>
>>>>  #include <linux/random.h>
>>>> +#include <linux/delay.h>
>>>>  
>>>>  #include <cluster/masklog.h>
>>>>  
>>>> @@ -2191,8 +2192,15 @@ static int ocfs2_commit_thread(void *arg)
>>>>  					 || kthread_should_stop());
>>>>  
>>>>  		status = ocfs2_commit_cache(osb);
>>>> -		if (status < 0)
>>>> -			mlog_errno(status);
>>>> +		if (status < 0) {
>>>> +			static unsigned long abort_warn_time;
>>>> +
>>>> +			/* Warn about this once per minute */
>>>> +			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
>>>> +				mlog(ML_ERROR, "status = %d, journal is "
>>>> +						"already aborted.\n", status);
>>>> +			msleep_interruptible(1000);
>>>> +		}
>>>
>>> Why the msleep? ocfs2_commit_thread will wait on the checkpoint_event queue
>>> right after this anyway - is there a problem with it waiting on that?
>>>
>> Since jbd2 is already aborted, commit cache is meaningless.
> 
> I understand that, but I'm asking why the msleep and whether we can avoid
> that. To go back to my question:
> 
> "ocfs2_commit_thread will wait on the checkpoint_event queue right after
> this anyway - is there a problem with it waiting on that?"
> 
> Thanks,
> 	--Mark
Sorry for my obscure description.
If ocfs2_commit_cache fails because of JBD2_ABORT, j_num_trans won't be cleared.
Then the condition of checkpoint event still evaluates true, so it won't wait.
> 
> --
> Mark Fasheh
> 
> .
>
diff mbox

Patch

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 44fc3e5..cfefbd1 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -30,6 +30,7 @@ 
 #include <linux/kthread.h>
 #include <linux/time.h>
 #include <linux/random.h>
+#include <linux/delay.h>
 
 #include <cluster/masklog.h>
 
@@ -2191,8 +2192,15 @@  static int ocfs2_commit_thread(void *arg)
 					 || kthread_should_stop());
 
 		status = ocfs2_commit_cache(osb);
-		if (status < 0)
-			mlog_errno(status);
+		if (status < 0) {
+			static unsigned long abort_warn_time;
+
+			/* Warn about this once per minute */
+			if (printk_timed_ratelimit(&abort_warn_time, 60*HZ))
+				mlog(ML_ERROR, "status = %d, journal is "
+						"already aborted.\n", status);
+			msleep_interruptible(1000);
+		}
 
 		if (kthread_should_stop() && atomic_read(&journal->j_num_trans)){
 			mlog(ML_KTHREAD,