ocfs2/cluster: unlock the o2hb_live_lock before the o2nm_depend_item()
diff mbox

Message ID 59F86F83.7010501@huawei.com
State New
Headers show

Commit Message

zhendong chen Oct. 31, 2017, 12:41 p.m. UTC
In the following situation, the down_write() will be called under
the spin_lock(), which may lead a soft lockup:
o2hb_region_inc_user
 spin_lock(&o2hb_live_lock)
  o2hb_region_pin
   o2nm_depend_item
    configfs_depend_item
     inode_lock
      down_write
      -->here may sleep and reschedule

So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
get item reference in advance to prevent the region to be released.

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
---
 fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
 1 file changed, 8 insertions(+)

-- 1.9.5.msysgit.1

Comments

Joseph Qi Nov. 1, 2017, 1:03 a.m. UTC | #1
Hi Alex,

On 17/10/31 20:41, alex chen wrote:
> In the following situation, the down_write() will be called under
> the spin_lock(), which may lead a soft lockup:
> o2hb_region_inc_user
>  spin_lock(&o2hb_live_lock)
>   o2hb_region_pin
>    o2nm_depend_item
>     configfs_depend_item
>      inode_lock
>       down_write
>       -->here may sleep and reschedule
> 
> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
> get item reference in advance to prevent the region to be released.
> 
> Signed-off-by: Alex Chen <alex.chen@huawei.com>
> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
> Reviewed-by: Jun Piao <piaojun@huawei.com>
> ---
>  fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
> index d020604..f1142a9 100644
> --- a/fs/ocfs2/cluster/heartbeat.c
> +++ b/fs/ocfs2/cluster/heartbeat.c
> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>  		if (reg->hr_item_pinned || reg->hr_item_dropped)
>  			goto skip_pin;
> 
> +		config_item_get(&reg->hr_item);
> +		spin_unlock(&o2hb_live_lock);
> +
If unlock here, the iteration of o2hb_all_regions is no longer safe.

Thanks,
Joseph

>  		/* Ignore ENOENT only for local hb (userdlm domain) */
>  		ret = o2nm_depend_item(&reg->hr_item);
>  		if (!ret) {
> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>  			else {
>  				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>  				     uuid, ret);
> +				config_item_put(&reg->hr_item);
> +				spin_lock(&o2hb_live_lock);
>  				break;
>  			}
>  		}
> +
> +		config_item_put(&reg->hr_item);
> +		spin_lock(&o2hb_live_lock);
>  skip_pin:
>  		if (found)
>  			break;
> -- 1.9.5.msysgit.1
> 
>
Changwei Ge Nov. 1, 2017, 1:14 a.m. UTC | #2
On 2017/11/1 9:05, Joseph Qi wrote:
> Hi Alex,
> 
> On 17/10/31 20:41, alex chen wrote:
>> In the following situation, the down_write() will be called under
>> the spin_lock(), which may lead a soft lockup:
>> o2hb_region_inc_user
>>   spin_lock(&o2hb_live_lock)
>>    o2hb_region_pin
>>     o2nm_depend_item
>>      configfs_depend_item
>>       inode_lock
>>        down_write
>>        -->here may sleep and reschedule
>>
>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>> get item reference in advance to prevent the region to be released.

Hi Alex,
Actually, I don't figure why this code path will lead to a soft lockup 
yet. Can you elaborate further?

And I agree with Joseph that simply unlocking o2hb_live_lock will 
introduce a race here since we are waking through that global list.

Thanks,
Changwei

>>
>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>> ---
>>   fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>> index d020604..f1142a9 100644
>> --- a/fs/ocfs2/cluster/heartbeat.c
>> +++ b/fs/ocfs2/cluster/heartbeat.c
>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>   		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>   			goto skip_pin;
>>
>> +		config_item_get(&reg->hr_item);
>> +		spin_unlock(&o2hb_live_lock);
>> +
> If unlock here, the iteration of o2hb_all_regions is no longer safe.
> 
> Thanks,
> Joseph
> 
>>   		/* Ignore ENOENT only for local hb (userdlm domain) */
>>   		ret = o2nm_depend_item(&reg->hr_item);
>>   		if (!ret) {
>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>   			else {
>>   				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>   				     uuid, ret);
>> +				config_item_put(&reg->hr_item);
>> +				spin_lock(&o2hb_live_lock);
>>   				break;
>>   			}
>>   		}
>> +
>> +		config_item_put(&reg->hr_item);
>> +		spin_lock(&o2hb_live_lock);
>>   skip_pin:
>>   		if (found)
>>   			break;
>> -- 1.9.5.msysgit.1
>>
>>
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
zhendong chen Nov. 1, 2017, 7:01 a.m. UTC | #3
Hi Joseph and Changwei,

It's our basic principle that the function in which may sleep can't be called
within spinlock hold.

On 2017/11/1 9:03, Joseph Qi wrote:
> Hi Alex,
> 
> On 17/10/31 20:41, alex chen wrote:
>> In the following situation, the down_write() will be called under
>> the spin_lock(), which may lead a soft lockup:
>> o2hb_region_inc_user
>>  spin_lock(&o2hb_live_lock)
>>   o2hb_region_pin
>>    o2nm_depend_item
>>     configfs_depend_item
>>      inode_lock
>>       down_write
>>       -->here may sleep and reschedule
>>
>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>> get item reference in advance to prevent the region to be released.
>>
>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>> ---
>>  fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>> index d020604..f1142a9 100644
>> --- a/fs/ocfs2/cluster/heartbeat.c
>> +++ b/fs/ocfs2/cluster/heartbeat.c
>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>  		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>  			goto skip_pin;
>>
>> +		config_item_get(&reg->hr_item);
>> +		spin_unlock(&o2hb_live_lock);
>> +
> If unlock here, the iteration of o2hb_all_regions is no longer safe.
> 
> Thanks,
> Joseph
> 

In local heartbeat mode, here we already found the region and will break the loop after
depending item, we get the item reference before spin_unlock(), that means the region will
never be released by the o2hb_region_release() until we put the item reference after
spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
In global heartbeat mode, it doesn't matter that some regions may be deleted after
spin_unlock(), because we just pin all the active regions.

Thanks,
Alex

>>  		/* Ignore ENOENT only for local hb (userdlm domain) */
>>  		ret = o2nm_depend_item(&reg->hr_item);
>>  		if (!ret) {
>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>  			else {
>>  				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>  				     uuid, ret);
>> +				config_item_put(&reg->hr_item);
>> +				spin_lock(&o2hb_live_lock);
>>  				break;
>>  			}
>>  		}
>> +
>> +		config_item_put(&reg->hr_item);
>> +		spin_lock(&o2hb_live_lock);
>>  skip_pin:
>>  		if (found)
>>  			break;
>> -- 1.9.5.msysgit.1
>>
>>
> 
> .
>
Changwei Ge Nov. 1, 2017, 8:28 a.m. UTC | #4
Hi Alex,

On 2017/11/1 15:05, alex chen wrote:
> Hi Joseph and Changwei,
> 
> It's our basic principle that the function in which may sleep can't be called
> within spinlock hold.

I suppose this principle is a suggestion not a restriction.

> 
> On 2017/11/1 9:03, Joseph Qi wrote:
>> Hi Alex,
>>
>> On 17/10/31 20:41, alex chen wrote:
>>> In the following situation, the down_write() will be called under
>>> the spin_lock(), which may lead a soft lockup:
>>> o2hb_region_inc_user
>>>   spin_lock(&o2hb_live_lock)
>>>    o2hb_region_pin
>>>     o2nm_depend_item
>>>      configfs_depend_item
>>>       inode_lock
>>>        down_write
>>>        -->here may sleep and reschedule
>>>
>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>> get item reference in advance to prevent the region to be released.
>>>
>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>> ---
>>>   fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>   1 file changed, 8 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>> index d020604..f1142a9 100644
>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>   		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>   			goto skip_pin;
>>>
>>> +		config_item_get(&reg->hr_item);
>>> +		spin_unlock(&o2hb_live_lock);
>>> +
>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>
>> Thanks,
>> Joseph
>>
> 
> In local heartbeat mode, here we already found the region and will break the loop after
> depending item, we get the item reference before spin_unlock(), that means the region will
> never be released by the o2hb_region_release() until we put the item reference after
> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
> In global heartbeat mode, it doesn't matter that some regions may be deleted after
> spin_unlock(), because we just pin all the active regions.

But we are still iterating elements on global list  *o2hb_all_regions*, 
right? How can we guarantee that no more elements will be attached or 
detached while the lock is unlocked.

> 
> Thanks,
> Alex
> 
>>>   		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>   		ret = o2nm_depend_item(&reg->hr_item);
>>>   		if (!ret) {
>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>   			else {
>>>   				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>   				     uuid, ret);
>>> +				config_item_put(&reg->hr_item);
>>> +				spin_lock(&o2hb_live_lock);
>>>   				break;
>>>   			}
>>>   		}
>>> +
>>> +		config_item_put(&reg->hr_item);
>>> +		spin_lock(&o2hb_live_lock);
>>>   skip_pin:
>>>   		if (found)
>>>   			break;
>>> -- 1.9.5.msysgit.1
>>>
>>>
>>
>> .
>>
> 
>
zhendong chen Nov. 1, 2017, 9:47 a.m. UTC | #5
Hi Changwei,

Thanks for you reply.

On 2017/11/1 16:28, Changwei Ge wrote:
> Hi Alex,
> 
> On 2017/11/1 15:05, alex chen wrote:
>> Hi Joseph and Changwei,
>>
>> It's our basic principle that the function in which may sleep can't be called
>> within spinlock hold.
> 
> I suppose this principle is a suggestion not a restriction.
> 

It is a very very bad idea to sleep while holding a spin lock.
If a process grabs a spinlock and goes to sleep before releasing it.
Then this process yields the control of the CPU to a second process.
Unfortunately the second process also need to grab the spinlock and it will
busy wait. On an uniprocessor machine the second process will lock the CPU not
allowing the first process to wake up and release the spinlock so the second
process can't continue, it is basically a deadlock.

>>
>> On 2017/11/1 9:03, Joseph Qi wrote:
>>> Hi Alex,
>>>
>>> On 17/10/31 20:41, alex chen wrote:
>>>> In the following situation, the down_write() will be called under
>>>> the spin_lock(), which may lead a soft lockup:
>>>> o2hb_region_inc_user
>>>>   spin_lock(&o2hb_live_lock)
>>>>    o2hb_region_pin
>>>>     o2nm_depend_item
>>>>      configfs_depend_item
>>>>       inode_lock
>>>>        down_write
>>>>        -->here may sleep and reschedule
>>>>
>>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>>> get item reference in advance to prevent the region to be released.
>>>>
>>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>>> ---
>>>>   fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>>   1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>> index d020604..f1142a9 100644
>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>   		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>>   			goto skip_pin;
>>>>
>>>> +		config_item_get(&reg->hr_item);
>>>> +		spin_unlock(&o2hb_live_lock);
>>>> +
>>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>>
>>> Thanks,
>>> Joseph
>>>
>>
>> In local heartbeat mode, here we already found the region and will break the loop after
>> depending item, we get the item reference before spin_unlock(), that means the region will
>> never be released by the o2hb_region_release() until we put the item reference after
>> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
>> In global heartbeat mode, it doesn't matter that some regions may be deleted after
>> spin_unlock(), because we just pin all the active regions.
> 
> But we are still iterating elements on global list  *o2hb_all_regions*, 
> right? How can we guarantee that no more elements will be attached or 
> detached while the lock is unlocked.
> 

In global heartbeat mode, we pin all regions if there is the first dependent user and
unpin all regions if the number of dependent users falls to zero.
So there is not exist another region will be attached or detached after spin_unlock().

Thank,
Alex
>>
>> Thanks,
>> Alex
>>
>>>>   		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>>   		ret = o2nm_depend_item(&reg->hr_item);
>>>>   		if (!ret) {
>>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>   			else {
>>>>   				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>>   				     uuid, ret);
>>>> +				config_item_put(&reg->hr_item);
>>>> +				spin_lock(&o2hb_live_lock);
>>>>   				break;
>>>>   			}
>>>>   		}
>>>> +
>>>> +		config_item_put(&reg->hr_item);
>>>> +		spin_lock(&o2hb_live_lock);
>>>>   skip_pin:
>>>>   		if (found)
>>>>   			break;
>>>> -- 1.9.5.msysgit.1
>>>>
>>>>
>>>
>>> .
>>>
>>
>>
> 
> 
> .
>
Changwei Ge Nov. 1, 2017, 9:59 a.m. UTC | #6
Hi Alex,

On 2017/11/1 17:52, alex chen wrote:
> Hi Changwei,
> 
> Thanks for you reply.
> 
> On 2017/11/1 16:28, Changwei Ge wrote:
>> Hi Alex,
>>
>> On 2017/11/1 15:05, alex chen wrote:
>>> Hi Joseph and Changwei,
>>>
>>> It's our basic principle that the function in which may sleep can't be called
>>> within spinlock hold.
>>
>> I suppose this principle is a suggestion not a restriction.
>>
> 
> It is a very very bad idea to sleep while holding a spin lock.
> If a process grabs a spinlock and goes to sleep before releasing it.
> Then this process yields the control of the CPU to a second process.
> Unfortunately the second process also need to grab the spinlock and it will
> busy wait. On an uniprocessor machine the second process will lock the CPU not
> allowing the first process to wake up and release the spinlock so the second
> process can't continue, it is basically a deadlock.

I agree. This soft lockup truly exists. This point is OK for me.

> 
>>>
>>> On 2017/11/1 9:03, Joseph Qi wrote:
>>>> Hi Alex,
>>>>
>>>> On 17/10/31 20:41, alex chen wrote:
>>>>> In the following situation, the down_write() will be called under
>>>>> the spin_lock(), which may lead a soft lockup:
>>>>> o2hb_region_inc_user
>>>>>    spin_lock(&o2hb_live_lock)
>>>>>     o2hb_region_pin
>>>>>      o2nm_depend_item
>>>>>       configfs_depend_item
>>>>>        inode_lock
>>>>>         down_write
>>>>>         -->here may sleep and reschedule
>>>>>
>>>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>>>> get item reference in advance to prevent the region to be released.
>>>>>
>>>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>>>> ---
>>>>>    fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>>>    1 file changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>>> index d020604..f1142a9 100644
>>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>    		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>>>    			goto skip_pin;
>>>>>
>>>>> +		config_item_get(&reg->hr_item);
>>>>> +		spin_unlock(&o2hb_live_lock);
>>>>> +
>>>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>>
>>> In local heartbeat mode, here we already found the region and will break the loop after
>>> depending item, we get the item reference before spin_unlock(), that means the region will
>>> never be released by the o2hb_region_release() until we put the item reference after
>>> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
>>> In global heartbeat mode, it doesn't matter that some regions may be deleted after
>>> spin_unlock(), because we just pin all the active regions.
>>
>> But we are still iterating elements on global list  *o2hb_all_regions*,
>> right? How can we guarantee that no more elements will be attached or
>> detached while the lock is unlocked.
>>
> 
> In global heartbeat mode, we pin all regions if there is the first dependent user and
> unpin all regions if the number of dependent users falls to zero.
> So there is not exist another region will be attached or detached after spin_unlock().

Well, as for local heartbeat mode, I think, we still need to protect 
that global list.

Thanks,
Changwei

> 
> Thank,
> Alex
>>>
>>> Thanks,
>>> Alex
>>>
>>>>>    		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>>>    		ret = o2nm_depend_item(&reg->hr_item);
>>>>>    		if (!ret) {
>>>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>    			else {
>>>>>    				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>>>    				     uuid, ret);
>>>>> +				config_item_put(&reg->hr_item);
>>>>> +				spin_lock(&o2hb_live_lock);
>>>>>    				break;
>>>>>    			}
>>>>>    		}
>>>>> +
>>>>> +		config_item_put(&reg->hr_item);
>>>>> +		spin_lock(&o2hb_live_lock);
>>>>>    skip_pin:
>>>>>    		if (found)
>>>>>    			break;
>>>>> -- 1.9.5.msysgit.1
>>>>>
>>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>>
>> .
>>
> 
>
zhendong chen Nov. 2, 2017, 3:43 a.m. UTC | #7
Hi Changwei,

On 2017/11/1 17:59, Changwei Ge wrote:
> Hi Alex,
> 
> On 2017/11/1 17:52, alex chen wrote:
>> Hi Changwei,
>>
>> Thanks for you reply.
>>
>> On 2017/11/1 16:28, Changwei Ge wrote:
>>> Hi Alex,
>>>
>>> On 2017/11/1 15:05, alex chen wrote:
>>>> Hi Joseph and Changwei,
>>>>
>>>> It's our basic principle that the function in which may sleep can't be called
>>>> within spinlock hold.
>>>
>>> I suppose this principle is a suggestion not a restriction.
>>>
>>
>> It is a very very bad idea to sleep while holding a spin lock.
>> If a process grabs a spinlock and goes to sleep before releasing it.
>> Then this process yields the control of the CPU to a second process.
>> Unfortunately the second process also need to grab the spinlock and it will
>> busy wait. On an uniprocessor machine the second process will lock the CPU not
>> allowing the first process to wake up and release the spinlock so the second
>> process can't continue, it is basically a deadlock.
> 
> I agree. This soft lockup truly exists. This point is OK for me.
> 
>>
>>>>
>>>> On 2017/11/1 9:03, Joseph Qi wrote:
>>>>> Hi Alex,
>>>>>
>>>>> On 17/10/31 20:41, alex chen wrote:
>>>>>> In the following situation, the down_write() will be called under
>>>>>> the spin_lock(), which may lead a soft lockup:
>>>>>> o2hb_region_inc_user
>>>>>>    spin_lock(&o2hb_live_lock)
>>>>>>     o2hb_region_pin
>>>>>>      o2nm_depend_item
>>>>>>       configfs_depend_item
>>>>>>        inode_lock
>>>>>>         down_write
>>>>>>         -->here may sleep and reschedule
>>>>>>
>>>>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>>>>> get item reference in advance to prevent the region to be released.
>>>>>>
>>>>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>>>>> ---
>>>>>>    fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>>>>    1 file changed, 8 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>>>> index d020604..f1142a9 100644
>>>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>    		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>>>>    			goto skip_pin;
>>>>>>
>>>>>> +		config_item_get(&reg->hr_item);
>>>>>> +		spin_unlock(&o2hb_live_lock);
>>>>>> +
>>>>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>>>
>>>>
>>>> In local heartbeat mode, here we already found the region and will break the loop after
>>>> depending item, we get the item reference before spin_unlock(), that means the region will
>>>> never be released by the o2hb_region_release() until we put the item reference after
>>>> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
>>>> In global heartbeat mode, it doesn't matter that some regions may be deleted after
>>>> spin_unlock(), because we just pin all the active regions.
>>>
>>> But we are still iterating elements on global list  *o2hb_all_regions*,
>>> right? How can we guarantee that no more elements will be attached or
>>> detached while the lock is unlocked.
>>>
>>
>> In global heartbeat mode, we pin all regions if there is the first dependent user and
>> unpin all regions if the number of dependent users falls to zero.
>> So there is not exist another region will be attached or detached after spin_unlock().
> 
> Well, as for local heartbeat mode, I think, we still need to protect 
> that global list.
> 
> Thanks,
> Changwei
> 
For the local heartbeat mode, as I said before, here we already found the region and will
break the loop after depending item, so we just need to protect the region during depending
item. The global list doesn't need to protect anymore.

I find a problem in my patch. The reg->hr_item_pinned should be protect by the o2hb_live_lock,
so we should call spin_lock(&o2hb_live_lock) immediately after o2nm_depend_item();
I will modified the patch and resend the patch.

Thank,
Alex
>>
>> Thank,
>> Alex
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>>>>    		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>>>>    		ret = o2nm_depend_item(&reg->hr_item);
>>>>>>    		if (!ret) {
>>>>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>    			else {
>>>>>>    				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>>>>    				     uuid, ret);
>>>>>> +				config_item_put(&reg->hr_item);
>>>>>> +				spin_lock(&o2hb_live_lock);
>>>>>>    				break;
>>>>>>    			}
>>>>>>    		}
>>>>>> +
>>>>>> +		config_item_put(&reg->hr_item);
>>>>>> +		spin_lock(&o2hb_live_lock);
>>>>>>    skip_pin:
>>>>>>    		if (found)
>>>>>>    			break;
>>>>>> -- 1.9.5.msysgit.1
>>>>>>
>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>>
>>>
>>> .
>>>
>>
>>
> 
> 
> .
>
Changwei Ge Nov. 2, 2017, 5:58 a.m. UTC | #8
On 2017/11/2 11:45, alex chen wrote:
> Hi Changwei,
> 
> On 2017/11/1 17:59, Changwei Ge wrote:
>> Hi Alex,
>>
>> On 2017/11/1 17:52, alex chen wrote:
>>> Hi Changwei,
>>>
>>> Thanks for you reply.
>>>
>>> On 2017/11/1 16:28, Changwei Ge wrote:
>>>> Hi Alex,
>>>>
>>>> On 2017/11/1 15:05, alex chen wrote:
>>>>> Hi Joseph and Changwei,
>>>>>
>>>>> It's our basic principle that the function in which may sleep can't be called
>>>>> within spinlock hold.
>>>>
>>>> I suppose this principle is a suggestion not a restriction.
>>>>
>>>
>>> It is a very very bad idea to sleep while holding a spin lock.
>>> If a process grabs a spinlock and goes to sleep before releasing it.
>>> Then this process yields the control of the CPU to a second process.
>>> Unfortunately the second process also need to grab the spinlock and it will
>>> busy wait. On an uniprocessor machine the second process will lock the CPU not
>>> allowing the first process to wake up and release the spinlock so the second
>>> process can't continue, it is basically a deadlock.
>>
>> I agree. This soft lockup truly exists. This point is OK for me.
>>
>>>
>>>>>
>>>>> On 2017/11/1 9:03, Joseph Qi wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> On 17/10/31 20:41, alex chen wrote:
>>>>>>> In the following situation, the down_write() will be called under
>>>>>>> the spin_lock(), which may lead a soft lockup:
>>>>>>> o2hb_region_inc_user
>>>>>>>     spin_lock(&o2hb_live_lock)
>>>>>>>      o2hb_region_pin
>>>>>>>       o2nm_depend_item
>>>>>>>        configfs_depend_item
>>>>>>>         inode_lock
>>>>>>>          down_write
>>>>>>>          -->here may sleep and reschedule
>>>>>>>
>>>>>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>>>>>> get item reference in advance to prevent the region to be released.
>>>>>>>
>>>>>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>>>>>> ---
>>>>>>>     fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>>>>>     1 file changed, 8 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>>>>> index d020604..f1142a9 100644
>>>>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>>>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>>     		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>>>>>     			goto skip_pin;
>>>>>>>
>>>>>>> +		config_item_get(&reg->hr_item);
>>>>>>> +		spin_unlock(&o2hb_live_lock);
>>>>>>> +
>>>>>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>>>>>
>>>>>> Thanks,
>>>>>> Joseph
>>>>>>
>>>>>
>>>>> In local heartbeat mode, here we already found the region and will break the loop after
>>>>> depending item, we get the item reference before spin_unlock(), that means the region will
>>>>> never be released by the o2hb_region_release() until we put the item reference after
>>>>> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
>>>>> In global heartbeat mode, it doesn't matter that some regions may be deleted after
>>>>> spin_unlock(), because we just pin all the active regions.
>>>>
>>>> But we are still iterating elements on global list  *o2hb_all_regions*,
>>>> right? How can we guarantee that no more elements will be attached or
>>>> detached while the lock is unlocked.
>>>>
>>>
>>> In global heartbeat mode, we pin all regions if there is the first dependent user and
>>> unpin all regions if the number of dependent users falls to zero.
>>> So there is not exist another region will be attached or detached after spin_unlock().
>>
>> Well, as for local heartbeat mode, I think, we still need to protect
>> that global list.
>>
>> Thanks,
>> Changwei
>>
> For the local heartbeat mode, as I said before, here we already found the region and will
> break the loop after depending item, so we just need to protect the region during depending
> item. The global list doesn't need to protect anymore.

Hi Alex,
Um, I think this is too tricky.

> 
> I find a problem in my patch. The reg->hr_item_pinned should be protect by the o2hb_live_lock,
> so we should call spin_lock(&o2hb_live_lock) immediately after o2nm_depend_item();
> I will modified the patch and resend the patch
Well, if you do so to revise your patch, how can you keep the 
consistency between setting ::hr_item_pinned and pining ::hr_item.

What will ::hr_item_pinned stand for against your revision?
I think this fix is fragile.

Besides that the issue reported by Fungguang seems to be a crash issue 
rather than a lockup one.

Thanks,
Changwei

> 
> Thank,
> Alex
>>>
>>> Thank,
>>> Alex
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>>>>     		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>>>>>     		ret = o2nm_depend_item(&reg->hr_item);
>>>>>>>     		if (!ret) {
>>>>>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>>     			else {
>>>>>>>     				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>>>>>     				     uuid, ret);
>>>>>>> +				config_item_put(&reg->hr_item);
>>>>>>> +				spin_lock(&o2hb_live_lock);
>>>>>>>     				break;
>>>>>>>     			}
>>>>>>>     		}
>>>>>>> +
>>>>>>> +		config_item_put(&reg->hr_item);
>>>>>>> +		spin_lock(&o2hb_live_lock);
>>>>>>>     skip_pin:
>>>>>>>     		if (found)
>>>>>>>     			break;
>>>>>>> -- 1.9.5.msysgit.1
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> .
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>>
>> .
>>
> 
>
zhendong chen Nov. 2, 2017, 10:50 a.m. UTC | #9
Hi Changwei,

Thanks for you reply.

On 2017/11/2 13:58, Changwei Ge wrote:
> On 2017/11/2 11:45, alex chen wrote:
>> Hi Changwei,
>>
>> On 2017/11/1 17:59, Changwei Ge wrote:
>>> Hi Alex,
>>>
>>> On 2017/11/1 17:52, alex chen wrote:
>>>> Hi Changwei,
>>>>
>>>> Thanks for you reply.
>>>>
>>>> On 2017/11/1 16:28, Changwei Ge wrote:
>>>>> Hi Alex,
>>>>>
>>>>> On 2017/11/1 15:05, alex chen wrote:
>>>>>> Hi Joseph and Changwei,
>>>>>>
>>>>>> It's our basic principle that the function in which may sleep can't be called
>>>>>> within spinlock hold.
>>>>>
>>>>> I suppose this principle is a suggestion not a restriction.
>>>>>
>>>>
>>>> It is a very very bad idea to sleep while holding a spin lock.
>>>> If a process grabs a spinlock and goes to sleep before releasing it.
>>>> Then this process yields the control of the CPU to a second process.
>>>> Unfortunately the second process also need to grab the spinlock and it will
>>>> busy wait. On an uniprocessor machine the second process will lock the CPU not
>>>> allowing the first process to wake up and release the spinlock so the second
>>>> process can't continue, it is basically a deadlock.
>>>
>>> I agree. This soft lockup truly exists. This point is OK for me.
>>>
>>>>
>>>>>>
>>>>>> On 2017/11/1 9:03, Joseph Qi wrote:
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> On 17/10/31 20:41, alex chen wrote:
>>>>>>>> In the following situation, the down_write() will be called under
>>>>>>>> the spin_lock(), which may lead a soft lockup:
>>>>>>>> o2hb_region_inc_user
>>>>>>>>     spin_lock(&o2hb_live_lock)
>>>>>>>>      o2hb_region_pin
>>>>>>>>       o2nm_depend_item
>>>>>>>>        configfs_depend_item
>>>>>>>>         inode_lock
>>>>>>>>          down_write
>>>>>>>>          -->here may sleep and reschedule
>>>>>>>>
>>>>>>>> So we should unlock the o2hb_live_lock before the o2nm_depend_item(), and
>>>>>>>> get item reference in advance to prevent the region to be released.
>>>>>>>>
>>>>>>>> Signed-off-by: Alex Chen <alex.chen@huawei.com>
>>>>>>>> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>>>>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>>>>>>>> ---
>>>>>>>>     fs/ocfs2/cluster/heartbeat.c | 8 ++++++++
>>>>>>>>     1 file changed, 8 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>>>>>>>> index d020604..f1142a9 100644
>>>>>>>> --- a/fs/ocfs2/cluster/heartbeat.c
>>>>>>>> +++ b/fs/ocfs2/cluster/heartbeat.c
>>>>>>>> @@ -2399,6 +2399,9 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>>>     		if (reg->hr_item_pinned || reg->hr_item_dropped)
>>>>>>>>     			goto skip_pin;
>>>>>>>>
>>>>>>>> +		config_item_get(&reg->hr_item);
>>>>>>>> +		spin_unlock(&o2hb_live_lock);
>>>>>>>> +
>>>>>>> If unlock here, the iteration of o2hb_all_regions is no longer safe.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joseph
>>>>>>>
>>>>>>
>>>>>> In local heartbeat mode, here we already found the region and will break the loop after
>>>>>> depending item, we get the item reference before spin_unlock(), that means the region will
>>>>>> never be released by the o2hb_region_release() until we put the item reference after
>>>>>> spin_lock(&o2hb_live_lock), so we can safely iterate over the list.
>>>>>> In global heartbeat mode, it doesn't matter that some regions may be deleted after
>>>>>> spin_unlock(), because we just pin all the active regions.
>>>>>
>>>>> But we are still iterating elements on global list  *o2hb_all_regions*,
>>>>> right? How can we guarantee that no more elements will be attached or
>>>>> detached while the lock is unlocked.
>>>>>
>>>>
>>>> In global heartbeat mode, we pin all regions if there is the first dependent user and
>>>> unpin all regions if the number of dependent users falls to zero.
>>>> So there is not exist another region will be attached or detached after spin_unlock().
>>>
>>> Well, as for local heartbeat mode, I think, we still need to protect
>>> that global list.
>>>
>>> Thanks,
>>> Changwei
>>>
>> For the local heartbeat mode, as I said before, here we already found the region and will
>> break the loop after depending item, so we just need to protect the region during depending
>> item. The global list doesn't need to protect anymore.
> 
> Hi Alex,
> Um, I think this is too tricky.
>

The o2hb_live_lock mainly protect the "o2hb_all_regions" not the region and we will not iterate
the list o2hb_all_regions any more after we find the region. So we can unlock here.
If you have any other opinions, please let me know and I really appreciate you if you can express
your opinions as detailed as possible.

>>
>> I find a problem in my patch. The reg->hr_item_pinned should be protect by the o2hb_live_lock,
>> so we should call spin_lock(&o2hb_live_lock) immediately after o2nm_depend_item();
>> I will modified the patch and resend the patch
> Well, if you do so to revise your patch, how can you keep the 
> consistency between setting ::hr_item_pinned and pining ::hr_item.
> 
> What will ::hr_item_pinned stand for against your revision?
> I think this fix is fragile.
> 

Firstly, the ::hr_item_pinned is set in o2hb_region_pin() when mounting filesystem and cleared
in o2hb_region_unpin() when umounting filesystem. mount and umount the filesystem can only be called
one by one. So the ::hr_item_pinned will not be set and cleared concurrently.
Secondly, Although the o2hb_region_pin() is multiple calls, but are called serial in the function
dlm_register_domain_handlers(). So the ::hr_item_pinned will not be set concurrently.
So we don't need to worry about the consistency of the ::hr_item_pinned.

> Besides that the issue reported by Fungguang seems to be a crash issue 
> rather than a lockup one.
> 
> Thanks,
> Changwei
> 
Um, I think the CONFIG_DEBUG_ATOMIC_SLEEP had been defined in the kernel tested by Fengguang,
so the warning "BUG: sleeping function..." could be printed. But normally, the
CONFIG_DEBUG_ATOMIC_SLEEP would not be defined in the released kernel. so it is a soft lockup.
You can read the function ___might_sleep() for more details.

Thanks,
Alex
>>
>> Thank,
>> Alex
>>>>
>>>> Thank,
>>>> Alex
>>>>>>
>>>>>> Thanks,
>>>>>> Alex
>>>>>>
>>>>>>>>     		/* Ignore ENOENT only for local hb (userdlm domain) */
>>>>>>>>     		ret = o2nm_depend_item(&reg->hr_item);
>>>>>>>>     		if (!ret) {
>>>>>>>> @@ -2410,9 +2413,14 @@ static int o2hb_region_pin(const char *region_uuid)
>>>>>>>>     			else {
>>>>>>>>     				mlog(ML_ERROR, "Pin region %s fails with %d\n",
>>>>>>>>     				     uuid, ret);
>>>>>>>> +				config_item_put(&reg->hr_item);
>>>>>>>> +				spin_lock(&o2hb_live_lock);
>>>>>>>>     				break;
>>>>>>>>     			}
>>>>>>>>     		}
>>>>>>>> +
>>>>>>>> +		config_item_put(&reg->hr_item);
>>>>>>>> +		spin_lock(&o2hb_live_lock);
>>>>>>>>     skip_pin:
>>>>>>>>     		if (found)
>>>>>>>>     			break;
>>>>>>>> -- 1.9.5.msysgit.1
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>>
>>>
>>>
>>> .
>>>
>>
>>
> 
> 
> .
>

Patch
diff mbox

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index d020604..f1142a9 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -2399,6 +2399,9 @@  static int o2hb_region_pin(const char *region_uuid)
 		if (reg->hr_item_pinned || reg->hr_item_dropped)
 			goto skip_pin;

+		config_item_get(&reg->hr_item);
+		spin_unlock(&o2hb_live_lock);
+
 		/* Ignore ENOENT only for local hb (userdlm domain) */
 		ret = o2nm_depend_item(&reg->hr_item);
 		if (!ret) {
@@ -2410,9 +2413,14 @@  static int o2hb_region_pin(const char *region_uuid)
 			else {
 				mlog(ML_ERROR, "Pin region %s fails with %d\n",
 				     uuid, ret);
+				config_item_put(&reg->hr_item);
+				spin_lock(&o2hb_live_lock);
 				break;
 			}
 		}
+
+		config_item_put(&reg->hr_item);
+		spin_lock(&o2hb_live_lock);
 skip_pin:
 		if (found)
 			break;