Revert "ocfs2: mount shared volume without ha stack"

Message ID	20220603222801.42488-1-junxiao.bi@oracle.com (mailing list archive)
State	New, archived
Headers
To: ocfs2-devel@oss.oracle.com
Date: Fri, 3 Jun 2022 15:28:01 -0700
Message-id: <20220603222801.42488-1-junxiao.bi@oracle.com>
Subject: [Ocfs2-devel] [PATCH] Revert "ocfs2: mount shared volume without ha stack"
From: Junxiao Bi via Ocfs2-devel <ocfs2-devel@oss.oracle.com>
Reply-to: Junxiao Bi <junxiao.bi@oracle.com>
Series	Revert "ocfs2: mount shared volume without ha stack"

Commit Message

Junxiao Bi June 3, 2022, 10:28 p.m. UTC

This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.

This commit introduced a regression that can cause mount to hang.
The change in __ocfs2_find_empty_slot allows any node with a
non-zero node number to grab the slot that was already taken by
node 0. Node 1 then accesses the same journal as node 0, and when it
tries to take the journal cluster lock it hangs, because the lock is
already held by node 0.
This is very easy to reproduce: in one cluster, mount on node 0 first,
then on node 1, and you will see the following call trace on node 1.

[13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
[13148.739691]       Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
[13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13148.745846] task:mount.ocfs2     state:D stack:    0 pid:53045 ppid: 53044 flags:0x00004000
[13148.749354] Call Trace:
[13148.750718]  <TASK>
[13148.752019]  ? usleep_range+0x90/0x89
[13148.753882]  __schedule+0x210/0x567
[13148.755684]  schedule+0x44/0xa8
[13148.757270]  schedule_timeout+0x106/0x13c
[13148.759273]  ? __prepare_to_swait+0x53/0x78
[13148.761218]  __wait_for_common+0xae/0x163
[13148.763144]  __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2]
[13148.765780]  ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.768312]  ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2]
[13148.770968]  ocfs2_journal_init+0x91/0x340 [ocfs2]
[13148.773202]  ocfs2_check_volume+0x39/0x461 [ocfs2]
[13148.775401]  ? iput+0x69/0xba
[13148.777047]  ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2]
[13148.779646]  ocfs2_fill_super+0x54b/0x853 [ocfs2]
[13148.781756]  mount_bdev+0x190/0x1b7
[13148.783443]  ? ocfs2_remount+0x440/0x440 [ocfs2]
[13148.785634]  legacy_get_tree+0x27/0x48
[13148.787466]  vfs_get_tree+0x25/0xd0
[13148.789270]  do_new_mount+0x18c/0x2d9
[13148.791046]  __x64_sys_mount+0x10e/0x142
[13148.792911]  do_syscall_64+0x3b/0x89
[13148.794667]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[13148.797051] RIP: 0033:0x7f2309f6e26e
[13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e
[13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0
[13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820
[13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420
[13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000
[13148.816564]  </TASK>

To fix it, we could just fix __ocfs2_find_empty_slot. But the original
commit introduced a feature to mount ocfs2 locally even when the volume
is cluster based. That is very dangerous and can easily cause serious
data corruption, because there is no way to stop other nodes from
mounting the fs and corrupting it. Setting up ha or another cluster-aware
stack is simply the cost we have to pay to avoid corruption; otherwise
we would have to do it in the kernel.

Fixes: 912f655d78c5 ("ocfs2: mount shared volume without ha stack")
Cc: <stable@vger.kernel.org>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---
 fs/ocfs2/ocfs2.h    |  4 +---
 fs/ocfs2/slot_map.c | 46 +++++++++++++++++++--------------------------
 fs/ocfs2/super.c    | 21 ---------------------
 3 files changed, 20 insertions(+), 51 deletions(-)
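
As a rough illustration of the slot collision described in the commit message, the following userspace C sketch models the slot scan with two interchangeable predicates. The `struct slot` type, `NUM_SLOTS`, `NO_SLOT`, `NO_PREFERENCE`, and all function names are hypothetical simplifications made up for this sketch, not kernel code; only the two conditions on `sl_valid` and `sl_node_num` are taken from the diff. It also assumes node 1 has no slot of its own yet, so it falls through to the empty-slot search.

```c
#include <assert.h>

#define NUM_SLOTS	4	/* hypothetical slot count for this sketch */
#define NO_SLOT		(-1)	/* stands in for -ENOSPC */
#define NO_PREFERENCE	(-1)	/* assumed "no preferred slot" value */

/* Simplified stand-in for one entry of si->si_slots[]. */
struct slot {
	int valid;		/* mirrors sl_valid: slot is in use */
	unsigned int node_num;	/* mirrors sl_node_num: owning node */
};

/* Condition used by the reverted commit: a slot owned by node 0 looks empty. */
int slot_is_free_reverted(const struct slot *s)
{
	return !s->valid || !s->node_num;
}

/* Condition restored by this revert: only the valid flag decides. */
int slot_is_free_restored(const struct slot *s)
{
	return !s->valid;
}

/* Same shape as __ocfs2_find_empty_slot: preferred slot first, then a scan. */
int find_empty_slot(const struct slot *slots, int preferred,
		    int (*is_free)(const struct slot *))
{
	int i;

	if (preferred >= 0 && preferred < NUM_SLOTS &&
	    is_free(&slots[preferred]))
		return preferred;

	for (i = 0; i < NUM_SLOTS; i++)
		if (is_free(&slots[i]))
			return i;

	return NO_SLOT;
}

/* Node 0 mounts first and takes a slot; return the slot node 1 would get. */
int slot_for_second_node(int (*is_free)(const struct slot *))
{
	struct slot slots[NUM_SLOTS] = { { 0, 0 } };
	int first = find_empty_slot(slots, NO_PREFERENCE, is_free);

	assert(first == 0);		/* empty map: node 0 gets slot 0 */
	slots[first].valid = 1;
	slots[first].node_num = 0;	/* owner is node 0 */

	return find_empty_slot(slots, NO_PREFERENCE, is_free);
}
```

In this model, `slot_for_second_node(slot_is_free_reverted)` returns slot 0, the slot node 0 already holds, while `slot_for_second_node(slot_is_free_restored)` returns slot 1.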

Comments

Heming Zhao via Ocfs2-devel June 4, 2022, 8:45 a.m. UTC | #1

Hello Junxiao,

On 6/4/22 06:28, Junxiao Bi via Ocfs2-devel wrote:
> This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
> 
> This commit introduced a regression that can cause mount to hang.
> The change in __ocfs2_find_empty_slot allows any node with a
> non-zero node number to grab the slot that was already taken by
> node 0. Node 1 then accesses the same journal as node 0, and when it
> tries to take the journal cluster lock it hangs, because the lock is
> already held by node 0.
> This is very easy to reproduce: in one cluster, mount on node 0 first,
> then on node 1, and you will see the following call trace on node 1.

From your description, it looks like your env mixed local-mount & clustered-mount.

Could you share your test/reproduction steps?
And which ha stack do you use, pcmk or o2cb?

I failed to reproduce it. My test steps (with the pcmk stack):
```
node1:
mount -t ocfs2 /dev/vdd /mnt

node2:
for i in {1..100}; do
  echo "mount <$i>"; mount -t ocfs2 /dev/vdd /mnt;
  sleep 3;
  echo "umount"; umount /mnt;
done
```

This local mount feature helps SUSE customers maintain ocfs2 partitions; it's useful.
I want to find out whether there is an ideal way to fix the hang issue.

> 
> To fix it, we could just fix __ocfs2_find_empty_slot. But the original
> commit introduced a feature to mount ocfs2 locally even when the volume
> is cluster based. That is very dangerous and can easily cause serious
> data corruption, because there is no way to stop other nodes from
> mounting the fs and corrupting it.

I can't follow your meaning. When users want to use the local mount feature, they MUST know
what they are doing and how to use it.

The mount.ocfs2(8) man page also says to mount the fs on *only one* node at a time,
and tells the user that the fs will be damaged by wrong usage.

```
nocluster

   This option allows users to mount a clustered volume without configuring the cluster
   stack. However, you must be aware that you can only mount the file system from one node
   at the same time, otherwise, the file system may be damaged. Please use it with caution.
```

> Setting up ha or another cluster-aware stack is simply the cost we have to
> pay to avoid corruption; otherwise we would have to do it in the kernel.

It's a bit drastic to revert this commit entirely just because a sanity check is
missing. If you or the maintainer think the local mount should do more to prevent
mixing local-mount and clustered-mount, we could add more sanity checks during
local mounting.

Thanks,
Heming

Junxiao Bi June 4, 2022, 4:19 p.m. UTC | #2

> On June 4, 2022, at 1:45 AM, heming.zhao@suse.com wrote:
> 
> Hello Junxiao,
> 
>> On 6/4/22 06:28, Junxiao Bi via Ocfs2-devel wrote:
>> This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
>> This commit introduced a regression that can cause mount to hang.
>> The change in __ocfs2_find_empty_slot allows any node with a
>> non-zero node number to grab the slot that was already taken by
>> node 0. Node 1 then accesses the same journal as node 0, and when it
>> tries to take the journal cluster lock it hangs, because the lock is
>> already held by node 0.
>> This is very easy to reproduce: in one cluster, mount on node 0 first,
>> then on node 1, and you will see the following call trace on node 1.
> 
> From your description, it looks like your env mixed local-mount & clustered-mount.
No, only cluster mount.
> 
> Could you share your test/reproduction steps?
> And which ha stack do you use, pcmk or o2cb?
> 
> I failed to reproduce it. My test steps (with the pcmk stack):
> ```
> node1:
> mount -t ocfs2 /dev/vdd /mnt
> 
> node2:
> for i in {1..100}; do
> echo "mount <$i>"; mount -t ocfs2 /dev/vdd /mnt;
> sleep 3;
> echo "umount"; umount /mnt;
> done
> ```
> 
Try setting one node's node number to 0 and mount on that node first. I used the o2cb stack.
> This local mount feature helps SUSE customers maintain ocfs2 partitions; it's useful.
> I want to find out whether there is an ideal way to fix the hang issue.
> 
>> To fix it, we could just fix __ocfs2_find_empty_slot. But the original
>> commit introduced a feature to mount ocfs2 locally even when the volume
>> is cluster based. That is very dangerous and can easily cause serious
>> data corruption, because there is no way to stop other nodes from
>> mounting the fs and corrupting it.
> 
> I can't follow your meaning. When users want to use the local mount feature, they MUST know
> what they are doing and how to use it.
I can’t agree with you. There is no mechanism to make sure customers will follow that; you can’t expect customers to understand the technology well or even read the docs.
It’s not as if you have no choice: setting up a cluster stack is the way to stop customers from doing something bad. I believe you have to educate customers that this is the cost of protecting their data; otherwise, when something bad happens, they will lose important data, perhaps with no way to recover it.
> 
> The mount.ocfs2(8) man page also says to mount the fs on *only one* node at a time,
> and tells the user that the fs will be damaged by wrong usage.
> 
> ```
> nocluster
> 
>  This option allows users to mount a clustered volume without configuring the cluster
>  stack. However, you must be aware that you can only mount the file system from one node
>  at the same time, otherwise, the file system may be damaged. Please use it with caution.
> ```
> 
>> Setting up ha or another cluster-aware stack is simply the cost we have to
>> pay to avoid corruption; otherwise we would have to do it in the kernel.
> 
> It's a bit drastic to revert this commit entirely just because a sanity check is
> missing. If you or the maintainer think the local mount should do more to prevent
> mixing local-mount and clustered-mount, we could add more sanity checks during
> local mounting.
I don’t think this should be done in the kernel. Setting up a cluster stack is the way forward.

Thanks,
Junxiao
> 
> Thanks,
> Heming
>

Joseph Qi June 25, 2022, 1:30 p.m. UTC | #3

I missed the original mail in my mailbox, so I am replying here.

As discussed in this and another long thread, this feature is incomplete
and we don't have a better fix as of now. It has also caused a regression
with the default o2cb stack, so revert it first as a quick fix.

We can revisit this feature in the future once it is mature.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 337527571461..740b64238312 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -277,7 +277,6 @@  enum ocfs2_mount_options
 	OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15,  /* Journal Async Commit */
 	OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
 	OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
-	OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
 };
 
 #define OCFS2_OSB_SOFT_RO	0x0001
@@ -673,8 +672,7 @@  static inline int ocfs2_cluster_o2cb_global_heartbeat(struct ocfs2_super *osb)
 
 static inline int ocfs2_mount_local(struct ocfs2_super *osb)
 {
-	return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
-		|| (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
+	return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
 }
 
 static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index 0b0ae3ebb0cf..da7718cef735 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -252,16 +252,14 @@  static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	int i, ret = -ENOSPC;
 
 	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-		if (!si->si_slots[preferred].sl_valid ||
-		    !si->si_slots[preferred].sl_node_num) {
+		if (!si->si_slots[preferred].sl_valid) {
 			ret = preferred;
 			goto out;
 		}
 	}
 
 	for(i = 0; i < si->si_num_slots; i++) {
-		if (!si->si_slots[i].sl_valid ||
-		    !si->si_slots[i].sl_node_num) {
+		if (!si->si_slots[i].sl_valid) {
 			ret = i;
 			break;
 		}
@@ -456,30 +454,24 @@  int ocfs2_find_slot(struct ocfs2_super *osb)
 	spin_lock(&osb->osb_lock);
 	ocfs2_update_slot_info(si);
 
-	if (ocfs2_mount_local(osb))
-		/* use slot 0 directly in local mode */
-		slot = 0;
-	else {
-		/* search for ourselves first and take the slot if it already
-		 * exists. Perhaps we need to mark this in a variable for our
-		 * own journal recovery? Possibly not, though we certainly
-		 * need to warn to the user */
-		slot = __ocfs2_node_num_to_slot(si, osb->node_num);
+	/* search for ourselves first and take the slot if it already
+	 * exists. Perhaps we need to mark this in a variable for our
+	 * own journal recovery? Possibly not, though we certainly
+	 * need to warn to the user */
+	slot = __ocfs2_node_num_to_slot(si, osb->node_num);
+	if (slot < 0) {
+		/* if no slot yet, then just take 1st available
+		 * one. */
+		slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
 		if (slot < 0) {
-			/* if no slot yet, then just take 1st available
-			 * one. */
-			slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
-			if (slot < 0) {
-				spin_unlock(&osb->osb_lock);
-				mlog(ML_ERROR, "no free slots available!\n");
-				status = -EINVAL;
-				goto bail;
-			}
-		} else
-			printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
-			       "already allocated to this node!\n",
-			       slot, osb->dev_str);
-	}
+			spin_unlock(&osb->osb_lock);
+			mlog(ML_ERROR, "no free slots available!\n");
+			status = -EINVAL;
+			goto bail;
+		}
+	} else
+		printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
+		       "allocated to this node!\n", slot, osb->dev_str);
 
 	ocfs2_set_slot(si, slot, osb->node_num);
 	osb->slot_num = slot;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index f7298816d8d9..438be028935d 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -172,7 +172,6 @@  enum {
 	Opt_dir_resv_level,
 	Opt_journal_async_commit,
 	Opt_err_cont,
-	Opt_nocluster,
 	Opt_err,
 };
 
@@ -206,7 +205,6 @@  static const match_table_t tokens = {
 	{Opt_dir_resv_level, "dir_resv_level=%u"},
 	{Opt_journal_async_commit, "journal_async_commit"},
 	{Opt_err_cont, "errors=continue"},
-	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL}
 };
 
@@ -618,13 +616,6 @@  static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 		goto out;
 	}
 
-	tmp = OCFS2_MOUNT_NOCLUSTER;
-	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
-		ret = -EINVAL;
-		mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
-		goto out;
-	}
-
 	tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
 		OCFS2_MOUNT_HB_NONE;
 	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
@@ -865,7 +856,6 @@  static int ocfs2_verify_userspace_stack(struct ocfs2_super *osb,
 	}
 
 	if (ocfs2_userspace_stack(osb) &&
-	    !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
 	    strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
 		    OCFS2_STACK_LABEL_LEN)) {
 		mlog(ML_ERROR,
@@ -1137,11 +1127,6 @@  static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
 	       osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" :
 	       "ordered");
 
-	if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
-	   !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
-		printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
-		       "without cluster aware mode.\n", osb->dev_str);
-
 	atomic_set(&osb->vol_state, VOLUME_MOUNTED);
 	wake_up(&osb->osb_mount_event);
 
@@ -1452,9 +1437,6 @@  static int ocfs2_parse_options(struct super_block *sb,
 		case Opt_journal_async_commit:
 			mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
 			break;
-		case Opt_nocluster:
-			mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
-			break;
 		default:
 			mlog(ML_ERROR,
 			     "Unrecognized mount option \"%s\" "
@@ -1566,9 +1548,6 @@  static int ocfs2_show_options(struct seq_file *s, struct dentry *root)
 	if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
 		seq_printf(s, ",journal_async_commit");
 
-	if (opts & OCFS2_MOUNT_NOCLUSTER)
-		seq_printf(s, ",nocluster");
-
 	return 0;
 }
