
ceph: force updating the msg pointer in non-split case

Message ID 20230517052404.99904-1-xiubli@redhat.com (mailing list archive)
State New, archived
Series ceph: force updating the msg pointer in non-split case

Commit Message

Xiubo Li May 17, 2023, 5:24 a.m. UTC
From: Xiubo Li <xiubli@redhat.com>

When the MClientSnap request's op is not CEPH_SNAP_OP_SPLIT, the
request may still contain a list of 'split_realms', and we need
to skip it anyway. Otherwise it will be parsed as a corrupt snaptrace.

Cc: stable@vger.kernel.org
Cc: Frank Schilder <frans@dtu.dk>
Reported-by: Frank Schilder <frans@dtu.dk>
URL: https://tracker.ceph.com/issues/61200
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/snap.c | 3 +++
 1 file changed, 3 insertions(+)
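
For context, the MClientSnap front payload that ceph_handle_snap() decodes is laid out roughly as sketched below (field names follow struct ceph_mds_snap_head from include/linux/ceph/ceph_fs.h; treat this as an illustrative sketch rather than the verbatim kernel definition). The two u64 arrays sit between the fixed header and the snap trace, which is why they have to be skipped even when the op is not a split:

/* Rough layout of the MClientSnap front payload (illustrative sketch). */
struct ceph_mds_snap_head {
	__le32 op;               /* CEPH_SNAP_OP_SPLIT, _UPDATE, ... */
	__le64 split;            /* ino of the inode being split off, if any */
	__le32 num_split_inos;   /* number of entries in split_inos[] */
	__le32 num_split_realms; /* number of entries in split_realms[] */
	__le32 trace_len;        /* size of the snap trace blob that follows */
} __attribute__ ((packed));
/*
 * Followed in the same buffer by:
 *   __le64 split_inos[num_split_inos];
 *   __le64 split_realms[num_split_realms];
 *   u8     snap_trace[trace_len];
 *
 * Before this patch the client only stepped over the two arrays in the
 * CEPH_SNAP_OP_SPLIT case, so a non-split message that still carried
 * them had the arrays decoded as the start of the snap trace.
 */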

Comments

Ilya Dryomov May 17, 2023, 10:31 a.m. UTC | #1
On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>
> From: Xiubo Li <xiubli@redhat.com>
>
> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> request may still contain a list of 'split_realms', and we need
> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>
> Cc: stable@vger.kernel.org
> Cc: Frank Schilder <frans@dtu.dk>
> Reported-by: Frank Schilder <frans@dtu.dk>
> URL: https://tracker.ceph.com/issues/61200
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/snap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> index 0e59e95a96d9..d95dfe16b624 100644
> --- a/fs/ceph/snap.c
> +++ b/fs/ceph/snap.c
> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>                                 continue;
>                         adjust_snap_realm_parent(mdsc, child, realm->ino);
>                 }
> +       } else {
> +               p += sizeof(u64) * num_split_inos;
> +               p += sizeof(u64) * num_split_realms;
>         }
>
>         /*
> --
> 2.40.1
>

Hi Xiubo,

This code appears to be very old -- it goes back to the initial commit
963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
explanation for why this popped up only now?

Has MDS always been including split_inos and split_realms arrays in
!CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
change, I'd argue that this needs to be addressed on the MDS side.

Thanks,

                Ilya
Xiubo Li May 17, 2023, 11:04 a.m. UTC | #2
On 5/17/23 18:31, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>> request may still contain a list of 'split_realms', and we need
>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>
>> Cc: stable@vger.kernel.org
>> Cc: Frank Schilder <frans@dtu.dk>
>> Reported-by: Frank Schilder <frans@dtu.dk>
>> URL: https://tracker.ceph.com/issues/61200
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/snap.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>> index 0e59e95a96d9..d95dfe16b624 100644
>> --- a/fs/ceph/snap.c
>> +++ b/fs/ceph/snap.c
>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>                                  continue;
>>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
>>                  }
>> +       } else {
>> +               p += sizeof(u64) * num_split_inos;
>> +               p += sizeof(u64) * num_split_realms;
>>          }
>>
>>          /*
>> --
>> 2.40.1
>>
> Hi Xiubo,
>
> This code appears to be very old -- it goes back to the initial commit
> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> explanation for why this popped up only now?

As I remember, we hit this before in one customer BZ last year, but I 
couldn't recall exactly which one.  And I am not sure whether @Jeff 
saw this before I joined the ceph team.


> Has MDS always been including split_inos and split_realms arrays in
> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> change, I'd argue that this needs to be addressed on the MDS side.

On the MDS side, the _UPDATE op did not include the 'split_realm' 
list until this commit in 2017:

commit 93e7267757508520dfc22cff1ab20558bd4a44d4
Author: Yan, Zheng <zyan@redhat.com>
Date:   Fri Jul 21 21:40:46 2017 +0800

     mds: send snap related messages centrally during mds recovery

     sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
     clients centrally in MDCache::open_snaprealms()

     Signed-off-by: "Yan, Zheng" <zyan@redhat.com>

Before this commit the MDS would only send the 'split_realm' list for 
the _SPLIT op.


The following is the snaptrace:

[Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00  ................
[Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93                                   ...`.
[Wed May 10 16:03:06 2023]  front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................ <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE'
[Wed May 10 16:03:06 2023]  front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00  ..........q8....           <<== The '0c' is the split_realm number
[Wed May 10 16:03:06 2023]  front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00  ".q8......q8....       <<== All the 'q8' are the ino#
[Wed May 10 16:03:06 2023]  front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00  ..q8......q8....
[Wed May 10 16:03:06 2023]  front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00  ..q8......q8....
[Wed May 10 16:03:06 2023]  front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00   .q8......q8....
[Wed May 10 16:03:06 2023]  front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00  ..q8......q8....
[Wed May 10 16:03:06 2023]  front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00  ..q8....j.-.....
[Wed May 10 16:03:06 2023]  front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023]  front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00  ................
[Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61  .9..........B.ca
[Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05                                   {K]-.


And if, by coincidence, the split_realm number equals sizeof(ceph_mds_snap_realm) 
plus the extra snap buffer size, the above 'corrupted' snaptrace will still be 
parsed by the kclient without any warning, but it will corrupt the snaprealm 
and capsnap info in the kclient.
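
To make the offset arithmetic concrete with the dump above (assuming the usual op / split / num_split_inos / num_split_realms / trace_len header layout, which matches the annotations): the fixed header is 24 bytes, the 12 stray split_realm entries take 12 * 8 = 96 bytes, and trace_len is 0x88 = 136 bytes, which together account for the 256 bytes of front data shown. A client that does not skip those 96 bytes starts decoding the realm inos as if they were the snap trace. A small stand-alone sketch of that arithmetic (the byte values are copied from the dump; everything else is illustrative):

#include <stdint.h>
#include <stdio.h>

/* First 24 bytes of the dumped 'front' above. */
static const uint8_t front_head[24] = {
	0x00, 0x00, 0x00, 0x00,                         /* op = 0 (UPDATE) */
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* split ino = 0 */
	0x00, 0x00, 0x00, 0x00,                         /* num_split_inos = 0 */
	0x0c, 0x00, 0x00, 0x00,                         /* num_split_realms = 12 */
	0x88, 0x00, 0x00, 0x00,                         /* trace_len = 136 */
};

static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(void)
{
	uint32_t num_split_inos   = le32(front_head + 12);
	uint32_t num_split_realms = le32(front_head + 16);
	uint32_t trace_len        = le32(front_head + 20);
	size_t skip = 8 * ((size_t)num_split_inos + num_split_realms);

	printf("op=%u split_inos=%u split_realms=%u trace_len=%u\n",
	       le32(front_head), num_split_inos, num_split_realms, trace_len);
	/* 24 + 96 + 136 = 256, the full front; skipping 'skip' bytes in the
	 * non-split case is exactly what the patch adds. */
	printf("header(24) + arrays(%zu) + trace(%u) = %zu bytes\n",
	       skip, trace_len, 24 + skip + trace_len);
	return 0;
}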


Thanks

- Xiubo


> Thanks,
>
>                  Ilya
>
Xiubo Li May 17, 2023, 11:14 a.m. UTC | #3
On 5/17/23 19:04, Xiubo Li wrote:
>
> On 5/17/23 18:31, Ilya Dryomov wrote:
>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>> request may still contain a list of 'split_realms', and we need
>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>
>>> Cc: stable@vger.kernel.org
>>> Cc: Frank Schilder <frans@dtu.dk>
>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>> URL: https://tracker.ceph.com/issues/61200
>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>> ---
>>>   fs/ceph/snap.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>> index 0e59e95a96d9..d95dfe16b624 100644
>>> --- a/fs/ceph/snap.c
>>> +++ b/fs/ceph/snap.c
>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client 
>>> *mdsc,
>>>                                  continue;
>>>                          adjust_snap_realm_parent(mdsc, child, 
>>> realm->ino);
>>>                  }
>>> +       } else {
>>> +               p += sizeof(u64) * num_split_inos;
>>> +               p += sizeof(u64) * num_split_realms;
>>>          }
>>>
>>>          /*
>>> -- 
>>> 2.40.1
>>>
>> Hi Xiubo,
>>
>> This code appears to be very old -- it goes back to the initial commit
>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>> explanation for why this popped up only now?
>
> As I remembered we hit this before in one cu BZ last year, but I 
> couldn't remember exactly which one.  But I am not sure whether @Jeff 
> saw this before I joint ceph team.
>
@Venky,

Do you remember which one?  As I recall, this is why we previously fixed the 
snaptrace issue by blocking all the IOs and blocklisting the kclient at the 
same time.

Before that the kclient wouldn't dump the corrupted msg, so we didn't know 
what was wrong with it; we also added code to dump the msg as part of that fix.

Thanks

- Xiubo

>
>> Has MDS always been including split_inos and split_realms arrays in
>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>> change, I'd argue that this needs to be addressed on the MDS side.
>
> While in MDS side for the _UPDATE op it won't send the 'split_realm' 
> list just before the commit in 2017:
>
> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> Author: Yan, Zheng <zyan@redhat.com>
> Date:   Fri Jul 21 21:40:46 2017 +0800
>
>     mds: send snap related messages centrally during mds recovery
>
>     sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>     clients centrally in MDCache::open_snaprealms()
>
>     Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>
> Before this commit it will only send the 'split_realm' list for the 
> _SPLIT op.
>
>
> The following the snaptrace:
>
> [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 
> 00 00 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 
> 00 00 00 00 00 01 00 00  ................
> [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 
> 93                                   ...`.
> [Wed May 10 16:03:06 2023]  front: 00000000: 00 00 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  ................ <<== The op is 0, which is 
> 'CEPH_SNAP_OP_UPDATE'
> [Wed May 10 16:03:06 2023]  front: 00000010: 0c 00 00 00 88 00 00 00 
> d1 c0 71 38 00 01 00 00  ..........q8....           <<== The '0c' is 
> the split_realm number
> [Wed May 10 16:03:06 2023]  front: 00000020: 22 c8 71 38 00 01 00 00 
> d7 c7 71 38 00 01 00 00  ".q8......q8....       <<== All the 'q8' are 
> the ino#
> [Wed May 10 16:03:06 2023]  front: 00000030: d9 c7 71 38 00 01 00 00 
> d4 c7 71 38 00 01 00 00  ..q8......q8....
> [Wed May 10 16:03:06 2023]  front: 00000040: f1 c0 71 38 00 01 00 00 
> d4 c0 71 38 00 01 00 00  ..q8......q8....
> [Wed May 10 16:03:06 2023]  front: 00000050: 20 c8 71 38 00 01 00 00 
> 1d c8 71 38 00 01 00 00   .q8......q8....
> [Wed May 10 16:03:06 2023]  front: 00000060: ec c0 71 38 00 01 00 00 
> d6 c0 71 38 00 01 00 00  ..q8......q8....
> [Wed May 10 16:03:06 2023]  front: 00000070: ef c0 71 38 00 01 00 00 
> 6a 11 2d 1a 00 01 00 00  ..q8....j.-.....
> [Wed May 10 16:03:06 2023]  front: 00000080: 01 00 00 00 00 00 00 00 
> 01 00 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 00000090: ee 01 00 00 00 00 00 00 
> 01 00 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 000000a0: 00 00 00 00 00 00 00 00 
> 01 00 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 000000b0: 01 09 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 000000c0: 01 00 00 00 00 00 00 00 
> 02 09 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 000000d0: 05 00 00 00 00 00 00 00 
> 01 09 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 000000e0: ff 08 00 00 00 00 00 00 
> fd 08 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023]  front: 000000f0: fb 08 00 00 00 00 00 00 
> f9 08 00 00 00 00 00 00  ................
> [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 
> 00 00 00 00 42 06 63 61  .9..........B.ca
> [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 
> 05                                   {K]-.
>
>
> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + 
> extra snap buffer size by coincidence, the above 'corrupted' snaptrace 
> will be parsed by kclient too and kclient won't give any warning, but 
> it will corrupted the snaprealm and capsnap info in kclient.
>
>
> Thanks
>
> - Xiubo
>
>
>> Thanks,
>>
>>                  Ilya
>>
Ilya Dryomov May 17, 2023, 11:29 a.m. UTC | #4
On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 5/17/23 18:31, Ilya Dryomov wrote:
> > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> >> From: Xiubo Li <xiubli@redhat.com>
> >>
> >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> >> request may still contain a list of 'split_realms', and we need
> >> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> >>
> >> Cc: stable@vger.kernel.org
> >> Cc: Frank Schilder <frans@dtu.dk>
> >> Reported-by: Frank Schilder <frans@dtu.dk>
> >> URL: https://tracker.ceph.com/issues/61200
> >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >> ---
> >>   fs/ceph/snap.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >> index 0e59e95a96d9..d95dfe16b624 100644
> >> --- a/fs/ceph/snap.c
> >> +++ b/fs/ceph/snap.c
> >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> >>                                  continue;
> >>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
> >>                  }
> >> +       } else {
> >> +               p += sizeof(u64) * num_split_inos;
> >> +               p += sizeof(u64) * num_split_realms;
> >>          }
> >>
> >>          /*
> >> --
> >> 2.40.1
> >>
> > Hi Xiubo,
> >
> > This code appears to be very old -- it goes back to the initial commit
> > 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > explanation for why this popped up only now?
>
> As I remembered we hit this before in one cu BZ last year, but I
> couldn't remember exactly which one.  But I am not sure whether @Jeff
> saw this before I joint ceph team.
>
>
> > Has MDS always been including split_inos and split_realms arrays in
> > !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > change, I'd argue that this needs to be addressed on the MDS side.
>
> While in MDS side for the _UPDATE op it won't send the 'split_realm'
> list just before the commit in 2017:
>
> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> Author: Yan, Zheng <zyan@redhat.com>
> Date:   Fri Jul 21 21:40:46 2017 +0800
>
>      mds: send snap related messages centrally during mds recovery
>
>      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>      clients centrally in MDCache::open_snaprealms()
>
>      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>
> Before this commit it will only send the 'split_realm' list for the
> _SPLIT op.

It sounds like we have the culprit.  This should be treated as
a regression and fixed on the MDS side.  I don't see a justification
for putting useless data on the wire.

Thanks,

                Ilya
Xiubo Li May 17, 2023, 12:46 p.m. UTC | #5
On 5/17/23 19:29, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>> request may still contain a list of 'split_realms', and we need
>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>> URL: https://tracker.ceph.com/issues/61200
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>    fs/ceph/snap.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>> --- a/fs/ceph/snap.c
>>>> +++ b/fs/ceph/snap.c
>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>                                   continue;
>>>>                           adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>                   }
>>>> +       } else {
>>>> +               p += sizeof(u64) * num_split_inos;
>>>> +               p += sizeof(u64) * num_split_realms;
>>>>           }
>>>>
>>>>           /*
>>>> --
>>>> 2.40.1
>>>>
>>> Hi Xiubo,
>>>
>>> This code appears to be very old -- it goes back to the initial commit
>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>> explanation for why this popped up only now?
>> As I remembered we hit this before in one cu BZ last year, but I
>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>> saw this before I joint ceph team.
>>
>>
>>> Has MDS always been including split_inos and split_realms arrays in
>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>> change, I'd argue that this needs to be addressed on the MDS side.
>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>> list just before the commit in 2017:
>>
>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>> Author: Yan, Zheng <zyan@redhat.com>
>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>
>>       mds: send snap related messages centrally during mds recovery
>>
>>       sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>       clients centrally in MDCache::open_snaprealms()
>>
>>       Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>
>> Before this commit it will only send the 'split_realm' list for the
>> _SPLIT op.
> It sounds like we have the culprit.  This should be treated as
> a regression and fixed on the MDS side.  I don't see a justification
> for putting useless data on the wire.

But we still need this patch to make it work with the old ceph releases.

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
Venky Shankar May 17, 2023, 12:48 p.m. UTC | #6
Hey Xiubo,

On Wed, May 17, 2023 at 4:45 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 5/17/23 19:04, Xiubo Li wrote:
> >
> > On 5/17/23 18:31, Ilya Dryomov wrote:
> >> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> >>> From: Xiubo Li <xiubli@redhat.com>
> >>>
> >>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> >>> request may still contain a list of 'split_realms', and we need
> >>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> >>>
> >>> Cc: stable@vger.kernel.org
> >>> Cc: Frank Schilder <frans@dtu.dk>
> >>> Reported-by: Frank Schilder <frans@dtu.dk>
> >>> URL: https://tracker.ceph.com/issues/61200
> >>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >>> ---
> >>>   fs/ceph/snap.c | 3 +++
> >>>   1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>> index 0e59e95a96d9..d95dfe16b624 100644
> >>> --- a/fs/ceph/snap.c
> >>> +++ b/fs/ceph/snap.c
> >>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client
> >>> *mdsc,
> >>>                                  continue;
> >>>                          adjust_snap_realm_parent(mdsc, child,
> >>> realm->ino);
> >>>                  }
> >>> +       } else {
> >>> +               p += sizeof(u64) * num_split_inos;
> >>> +               p += sizeof(u64) * num_split_realms;
> >>>          }
> >>>
> >>>          /*
> >>> --
> >>> 2.40.1
> >>>
> >> Hi Xiubo,
> >>
> >> This code appears to be very old -- it goes back to the initial commit
> >> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> >> explanation for why this popped up only now?
> >
> > As I remembered we hit this before in one cu BZ last year, but I
> > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > saw this before I joint ceph team.
> >
> @Venky,
>
> Do you remember which one ? As I remembered this is why we fixed the
> snaptrace issue by blocking all the IOs and at the same time
> blocklisting the kclient before.
>
> Before the kcleint won't dump the corrupted msg and we don't know what
> was wrong with the msg and also we added code to dump the msg in the
> above fix.

The "corrupted" snaptrace issue happened just after the mds asserted on 
hitting the metadata corruption (dentry 'first' corruption), and it 
_seemed_ that this corruption somehow triggered a corrupted snaptrace 
being sent to the client.

>
> Thanks
>
> - Xiubo
>
> >
> >> Has MDS always been including split_inos and split_realms arrays in
> >> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> >> change, I'd argue that this needs to be addressed on the MDS side.
> >
> > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > list just before the commit in 2017:
> >
> > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > Author: Yan, Zheng <zyan@redhat.com>
> > Date:   Fri Jul 21 21:40:46 2017 +0800
> >
> >     mds: send snap related messages centrally during mds recovery
> >
> >     sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> >     clients centrally in MDCache::open_snaprealms()
> >
> >     Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> >
> > Before this commit it will only send the 'split_realm' list for the
> > _SPLIT op.
> >
> >
> > The following the snaptrace:
> >
> > [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01
> > 00 00 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00
> > 00 00 00 00 00 01 00 00  ................
> > [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60
> > 93                                   ...`.
> > [Wed May 10 16:03:06 2023]  front: 00000000: 00 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00  ................ <<== The op is 0, which is
> > 'CEPH_SNAP_OP_UPDATE'
> > [Wed May 10 16:03:06 2023]  front: 00000010: 0c 00 00 00 88 00 00 00
> > d1 c0 71 38 00 01 00 00  ..........q8....           <<== The '0c' is
> > the split_realm number
> > [Wed May 10 16:03:06 2023]  front: 00000020: 22 c8 71 38 00 01 00 00
> > d7 c7 71 38 00 01 00 00  ".q8......q8....       <<== All the 'q8' are
> > the ino#
> > [Wed May 10 16:03:06 2023]  front: 00000030: d9 c7 71 38 00 01 00 00
> > d4 c7 71 38 00 01 00 00  ..q8......q8....
> > [Wed May 10 16:03:06 2023]  front: 00000040: f1 c0 71 38 00 01 00 00
> > d4 c0 71 38 00 01 00 00  ..q8......q8....
> > [Wed May 10 16:03:06 2023]  front: 00000050: 20 c8 71 38 00 01 00 00
> > 1d c8 71 38 00 01 00 00   .q8......q8....
> > [Wed May 10 16:03:06 2023]  front: 00000060: ec c0 71 38 00 01 00 00
> > d6 c0 71 38 00 01 00 00  ..q8......q8....
> > [Wed May 10 16:03:06 2023]  front: 00000070: ef c0 71 38 00 01 00 00
> > 6a 11 2d 1a 00 01 00 00  ..q8....j.-.....
> > [Wed May 10 16:03:06 2023]  front: 00000080: 01 00 00 00 00 00 00 00
> > 01 00 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 00000090: ee 01 00 00 00 00 00 00
> > 01 00 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 000000a0: 00 00 00 00 00 00 00 00
> > 01 00 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 000000b0: 01 09 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 000000c0: 01 00 00 00 00 00 00 00
> > 02 09 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 000000d0: 05 00 00 00 00 00 00 00
> > 01 09 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 000000e0: ff 08 00 00 00 00 00 00
> > fd 08 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023]  front: 000000f0: fb 08 00 00 00 00 00 00
> > f9 08 00 00 00 00 00 00  ................
> > [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00
> > 00 00 00 00 42 06 63 61  .9..........B.ca
> > [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d
> > 05                                   {K]-.
> >
> >
> > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> > extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> > will be parsed by kclient too and kclient won't give any warning, but
> > it will corrupted the snaprealm and capsnap info in kclient.
> >
> >
> > Thanks
> >
> > - Xiubo
> >
> >
> >> Thanks,
> >>
> >>                  Ilya
> >>
>
Venky Shankar May 17, 2023, 12:53 p.m. UTC | #7
On Wed, May 17, 2023 at 6:18 PM Venky Shankar <vshankar@redhat.com> wrote:
>
> Hey Xiubo,
>
> On Wed, May 17, 2023 at 4:45 PM Xiubo Li <xiubli@redhat.com> wrote:
> >
> >
> > On 5/17/23 19:04, Xiubo Li wrote:
> > >
> > > On 5/17/23 18:31, Ilya Dryomov wrote:
> > >> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> > >>> From: Xiubo Li <xiubli@redhat.com>
> > >>>
> > >>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> > >>> request may still contain a list of 'split_realms', and we need
> > >>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> > >>>
> > >>> Cc: stable@vger.kernel.org
> > >>> Cc: Frank Schilder <frans@dtu.dk>
> > >>> Reported-by: Frank Schilder <frans@dtu.dk>
> > >>> URL: https://tracker.ceph.com/issues/61200
> > >>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > >>> ---
> > >>>   fs/ceph/snap.c | 3 +++
> > >>>   1 file changed, 3 insertions(+)
> > >>>
> > >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > >>> index 0e59e95a96d9..d95dfe16b624 100644
> > >>> --- a/fs/ceph/snap.c
> > >>> +++ b/fs/ceph/snap.c
> > >>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client
> > >>> *mdsc,
> > >>>                                  continue;
> > >>>                          adjust_snap_realm_parent(mdsc, child,
> > >>> realm->ino);
> > >>>                  }
> > >>> +       } else {
> > >>> +               p += sizeof(u64) * num_split_inos;
> > >>> +               p += sizeof(u64) * num_split_realms;
> > >>>          }
> > >>>
> > >>>          /*
> > >>> --
> > >>> 2.40.1
> > >>>
> > >> Hi Xiubo,
> > >>
> > >> This code appears to be very old -- it goes back to the initial commit
> > >> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > >> explanation for why this popped up only now?
> > >
> > > As I remembered we hit this before in one cu BZ last year, but I
> > > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > > saw this before I joint ceph team.
> > >
> > @Venky,
> >
> > Do you remember which one ? As I remembered this is why we fixed the
> > snaptrace issue by blocking all the IOs and at the same time
> > blocklisting the kclient before.
> >
> > Before the kcleint won't dump the corrupted msg and we don't know what
> > was wrong with the msg and also we added code to dump the msg in the
> > above fix.
>
> The "corrupted" snaptrace issue happened just after the mds asserted
> hitting the metadata corruption (dentry first corrupted) and it
> _seemed_ that this corruption somehow triggered a corrupted snaptrace
> to be sent to the client.

[sent message a bit early - contd...]

But I think you found the issue - the message dump did help and it's 
not related to the dentry first corruption.

>
> >
> > Thanks
> >
> > - Xiubo
> >
> > >
> > >> Has MDS always been including split_inos and split_realms arrays in
> > >> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > >> change, I'd argue that this needs to be addressed on the MDS side.
> > >
> > > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > > list just before the commit in 2017:
> > >
> > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > > Author: Yan, Zheng <zyan@redhat.com>
> > > Date:   Fri Jul 21 21:40:46 2017 +0800
> > >
> > >     mds: send snap related messages centrally during mds recovery
> > >
> > >     sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> > >     clients centrally in MDCache::open_snaprealms()
> > >
> > >     Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > >
> > > Before this commit it will only send the 'split_realm' list for the
> > > _SPLIT op.
> > >
> > >
> > > The following the snaptrace:
> > >
> > > [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00
> > > 00 00 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01
> > > 00 00 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00
> > > 00 00 00 00 00 01 00 00  ................
> > > [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60
> > > 93                                   ...`.
> > > [Wed May 10 16:03:06 2023]  front: 00000000: 00 00 00 00 00 00 00 00
> > > 00 00 00 00 00 00 00 00  ................ <<== The op is 0, which is
> > > 'CEPH_SNAP_OP_UPDATE'
> > > [Wed May 10 16:03:06 2023]  front: 00000010: 0c 00 00 00 88 00 00 00
> > > d1 c0 71 38 00 01 00 00  ..........q8....           <<== The '0c' is
> > > the split_realm number
> > > [Wed May 10 16:03:06 2023]  front: 00000020: 22 c8 71 38 00 01 00 00
> > > d7 c7 71 38 00 01 00 00  ".q8......q8....       <<== All the 'q8' are
> > > the ino#
> > > [Wed May 10 16:03:06 2023]  front: 00000030: d9 c7 71 38 00 01 00 00
> > > d4 c7 71 38 00 01 00 00  ..q8......q8....
> > > [Wed May 10 16:03:06 2023]  front: 00000040: f1 c0 71 38 00 01 00 00
> > > d4 c0 71 38 00 01 00 00  ..q8......q8....
> > > [Wed May 10 16:03:06 2023]  front: 00000050: 20 c8 71 38 00 01 00 00
> > > 1d c8 71 38 00 01 00 00   .q8......q8....
> > > [Wed May 10 16:03:06 2023]  front: 00000060: ec c0 71 38 00 01 00 00
> > > d6 c0 71 38 00 01 00 00  ..q8......q8....
> > > [Wed May 10 16:03:06 2023]  front: 00000070: ef c0 71 38 00 01 00 00
> > > 6a 11 2d 1a 00 01 00 00  ..q8....j.-.....
> > > [Wed May 10 16:03:06 2023]  front: 00000080: 01 00 00 00 00 00 00 00
> > > 01 00 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 00000090: ee 01 00 00 00 00 00 00
> > > 01 00 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 000000a0: 00 00 00 00 00 00 00 00
> > > 01 00 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 000000b0: 01 09 00 00 00 00 00 00
> > > 00 00 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 000000c0: 01 00 00 00 00 00 00 00
> > > 02 09 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 000000d0: 05 00 00 00 00 00 00 00
> > > 01 09 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 000000e0: ff 08 00 00 00 00 00 00
> > > fd 08 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023]  front: 000000f0: fb 08 00 00 00 00 00 00
> > > f9 08 00 00 00 00 00 00  ................
> > > [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00
> > > 00 00 00 00 42 06 63 61  .9..........B.ca
> > > [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d
> > > 05                                   {K]-.
> > >
> > >
> > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> > > will be parsed by kclient too and kclient won't give any warning, but
> > > it will corrupted the snaprealm and capsnap info in kclient.
> > >
> > >
> > > Thanks
> > >
> > > - Xiubo
> > >
> > >
> > >> Thanks,
> > >>
> > >>                  Ilya
> > >>
> >
>
>
> --
> Cheers,
> Venky
Ilya Dryomov May 17, 2023, 1:11 p.m. UTC | #8
On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 5/17/23 19:29, Ilya Dryomov wrote:
> > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> >>
> >> On 5/17/23 18:31, Ilya Dryomov wrote:
> >>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> >>>> From: Xiubo Li <xiubli@redhat.com>
> >>>>
> >>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> >>>> request may still contain a list of 'split_realms', and we need
> >>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> >>>>
> >>>> Cc: stable@vger.kernel.org
> >>>> Cc: Frank Schilder <frans@dtu.dk>
> >>>> Reported-by: Frank Schilder <frans@dtu.dk>
> >>>> URL: https://tracker.ceph.com/issues/61200
> >>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >>>> ---
> >>>>    fs/ceph/snap.c | 3 +++
> >>>>    1 file changed, 3 insertions(+)
> >>>>
> >>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>> index 0e59e95a96d9..d95dfe16b624 100644
> >>>> --- a/fs/ceph/snap.c
> >>>> +++ b/fs/ceph/snap.c
> >>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> >>>>                                   continue;
> >>>>                           adjust_snap_realm_parent(mdsc, child, realm->ino);
> >>>>                   }
> >>>> +       } else {
> >>>> +               p += sizeof(u64) * num_split_inos;
> >>>> +               p += sizeof(u64) * num_split_realms;
> >>>>           }
> >>>>
> >>>>           /*
> >>>> --
> >>>> 2.40.1
> >>>>
> >>> Hi Xiubo,
> >>>
> >>> This code appears to be very old -- it goes back to the initial commit
> >>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> >>> explanation for why this popped up only now?
> >> As I remembered we hit this before in one cu BZ last year, but I
> >> couldn't remember exactly which one.  But I am not sure whether @Jeff
> >> saw this before I joint ceph team.
> >>
> >>
> >>> Has MDS always been including split_inos and split_realms arrays in
> >>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> >>> change, I'd argue that this needs to be addressed on the MDS side.
> >> While in MDS side for the _UPDATE op it won't send the 'split_realm'
> >> list just before the commit in 2017:
> >>
> >> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> >> Author: Yan, Zheng <zyan@redhat.com>
> >> Date:   Fri Jul 21 21:40:46 2017 +0800
> >>
> >>       mds: send snap related messages centrally during mds recovery
> >>
> >>       sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> >>       clients centrally in MDCache::open_snaprealms()
> >>
> >>       Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> >>
> >> Before this commit it will only send the 'split_realm' list for the
> >> _SPLIT op.
> > It sounds like we have the culprit.  This should be treated as
> > a regression and fixed on the MDS side.  I don't see a justification
> > for putting useless data on the wire.
>
> But we still need this patch to make it to work with the old ceph releases.

This is debatable:

- given that no one noticed this for so long, the likelihood of MDS
  sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and
  split_realms arrays is very low

- MDS side fix would be backported to supported Ceph releases

- people who are running unsupported Ceph releases (i.e. aren't
  updating) are unlikely to be diligently updating their kernel clients

Thanks,

                Ilya
Xiubo Li May 17, 2023, 1:14 p.m. UTC | #9
On 5/17/23 20:53, Venky Shankar wrote:
> On Wed, May 17, 2023 at 6:18 PM Venky Shankar <vshankar@redhat.com> wrote:
>> Hey Xiubo,
>>
>> On Wed, May 17, 2023 at 4:45 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>
>>> On 5/17/23 19:04, Xiubo Li wrote:
>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>
>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>> ---
>>>>>>    fs/ceph/snap.c | 3 +++
>>>>>>    1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>> --- a/fs/ceph/snap.c
>>>>>> +++ b/fs/ceph/snap.c
>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client
>>>>>> *mdsc,
>>>>>>                                   continue;
>>>>>>                           adjust_snap_realm_parent(mdsc, child,
>>>>>> realm->ino);
>>>>>>                   }
>>>>>> +       } else {
>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>           }
>>>>>>
>>>>>>           /*
>>>>>> --
>>>>>> 2.40.1
>>>>>>
>>>>> Hi Xiubo,
>>>>>
>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>> explanation for why this popped up only now?
>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>> saw this before I joint ceph team.
>>>>
>>> @Venky,
>>>
>>> Do you remember which one ? As I remembered this is why we fixed the
>>> snaptrace issue by blocking all the IOs and at the same time
>>> blocklisting the kclient before.
>>>
>>> Before the kcleint won't dump the corrupted msg and we don't know what
>>> was wrong with the msg and also we added code to dump the msg in the
>>> above fix.
>> The "corrupted" snaptrace issue happened just after the mds asserted
>> hitting the metadata corruption (dentry first corrupted) and it
>> _seemed_ that this corruption somehow triggered a corrupted snaptrace
>> to be sent to the client.
> [sent message a bit early - cotd...]
>
> But I think you found the issue - the message dump did help and its
> not related to the dentry first corruption.

Yeah, this one is not related to dentry first corruption.

@Ilya

I have created a ceph PR to fix it on the MDS side: 
https://tracker.ceph.com/issues/61217 and 
https://github.com/ceph/ceph/pull/51536.

Let's keep this kclient patch so we stay compatible with old ceph releases. 
That way users could fix this issue by updating only the kclient node.

Thanks

- Xiubo


>>> Thanks
>>>
>>> - Xiubo
>>>
>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>> list just before the commit in 2017:
>>>>
>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>
>>>>      mds: send snap related messages centrally during mds recovery
>>>>
>>>>      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>      clients centrally in MDCache::open_snaprealms()
>>>>
>>>>      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>
>>>> Before this commit it will only send the 'split_realm' list for the
>>>> _SPLIT op.
>>>>
>>>>
>>>> The following the snaptrace:
>>>>
>>>> [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00
>>>> 00 00 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01
>>>> 00 00 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00
>>>> 00 00 00 00 00 01 00 00  ................
>>>> [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60
>>>> 93                                   ...`.
>>>> [Wed May 10 16:03:06 2023]  front: 00000000: 00 00 00 00 00 00 00 00
>>>> 00 00 00 00 00 00 00 00  ................ <<== The op is 0, which is
>>>> 'CEPH_SNAP_OP_UPDATE'
>>>> [Wed May 10 16:03:06 2023]  front: 00000010: 0c 00 00 00 88 00 00 00
>>>> d1 c0 71 38 00 01 00 00  ..........q8....           <<== The '0c' is
>>>> the split_realm number
>>>> [Wed May 10 16:03:06 2023]  front: 00000020: 22 c8 71 38 00 01 00 00
>>>> d7 c7 71 38 00 01 00 00  ".q8......q8....       <<== All the 'q8' are
>>>> the ino#
>>>> [Wed May 10 16:03:06 2023]  front: 00000030: d9 c7 71 38 00 01 00 00
>>>> d4 c7 71 38 00 01 00 00  ..q8......q8....
>>>> [Wed May 10 16:03:06 2023]  front: 00000040: f1 c0 71 38 00 01 00 00
>>>> d4 c0 71 38 00 01 00 00  ..q8......q8....
>>>> [Wed May 10 16:03:06 2023]  front: 00000050: 20 c8 71 38 00 01 00 00
>>>> 1d c8 71 38 00 01 00 00   .q8......q8....
>>>> [Wed May 10 16:03:06 2023]  front: 00000060: ec c0 71 38 00 01 00 00
>>>> d6 c0 71 38 00 01 00 00  ..q8......q8....
>>>> [Wed May 10 16:03:06 2023]  front: 00000070: ef c0 71 38 00 01 00 00
>>>> 6a 11 2d 1a 00 01 00 00  ..q8....j.-.....
>>>> [Wed May 10 16:03:06 2023]  front: 00000080: 01 00 00 00 00 00 00 00
>>>> 01 00 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 00000090: ee 01 00 00 00 00 00 00
>>>> 01 00 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 000000a0: 00 00 00 00 00 00 00 00
>>>> 01 00 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 000000b0: 01 09 00 00 00 00 00 00
>>>> 00 00 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 000000c0: 01 00 00 00 00 00 00 00
>>>> 02 09 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 000000d0: 05 00 00 00 00 00 00 00
>>>> 01 09 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 000000e0: ff 08 00 00 00 00 00 00
>>>> fd 08 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023]  front: 000000f0: fb 08 00 00 00 00 00 00
>>>> f9 08 00 00 00 00 00 00  ................
>>>> [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00
>>>> 00 00 00 00 42 06 63 61  .9..........B.ca
>>>> [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d
>>>> 05                                   {K]-.
>>>>
>>>>
>>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
>>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace
>>>> will be parsed by kclient too and kclient won't give any warning, but
>>>> it will corrupted the snaprealm and capsnap info in kclient.
>>>>
>>>>
>>>> Thanks
>>>>
>>>> - Xiubo
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>>                   Ilya
>>>>>
>>
>> --
>> Cheers,
>> Venky
>
>
Xiubo Li May 17, 2023, 1:23 p.m. UTC | #10
On 5/17/23 21:11, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/17/23 19:29, Ilya Dryomov wrote:
>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>
>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>> ---
>>>>>>     fs/ceph/snap.c | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>> --- a/fs/ceph/snap.c
>>>>>> +++ b/fs/ceph/snap.c
>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>                                    continue;
>>>>>>                            adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>                    }
>>>>>> +       } else {
>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>            }
>>>>>>
>>>>>>            /*
>>>>>> --
>>>>>> 2.40.1
>>>>>>
>>>>> Hi Xiubo,
>>>>>
>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>> explanation for why this popped up only now?
>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>> saw this before I joint ceph team.
>>>>
>>>>
>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>> list just before the commit in 2017:
>>>>
>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>
>>>>        mds: send snap related messages centrally during mds recovery
>>>>
>>>>        sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>        clients centrally in MDCache::open_snaprealms()
>>>>
>>>>        Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>
>>>> Before this commit it will only send the 'split_realm' list for the
>>>> _SPLIT op.
>>> It sounds like we have the culprit.  This should be treated as
>>> a regression and fixed on the MDS side.  I don't see a justification
>>> for putting useless data on the wire.
>> But we still need this patch to make it to work with the old ceph releases.
> This is debatable:
>
> - given that no one noticed this for so long, the likelihood of MDS
>    sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and
>    split_realms arrays is very low
>
> - MDS side fix would be backported to supported Ceph releases
>
> - people who are running unsupported Ceph releases (i.e. aren't
>    updating) are unlikely to be diligently updating their kernel clients

Yeah. Though IMO upgrading the kernel is usually safer and easier than 
upgrading the whole ceph cluster, and upgrading the ceph cluster may 
cause fs metadata corruption issues, which we have hit many times from 
customers and the ceph-users mailing list.

I will leave this to you. If it still doesn't make sense I will drop this patch.

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
Xiubo Li May 17, 2023, 1:33 p.m. UTC | #11
On 5/17/23 21:11, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/17/23 19:29, Ilya Dryomov wrote:
>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>
>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>> ---
>>>>>>     fs/ceph/snap.c | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>> --- a/fs/ceph/snap.c
>>>>>> +++ b/fs/ceph/snap.c
>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>                                    continue;
>>>>>>                            adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>                    }
>>>>>> +       } else {
>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>            }
>>>>>>
>>>>>>            /*
>>>>>> --
>>>>>> 2.40.1
>>>>>>
>>>>> Hi Xiubo,
>>>>>
>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>> explanation for why this popped up only now?
>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>> saw this before I joint ceph team.
>>>>
>>>>
>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>> list just before the commit in 2017:
>>>>
>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>
>>>>        mds: send snap related messages centrally during mds recovery
>>>>
>>>>        sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>        clients centrally in MDCache::open_snaprealms()
>>>>
>>>>        Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>
>>>> Before this commit it will only send the 'split_realm' list for the
>>>> _SPLIT op.
>>> It sounds like we have the culprit.  This should be treated as
>>> a regression and fixed on the MDS side.  I don't see a justification
>>> for putting useless data on the wire.
>> But we still need this patch to make it to work with the old ceph releases.
> This is debatable:
>
> - given that no one noticed this for so long, the likelihood of MDS
>    sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and
>    split_realms arrays is very low
>
> - MDS side fix would be backported to supported Ceph releases
>
> - people who are running unsupported Ceph releases (i.e. aren't
>    updating) are unlikely to be diligently updating their kernel clients

I just searched the ceph tracker and found another 3 trackers with the 
same issue:

https://tracker.ceph.com/issues/57817
https://tracker.ceph.com/issues/57703
https://tracker.ceph.com/issues/57686

So counting this time and the previous customer case:

https://www.spinics.net/lists/ceph-users/msg77106.html

I have seen this at least 5 times.

All of these were reproduced when doing MDS failover, and this is the root 
cause on the MDS side.

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
Ilya Dryomov May 17, 2023, 1:56 p.m. UTC | #12
On Wed, May 17, 2023 at 3:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 5/17/23 21:11, Ilya Dryomov wrote:
> > On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote:
> >>
> >> On 5/17/23 19:29, Ilya Dryomov wrote:
> >>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> >>>> On 5/17/23 18:31, Ilya Dryomov wrote:
> >>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> >>>>>> From: Xiubo Li <xiubli@redhat.com>
> >>>>>>
> >>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> >>>>>> request may still contain a list of 'split_realms', and we need
> >>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> >>>>>>
> >>>>>> Cc: stable@vger.kernel.org
> >>>>>> Cc: Frank Schilder <frans@dtu.dk>
> >>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
> >>>>>> URL: https://tracker.ceph.com/issues/61200
> >>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >>>>>> ---
> >>>>>>     fs/ceph/snap.c | 3 +++
> >>>>>>     1 file changed, 3 insertions(+)
> >>>>>>
> >>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>>>> index 0e59e95a96d9..d95dfe16b624 100644
> >>>>>> --- a/fs/ceph/snap.c
> >>>>>> +++ b/fs/ceph/snap.c
> >>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> >>>>>>                                    continue;
> >>>>>>                            adjust_snap_realm_parent(mdsc, child, realm->ino);
> >>>>>>                    }
> >>>>>> +       } else {
> >>>>>> +               p += sizeof(u64) * num_split_inos;
> >>>>>> +               p += sizeof(u64) * num_split_realms;
> >>>>>>            }
> >>>>>>
> >>>>>>            /*
> >>>>>> --
> >>>>>> 2.40.1
> >>>>>>
> >>>>> Hi Xiubo,
> >>>>>
> >>>>> This code appears to be very old -- it goes back to the initial commit
> >>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> >>>>> explanation for why this popped up only now?
> >>>> As I remembered we hit this before in one cu BZ last year, but I
> >>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
> >>>> saw this before I joint ceph team.
> >>>>
> >>>>
> >>>>> Has MDS always been including split_inos and split_realms arrays in
> >>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> >>>>> change, I'd argue that this needs to be addressed on the MDS side.
> >>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
> >>>> list just before the commit in 2017:
> >>>>
> >>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> >>>> Author: Yan, Zheng <zyan@redhat.com>
> >>>> Date:   Fri Jul 21 21:40:46 2017 +0800
> >>>>
> >>>>        mds: send snap related messages centrally during mds recovery
> >>>>
> >>>>        sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> >>>>        clients centrally in MDCache::open_snaprealms()
> >>>>
> >>>>        Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> >>>>
> >>>> Before this commit it will only send the 'split_realm' list for the
> >>>> _SPLIT op.
> >>> It sounds like we have the culprit.  This should be treated as
> >>> a regression and fixed on the MDS side.  I don't see a justification
> >>> for putting useless data on the wire.
> >> But we still need this patch to make it to work with the old ceph releases.
> > This is debatable:
> >
> > - given that no one noticed this for so long, the likelihood of MDS
> >    sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and
> >    split_realms arrays is very low
> >
> > - MDS side fix would be backported to supported Ceph releases
> >
> > - people who are running unsupported Ceph releases (i.e. aren't
> >    updating) are unlikely to be diligently updating their kernel clients
>
> Just searched the ceph tracker and I found another 3 trackers have the
> same issue:
>
> https://tracker.ceph.com/issues/57817
> https://tracker.ceph.com/issues/57703
> https://tracker.ceph.com/issues/57686
>
> So plusing this time and the previous CU case:
>
> https://www.spinics.net/lists/ceph-users/msg77106.html
>
> I have seen at least 5 times.
>
> All this are reproduced when doing MDS failover, and this is the root
> cause in MDS side.

OK, given that the fixup in the kernel client is small, it seems
justified.  But, please, add a comment in the new else branch saying
that it's there only to work around a bug in the MDS.
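
For illustration, such a comment might look roughly like this (a sketch only; the final wording is up to the patch author):

	} else {
		/*
		 * The MDS should not include the split_inos/split_realms
		 * arrays for a non-split op, but some versions do (see
		 * https://tracker.ceph.com/issues/61217).  Skip them here
		 * so the snap trace below is decoded at the right offset.
		 */
		p += sizeof(u64) * num_split_inos;
		p += sizeof(u64) * num_split_realms;
	}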

Thanks,

                Ilya
Gregory Farnum May 17, 2023, 1:58 p.m. UTC | #13
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> >
> >
> > On 5/17/23 18:31, Ilya Dryomov wrote:
> > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> > >> From: Xiubo Li <xiubli@redhat.com>
> > >>
> > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> > >> request may still contain a list of 'split_realms', and we need
> > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> > >>
> > >> Cc: stable@vger.kernel.org
> > >> Cc: Frank Schilder <frans@dtu.dk>
> > >> Reported-by: Frank Schilder <frans@dtu.dk>
> > >> URL: https://tracker.ceph.com/issues/61200
> > >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > >> ---
> > >>   fs/ceph/snap.c | 3 +++
> > >>   1 file changed, 3 insertions(+)
> > >>
> > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > >> index 0e59e95a96d9..d95dfe16b624 100644
> > >> --- a/fs/ceph/snap.c
> > >> +++ b/fs/ceph/snap.c
> > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > >>                                  continue;
> > >>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
> > >>                  }
> > >> +       } else {
> > >> +               p += sizeof(u64) * num_split_inos;
> > >> +               p += sizeof(u64) * num_split_realms;
> > >>          }
> > >>
> > >>          /*
> > >> --
> > >> 2.40.1
> > >>
> > > Hi Xiubo,
> > >
> > > This code appears to be very old -- it goes back to the initial commit
> > > 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > > explanation for why this popped up only now?
> >
> > As I remembered we hit this before in one cu BZ last year, but I
> > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > saw this before I joint ceph team.
> >
> >
> > > Has MDS always been including split_inos and split_realms arrays in
> > > !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > > change, I'd argue that this needs to be addressed on the MDS side.
> >
> > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > list just before the commit in 2017:
> >
> > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > Author: Yan, Zheng <zyan@redhat.com>
> > Date:   Fri Jul 21 21:40:46 2017 +0800
> >
> >      mds: send snap related messages centrally during mds recovery
> >
> >      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> >      clients centrally in MDCache::open_snaprealms()
> >
> >      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> >
> > Before this commit it will only send the 'split_realm' list for the
> > _SPLIT op.
>
> It sounds like we have the culprit.  This should be treated as
> a regression and fixed on the MDS side.  I don't see a justification
> for putting useless data on the wire.

I don't really understand this viewpoint. We can treat it as an MDS
regression if we want, but it's a six-year-old patch so this is in
nearly every version of server code anybody's running. Why wouldn't we
fix it on both sides?

On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> will be parsed by kclient too and kclient won't give any warning, but it
> will corrupted the snaprealm and capsnap info in kclient.

I'm a bit confused about this patch, but I also don't follow the
kernel client code much so please forgive my ignorance. The change
you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
case, so clearly the kclient *mostly* does the right thing when these
"unexpected" SPLIT ops show up. (Which leads me to assume Zheng
checked that the referenced server patch would work, and he was mostly
right.)

So what logic makes it so "split_realm == sizeof(ceph_mds_snap_realm)
+ <extra snap buffer size>" is the condition which leads to breakage,
and allows everything else to work?
-Greg
Xiubo Li May 17, 2023, 2:04 p.m. UTC | #14
On 5/17/23 21:56, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 3:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/17/23 21:11, Ilya Dryomov wrote:
>>> On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 5/17/23 19:29, Ilya Dryomov wrote:
>>>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>
>>>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>>> ---
>>>>>>>>      fs/ceph/snap.c | 3 +++
>>>>>>>>      1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>>>                                     continue;
>>>>>>>>                             adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>>>                     }
>>>>>>>> +       } else {
>>>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>>>             }
>>>>>>>>
>>>>>>>>             /*
>>>>>>>> --
>>>>>>>> 2.40.1
>>>>>>>>
>>>>>>> Hi Xiubo,
>>>>>>>
>>>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>>>> explanation for why this popped up only now?
>>>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>>>> saw this before I joint ceph team.
>>>>>>
>>>>>>
>>>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>>>> list just before the commit in 2017:
>>>>>>
>>>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>>>
>>>>>>         mds: send snap related messages centrally during mds recovery
>>>>>>
>>>>>>         sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>>>         clients centrally in MDCache::open_snaprealms()
>>>>>>
>>>>>>         Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>>>
>>>>>> Before this commit it will only send the 'split_realm' list for the
>>>>>> _SPLIT op.
>>>>> It sounds like we have the culprit.  This should be treated as
>>>>> a regression and fixed on the MDS side.  I don't see a justification
>>>>> for putting useless data on the wire.
>>>> But we still need this patch to make it to work with the old ceph releases.
>>> This is debatable:
>>>
>>> - given that no one noticed this for so long, the likelihood of MDS
>>>     sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and
>>>     split_realms arrays is very low
>>>
>>> - MDS side fix would be backported to supported Ceph releases
>>>
>>> - people who are running unsupported Ceph releases (i.e. aren't
>>>     updating) are unlikely to be diligently updating their kernel clients
>> Just searched the ceph tracker and I found another 3 trackers have the
>> same issue:
>>
>> https://tracker.ceph.com/issues/57817
>> https://tracker.ceph.com/issues/57703
>> https://tracker.ceph.com/issues/57686
>>
>> So plusing this time and the previous CU case:
>>
>> https://www.spinics.net/lists/ceph-users/msg77106.html
>>
>> I have seen at least 5 times.
>>
>> All this are reproduced when doing MDS failover, and this is the root
>> cause in MDS side.
> OK, given that the fixup in the kernel client is small, it seems
> justified.  But, please, add a comment in the new else branch saying
> that it's there only to work around a bug in the MDS.

> Thanks,
>
>                  Ilya
>
Ilya Dryomov May 17, 2023, 2:27 p.m. UTC | #15
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> > >
> > >
> > > On 5/17/23 18:31, Ilya Dryomov wrote:
> > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> > > >> From: Xiubo Li <xiubli@redhat.com>
> > > >>
> > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> > > >> request may still contain a list of 'split_realms', and we need
> > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> > > >>
> > > >> Cc: stable@vger.kernel.org
> > > >> Cc: Frank Schilder <frans@dtu.dk>
> > > >> Reported-by: Frank Schilder <frans@dtu.dk>
> > > >> URL: https://tracker.ceph.com/issues/61200
> > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > >> ---
> > > >>   fs/ceph/snap.c | 3 +++
> > > >>   1 file changed, 3 insertions(+)
> > > >>
> > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > > >> index 0e59e95a96d9..d95dfe16b624 100644
> > > >> --- a/fs/ceph/snap.c
> > > >> +++ b/fs/ceph/snap.c
> > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > > >>                                  continue;
> > > >>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
> > > >>                  }
> > > >> +       } else {
> > > >> +               p += sizeof(u64) * num_split_inos;
> > > >> +               p += sizeof(u64) * num_split_realms;
> > > >>          }
> > > >>
> > > >>          /*
> > > >> --
> > > >> 2.40.1
> > > >>
> > > > Hi Xiubo,
> > > >
> > > > This code appears to be very old -- it goes back to the initial commit
> > > > 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > > > explanation for why this popped up only now?
> > >
> > > As I remembered we hit this before in one cu BZ last year, but I
> > > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > > saw this before I joint ceph team.
> > >
> > >
> > > > Has MDS always been including split_inos and split_realms arrays in
> > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > > > change, I'd argue that this needs to be addressed on the MDS side.
> > >
> > > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > > list just before the commit in 2017:
> > >
> > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > > Author: Yan, Zheng <zyan@redhat.com>
> > > Date:   Fri Jul 21 21:40:46 2017 +0800
> > >
> > >      mds: send snap related messages centrally during mds recovery
> > >
> > >      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> > >      clients centrally in MDCache::open_snaprealms()
> > >
> > >      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > >
> > > Before this commit it will only send the 'split_realm' list for the
> > > _SPLIT op.
> >
> > It sounds like we have the culprit.  This should be treated as
> > a regression and fixed on the MDS side.  I don't see a justification
> > for putting useless data on the wire.
>
> I don't really understand this viewpoint. We can treat it as an MDS
> regression if we want, but it's a six-year-old patch so this is in
> nearly every version of server code anybody's running. Why wouldn't we
> fix it on both sides?

Well, if I didn't speak up chances are we wouldn't have identified the
regression in the MDS at all.  People seem to have this perception that
the client is somehow "easier" to fix, assume that the server is always
doing the right thing and default to patching the client.  I'm just
trying to push back on that.

In this particular case, after understanding the scope of the issue
_and_ getting a committal for the MDS side fix, I approved taking the
kernel client patch in an earlier reply.

>
> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
> > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> > extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> > will be parsed by kclient too and kclient won't give any warning, but it
> > will corrupted the snaprealm and capsnap info in kclient.
>
> I'm a bit confused about this patch, but I also don't follow the
> kernel client code much so please forgive my ignorance. The change
> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
> case, so clearly the kclient *mostly* does the right thing when these

No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
branch.

Thanks,

                Ilya
Gregory Farnum May 17, 2023, 3:03 p.m. UTC | #16
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> >
> > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > >
> > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> > > >
> > > >
> > > > On 5/17/23 18:31, Ilya Dryomov wrote:
> > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> > > > >> From: Xiubo Li <xiubli@redhat.com>
> > > > >>
> > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> > > > >> request may still contain a list of 'split_realms', and we need
> > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> > > > >>
> > > > >> Cc: stable@vger.kernel.org
> > > > >> Cc: Frank Schilder <frans@dtu.dk>
> > > > >> Reported-by: Frank Schilder <frans@dtu.dk>
> > > > >> URL: https://tracker.ceph.com/issues/61200
> > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > >> ---
> > > > >>   fs/ceph/snap.c | 3 +++
> > > > >>   1 file changed, 3 insertions(+)
> > > > >>
> > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > > > >> index 0e59e95a96d9..d95dfe16b624 100644
> > > > >> --- a/fs/ceph/snap.c
> > > > >> +++ b/fs/ceph/snap.c
> > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > > > >>                                  continue;
> > > > >>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
> > > > >>                  }
> > > > >> +       } else {
> > > > >> +               p += sizeof(u64) * num_split_inos;
> > > > >> +               p += sizeof(u64) * num_split_realms;
> > > > >>          }
> > > > >>
> > > > >>          /*
> > > > >> --
> > > > >> 2.40.1
> > > > >>
> > > > > Hi Xiubo,
> > > > >
> > > > > This code appears to be very old -- it goes back to the initial commit
> > > > > 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > > > > explanation for why this popped up only now?
> > > >
> > > > As I remembered we hit this before in one cu BZ last year, but I
> > > > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > > > saw this before I joint ceph team.
> > > >
> > > >
> > > > > Has MDS always been including split_inos and split_realms arrays in
> > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > > > > change, I'd argue that this needs to be addressed on the MDS side.
> > > >
> > > > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > > > list just before the commit in 2017:
> > > >
> > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > > > Author: Yan, Zheng <zyan@redhat.com>
> > > > Date:   Fri Jul 21 21:40:46 2017 +0800
> > > >
> > > >      mds: send snap related messages centrally during mds recovery
> > > >
> > > >      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> > > >      clients centrally in MDCache::open_snaprealms()
> > > >
> > > >      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > > >
> > > > Before this commit it will only send the 'split_realm' list for the
> > > > _SPLIT op.
> > >
> > > It sounds like we have the culprit.  This should be treated as
> > > a regression and fixed on the MDS side.  I don't see a justification
> > > for putting useless data on the wire.
> >
> > I don't really understand this viewpoint. We can treat it as an MDS
> > regression if we want, but it's a six-year-old patch so this is in
> > nearly every version of server code anybody's running. Why wouldn't we
> > fix it on both sides?
>
> Well, if I didn't speak up chances are we wouldn't have identified the
> regression in the MDS at all.  People seem to have this perception that
> the client is somehow "easier" to fix, assume that the server is always
> doing the right thing and default to patching the client.  I'm just
> trying to push back on that.
>
> In this particular case, after understanding the scope of the issue
> _and_ getting a committal for the MDS side fix, I approved taking the
> kernel client patch in an earlier reply.
>
> >
> > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
> > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> > > will be parsed by kclient too and kclient won't give any warning, but it
> > > will corrupted the snaprealm and capsnap info in kclient.
> >
> > I'm a bit confused about this patch, but I also don't follow the
> > kernel client code much so please forgive my ignorance. The change
> > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
> > case, so clearly the kclient *mostly* does the right thing when these
>
> No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
> branch.

Oh I mis-parsed the braces/spacing there.

I'm still not getting how the precise size is causing the problem —
obviously this isn't an unheard-of category of issue, but the fact
that it works until the count matches a magic number is odd. Is that
ceph_decode_need macro being called from ceph_update_snap_trace just
skipping over the split data somehow? *puzzled*
-Greg

>
> Thanks,
>
>                 Ilya
>
Gregory Farnum May 17, 2023, 3:04 p.m. UTC | #17
Just to be clear, I'd like the details here so we can see if there are
ways to prevent similar issues in the future, which I haven't heard
anybody talk about. :)

On Wed, May 17, 2023 at 8:03 AM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > >
> > > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> > > > >
> > > > >
> > > > > On 5/17/23 18:31, Ilya Dryomov wrote:
> > > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> > > > > >> From: Xiubo Li <xiubli@redhat.com>
> > > > > >>
> > > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> > > > > >> request may still contain a list of 'split_realms', and we need
> > > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> > > > > >>
> > > > > >> Cc: stable@vger.kernel.org
> > > > > >> Cc: Frank Schilder <frans@dtu.dk>
> > > > > >> Reported-by: Frank Schilder <frans@dtu.dk>
> > > > > >> URL: https://tracker.ceph.com/issues/61200
> > > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > > >> ---
> > > > > >>   fs/ceph/snap.c | 3 +++
> > > > > >>   1 file changed, 3 insertions(+)
> > > > > >>
> > > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > > > > >> index 0e59e95a96d9..d95dfe16b624 100644
> > > > > >> --- a/fs/ceph/snap.c
> > > > > >> +++ b/fs/ceph/snap.c
> > > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > > > > >>                                  continue;
> > > > > >>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
> > > > > >>                  }
> > > > > >> +       } else {
> > > > > >> +               p += sizeof(u64) * num_split_inos;
> > > > > >> +               p += sizeof(u64) * num_split_realms;
> > > > > >>          }
> > > > > >>
> > > > > >>          /*
> > > > > >> --
> > > > > >> 2.40.1
> > > > > >>
> > > > > > Hi Xiubo,
> > > > > >
> > > > > > This code appears to be very old -- it goes back to the initial commit
> > > > > > 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > > > > > explanation for why this popped up only now?
> > > > >
> > > > > As I remembered we hit this before in one cu BZ last year, but I
> > > > > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > > > > saw this before I joint ceph team.
> > > > >
> > > > >
> > > > > > Has MDS always been including split_inos and split_realms arrays in
> > > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > > > > > change, I'd argue that this needs to be addressed on the MDS side.
> > > > >
> > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > > > > list just before the commit in 2017:
> > > > >
> > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > > > > Author: Yan, Zheng <zyan@redhat.com>
> > > > > Date:   Fri Jul 21 21:40:46 2017 +0800
> > > > >
> > > > >      mds: send snap related messages centrally during mds recovery
> > > > >
> > > > >      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> > > > >      clients centrally in MDCache::open_snaprealms()
> > > > >
> > > > >      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > > > >
> > > > > Before this commit it will only send the 'split_realm' list for the
> > > > > _SPLIT op.
> > > >
> > > > It sounds like we have the culprit.  This should be treated as
> > > > a regression and fixed on the MDS side.  I don't see a justification
> > > > for putting useless data on the wire.
> > >
> > > I don't really understand this viewpoint. We can treat it as an MDS
> > > regression if we want, but it's a six-year-old patch so this is in
> > > nearly every version of server code anybody's running. Why wouldn't we
> > > fix it on both sides?
> >
> > Well, if I didn't speak up chances are we wouldn't have identified the
> > regression in the MDS at all.  People seem to have this perception that
> > the client is somehow "easier" to fix, assume that the server is always
> > doing the right thing and default to patching the client.  I'm just
> > trying to push back on that.
> >
> > In this particular case, after understanding the scope of the issue
> > _and_ getting a committal for the MDS side fix, I approved taking the
> > kernel client patch in an earlier reply.
> >
> > >
> > > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
> > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> > > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> > > > will be parsed by kclient too and kclient won't give any warning, but it
> > > > will corrupted the snaprealm and capsnap info in kclient.
> > >
> > > I'm a bit confused about this patch, but I also don't follow the
> > > kernel client code much so please forgive my ignorance. The change
> > > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
> > > case, so clearly the kclient *mostly* does the right thing when these
> >
> > No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
> > branch.
>
> Oh I mis-parsed the braces/spacing there.
>
> I'm still not getting how the precise size is causing the problem —
> obviously this isn't an unheard-of category of issue, but the fact
> that it works until the count matches a magic number is odd. Is that
> ceph_decode_need macro being called from ceph_update_snap_trace just
> skipping over the split data somehow? *puzzled*
> -Greg
>
> >
> > Thanks,
> >
> >                 Ilya
> >
Frank Schilder May 17, 2023, 3:39 p.m. UTC | #18
Hi all,

joining in here as the one who is hit pretty badly and who is also not able to upgrade ceph any time soon to a version that still receives patches.

For these two reasons alone I strongly favour fixing both sides.

Extra reading, feel free to skip.

Additional reasons for fixing both sides are (1) to have more error-tolerant code - if one side breaks or regresses, the other side still knows what to do and can report back while moving on without a fatal crash - and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice.

I personally think that (1) is generally good practice: explicitly handling seemingly unexpected cases increases overall robustness (it's a bit like raiding up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or a follow-up crash for no real reason.

The main reason we actually discovered this is that under certain rare circumstances it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is quite possible that many if not all users have these invalid snaptrace messages on their systems, but nothing else happens so they don't report anything.

The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS and then we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their system as often as we do.

Apart from that, it's not even certain that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a ceph upgrade - extra reason no. 3.

I hope this was a useful point of view from someone suffering from the condition.

Best regards and thanks for your efforts addressing this!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Ilya Dryomov May 17, 2023, 4:18 p.m. UTC | #19
On Wed, May 17, 2023 at 5:03 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> > On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
> > >
> > > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
> > > > >
> > > > >
> > > > > On 5/17/23 18:31, Ilya Dryomov wrote:
> > > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
> > > > > >> From: Xiubo Li <xiubli@redhat.com>
> > > > > >>
> > > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
> > > > > >> request may still contain a list of 'split_realms', and we need
> > > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
> > > > > >>
> > > > > >> Cc: stable@vger.kernel.org
> > > > > >> Cc: Frank Schilder <frans@dtu.dk>
> > > > > >> Reported-by: Frank Schilder <frans@dtu.dk>
> > > > > >> URL: https://tracker.ceph.com/issues/61200
> > > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > > >> ---
> > > > > >>   fs/ceph/snap.c | 3 +++
> > > > > >>   1 file changed, 3 insertions(+)
> > > > > >>
> > > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > > > > >> index 0e59e95a96d9..d95dfe16b624 100644
> > > > > >> --- a/fs/ceph/snap.c
> > > > > >> +++ b/fs/ceph/snap.c
> > > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
> > > > > >>                                  continue;
> > > > > >>                          adjust_snap_realm_parent(mdsc, child, realm->ino);
> > > > > >>                  }
> > > > > >> +       } else {
> > > > > >> +               p += sizeof(u64) * num_split_inos;
> > > > > >> +               p += sizeof(u64) * num_split_realms;
> > > > > >>          }
> > > > > >>
> > > > > >>          /*
> > > > > >> --
> > > > > >> 2.40.1
> > > > > >>
> > > > > > Hi Xiubo,
> > > > > >
> > > > > > This code appears to be very old -- it goes back to the initial commit
> > > > > > 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
> > > > > > explanation for why this popped up only now?
> > > > >
> > > > > As I remembered we hit this before in one cu BZ last year, but I
> > > > > couldn't remember exactly which one.  But I am not sure whether @Jeff
> > > > > saw this before I joint ceph team.
> > > > >
> > > > >
> > > > > > Has MDS always been including split_inos and split_realms arrays in
> > > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
> > > > > > change, I'd argue that this needs to be addressed on the MDS side.
> > > > >
> > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm'
> > > > > list just before the commit in 2017:
> > > > >
> > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4
> > > > > Author: Yan, Zheng <zyan@redhat.com>
> > > > > Date:   Fri Jul 21 21:40:46 2017 +0800
> > > > >
> > > > >      mds: send snap related messages centrally during mds recovery
> > > > >
> > > > >      sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
> > > > >      clients centrally in MDCache::open_snaprealms()
> > > > >
> > > > >      Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
> > > > >
> > > > > Before this commit it will only send the 'split_realm' list for the
> > > > > _SPLIT op.
> > > >
> > > > It sounds like we have the culprit.  This should be treated as
> > > > a regression and fixed on the MDS side.  I don't see a justification
> > > > for putting useless data on the wire.
> > >
> > > I don't really understand this viewpoint. We can treat it as an MDS
> > > regression if we want, but it's a six-year-old patch so this is in
> > > nearly every version of server code anybody's running. Why wouldn't we
> > > fix it on both sides?
> >
> > Well, if I didn't speak up chances are we wouldn't have identified the
> > regression in the MDS at all.  People seem to have this perception that
> > the client is somehow "easier" to fix, assume that the server is always
> > doing the right thing and default to patching the client.  I'm just
> > trying to push back on that.
> >
> > In this particular case, after understanding the scope of the issue
> > _and_ getting a committal for the MDS side fix, I approved taking the
> > kernel client patch in an earlier reply.
> >
> > >
> > > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
> > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
> > > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace
> > > > will be parsed by kclient too and kclient won't give any warning, but it
> > > > will corrupted the snaprealm and capsnap info in kclient.
> > >
> > > I'm a bit confused about this patch, but I also don't follow the
> > > kernel client code much so please forgive my ignorance. The change
> > > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
> > > case, so clearly the kclient *mostly* does the right thing when these
> >
> > No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
> > branch.
>
> Oh I mis-parsed the braces/spacing there.
>
> I'm still not getting how the precise size is causing the problem —
> obviously this isn't an unheard-of category of issue, but the fact
> that it works until the count matches a magic number is odd. Is that
> ceph_decode_need macro being called from ceph_update_snap_trace just
> skipping over the split data somehow? *puzzled*

I'm not sure what count or magic number you are referring to.

AFAIU the issue is that ceph_update_snap_trace() expects to be called
with "p" pointing at ceph_mds_snap_realm encoding.  In the case that is
being fixed it gets called with "p" pointing at "split_inos" array (and
"split_realms" array following that).  The exact check that throws it
off (whether ceph_decode_need() or something else) is irrelevant -- we
are already toasted by then.  In the worst case no check could fire and
ceph_update_snap_trace() could interpret the bogus array data as
a valid ceph_mds_snap_realm with who knows what consequences...
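
To make the layout explicit, this is roughly what the client sees in the
message front (sketched from memory; see ceph_fs.h and ceph_handle_snap()
for the authoritative definitions):

        struct ceph_mds_snap_head {
                __le32 op;                   /* CEPH_SNAP_OP_* */
                __le64 split;                /* ino to split off, if any */
                __le32 num_split_inos;
                __le32 num_split_realms;
                __le32 trace_len;
        } __attribute__ ((packed));

        /* followed in the message by:
         *
         *   split_inos[num_split_inos]      -- u64 each
         *   split_realms[num_split_realms]  -- u64 each
         *   snap trace                      -- trace_len bytes
         *
         * The SPLIT branch walks both arrays and leaves "p" at the snap
         * trace; the other branches never advanced "p", so
         * ceph_update_snap_trace() got the arrays instead of the trace.
         */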

Thanks,

                Ilya
Xiubo Li May 18, 2023, 12:40 a.m. UTC | #20
On 5/17/23 21:56, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 3:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/17/23 21:11, Ilya Dryomov wrote:
>>> On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>> On 5/17/23 19:29, Ilya Dryomov wrote:
>>>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>
>>>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>>> ---
>>>>>>>>      fs/ceph/snap.c | 3 +++
>>>>>>>>      1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>>>                                     continue;
>>>>>>>>                             adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>>>                     }
>>>>>>>> +       } else {
>>>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>>>             }
>>>>>>>>
>>>>>>>>             /*
>>>>>>>> --
>>>>>>>> 2.40.1
>>>>>>>>
>>>>>>> Hi Xiubo,
>>>>>>>
>>>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>>>> explanation for why this popped up only now?
>>>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>>>> saw this before I joint ceph team.
>>>>>>
>>>>>>
>>>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>>>> list just before the commit in 2017:
>>>>>>
>>>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>>>
>>>>>>         mds: send snap related messages centrally during mds recovery
>>>>>>
>>>>>>         sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>>>         clients centrally in MDCache::open_snaprealms()
>>>>>>
>>>>>>         Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>>>
>>>>>> Before this commit it will only send the 'split_realm' list for the
>>>>>> _SPLIT op.
>>>>> It sounds like we have the culprit.  This should be treated as
>>>>> a regression and fixed on the MDS side.  I don't see a justification
>>>>> for putting useless data on the wire.
>>>> But we still need this patch to make it to work with the old ceph releases.
>>> This is debatable:
>>>
>>> - given that no one noticed this for so long, the likelihood of MDS
>>>     sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and
>>>     split_realms arrays is very low
>>>
>>> - MDS side fix would be backported to supported Ceph releases
>>>
>>> - people who are running unsupported Ceph releases (i.e. aren't
>>>     updating) are unlikely to be diligently updating their kernel clients
>> Just searched the ceph tracker and I found another 3 trackers have the
>> same issue:
>>
>> https://tracker.ceph.com/issues/57817
>> https://tracker.ceph.com/issues/57703
>> https://tracker.ceph.com/issues/57686
>>
>> So plusing this time and the previous CU case:
>>
>> https://www.spinics.net/lists/ceph-users/msg77106.html
>>
>> I have seen at least 5 times.
>>
>> All this are reproduced when doing MDS failover, and this is the root
>> cause in MDS side.
> OK, given that the fixup in the kernel client is small, it seems
> justified.  But, please, add a comment in the new else branch saying
> that it's there only to work around a bug in the MDS.

Sure. Will update it in next version.
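
Just to sketch it, I'm thinking of something along these lines for the
else branch (exact comment wording to be settled in v2):

        } else {
                /*
                 * In the non-split case both lists should be empty, but
                 * some MDS versions mistakenly include the 'split_realms'
                 * list in the UPDATE message, so skip whatever is there
                 * before parsing the snaptrace.  This only works around
                 * that MDS bug.
                 */
                p += sizeof(u64) * num_split_inos;
                p += sizeof(u64) * num_split_realms;
        }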

Thanks

- Xiubo


> Thanks,
>
>                  Ilya
>
Xiubo Li May 18, 2023, 12:47 a.m. UTC | #21
On 5/17/23 22:27, Ilya Dryomov wrote:
> On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>
>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>
>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>> ---
>>>>>>    fs/ceph/snap.c | 3 +++
>>>>>>    1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>> --- a/fs/ceph/snap.c
>>>>>> +++ b/fs/ceph/snap.c
>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>                                   continue;
>>>>>>                           adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>                   }
>>>>>> +       } else {
>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>           }
>>>>>>
>>>>>>           /*
>>>>>> --
>>>>>> 2.40.1
>>>>>>
>>>>> Hi Xiubo,
>>>>>
>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>> explanation for why this popped up only now?
>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>> saw this before I joint ceph team.
>>>>
>>>>
>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>> list just before the commit in 2017:
>>>>
>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>
>>>>       mds: send snap related messages centrally during mds recovery
>>>>
>>>>       sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>       clients centrally in MDCache::open_snaprealms()
>>>>
>>>>       Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>
>>>> Before this commit it will only send the 'split_realm' list for the
>>>> _SPLIT op.
>>> It sounds like we have the culprit.  This should be treated as
>>> a regression and fixed on the MDS side.  I don't see a justification
>>> for putting useless data on the wire.
>> I don't really understand this viewpoint. We can treat it as an MDS
>> regression if we want, but it's a six-year-old patch so this is in
>> nearly every version of server code anybody's running. Why wouldn't we
>> fix it on both sides?
> Well, if I didn't speak up chances are we wouldn't have identified the
> regression in the MDS at all.  People seem to have this perception that
> the client is somehow "easier" to fix, assume that the server is always
> doing the right thing and default to patching the client.  I'm just
> trying to push back on that.
>
> In this particular case, after understanding the scope of the issue
> _and_ getting a committal for the MDS side fix, I approved taking the
> kernel client patch in an earlier reply.
>
>> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace
>>> will be parsed by kclient too and kclient won't give any warning, but it
>>> will corrupted the snaprealm and capsnap info in kclient.
>> I'm a bit confused about this patch, but I also don't follow the
>> kernel client code much so please forgive my ignorance. The change
>> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
>> case, so clearly the kclient *mostly* does the right thing when these
> No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
> branch.

Yeah, correct.

The problem is that not all the UPDATE cases in the MDS encode and pass the 
'split_realms' list to clients via the MClientSnap requests. Reading 
Zheng's commit, I think he just copied that code into one UPDATE case from 
the old SPLIT case and didn't notice that it would introduce this issue in 
the kclient.

Sending the 'split_realms' list is unnecessary and useless on the client 
side for the UPDATE case, so we should fix it on the MDS side too.

Thanks

- Xiubo

> Thanks,
>
>                  Ilya
>
Xiubo Li May 18, 2023, 1:18 a.m. UTC | #22
On 5/17/23 23:03, Gregory Farnum wrote:
> On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>> On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>>> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>
>>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>
>>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>>
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>> ---
>>>>>>>    fs/ceph/snap.c | 3 +++
>>>>>>>    1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>>> --- a/fs/ceph/snap.c
>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>>                                   continue;
>>>>>>>                           adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>>                   }
>>>>>>> +       } else {
>>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>>           }
>>>>>>>
>>>>>>>           /*
>>>>>>> --
>>>>>>> 2.40.1
>>>>>>>
>>>>>> Hi Xiubo,
>>>>>>
>>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>>> explanation for why this popped up only now?
>>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>>> saw this before I joint ceph team.
>>>>>
>>>>>
>>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>>> list just before the commit in 2017:
>>>>>
>>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>>
>>>>>       mds: send snap related messages centrally during mds recovery
>>>>>
>>>>>       sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>>       clients centrally in MDCache::open_snaprealms()
>>>>>
>>>>>       Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>>
>>>>> Before this commit it will only send the 'split_realm' list for the
>>>>> _SPLIT op.
>>>> It sounds like we have the culprit.  This should be treated as
>>>> a regression and fixed on the MDS side.  I don't see a justification
>>>> for putting useless data on the wire.
>>> I don't really understand this viewpoint. We can treat it as an MDS
>>> regression if we want, but it's a six-year-old patch so this is in
>>> nearly every version of server code anybody's running. Why wouldn't we
>>> fix it on both sides?
>> Well, if I didn't speak up chances are we wouldn't have identified the
>> regression in the MDS at all.  People seem to have this perception that
>> the client is somehow "easier" to fix, assume that the server is always
>> doing the right thing and default to patching the client.  I'm just
>> trying to push back on that.
>>
>> In this particular case, after understanding the scope of the issue
>> _and_ getting a committal for the MDS side fix, I approved taking the
>> kernel client patch in an earlier reply.
>>
>>> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
>>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
>>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace
>>>> will be parsed by kclient too and kclient won't give any warning, but it
>>>> will corrupted the snaprealm and capsnap info in kclient.
>>> I'm a bit confused about this patch, but I also don't follow the
>>> kernel client code much so please forgive my ignorance. The change
>>> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
>>> case, so clearly the kclient *mostly* does the right thing when these
>> No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
>> branch.
> Oh I mis-parsed the braces/spacing there.
>
> I'm still not getting how the precise size is causing the problem —
> obviously this isn't an unheard-of category of issue, but the fact
> that it works until the count matches a magic number is odd. Is that
> ceph_decode_need macro being called from ceph_update_snap_trace just
> skipping over the split data somehow? *puzzled*

Yeah, it's called, and this is where the corrupted snaptrace is reported.

ceph_update_snap_trace() parses the snaptrace in a loop, and if there are 
still extra bytes left over it will try to parse them as yet another snap 
realm record. But the leftover bytes are not enough for a complete record, 
so the message is reported as a bad/corrupt msg.
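
Roughly, the loop looks like this (paraphrasing fs/ceph/snap.c from memory,
not the literal upstream code):

        struct ceph_mds_snap_realm *ri;   /* wire-format realm record */

more:
        /* is there room for one more ceph_mds_snap_realm header? */
        ceph_decode_need(&p, e, sizeof(*ri), bad);
        ri = p;
        p += sizeof(*ri);
        /* ...and for its snap and prior-parent-snap lists? */
        ceph_decode_need(&p, e,
                         sizeof(u64) * (le32_to_cpu(ri->num_snaps) +
                                        le32_to_cpu(ri->num_prior_parent_snaps)),
                         bad);
        p += sizeof(u64) * le32_to_cpu(ri->num_snaps);
        p += sizeof(u64) * le32_to_cpu(ri->num_prior_parent_snaps);

        /* ... look up or create the realm described by *ri and update it ... */

        if (p < e)
                goto more;      /* leftover bytes are treated as another realm */
        return 0;

bad:
        /* not enough bytes left -> reported as a corrupt snap trace */
        return -EINVAL;

So if the extra 'split_inos'/'split_realms' bytes happen to line up exactly 
with what one more realm record plus its snap lists would need, the loop 
consumes them without complaint - the silent-corruption case I mentioned 
earlier. With any other size one of the ceph_decode_need() checks fails and 
the msg is flagged as corrupt.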

Thanks

- Xiubo

> -Greg
>
>> Thanks,
>>
>>                  Ilya
>>
Xiubo Li May 18, 2023, 1:23 a.m. UTC | #23
On 5/17/23 23:04, Gregory Farnum wrote:
> Just to be clear, I'd like the details here so we can see if there are
> ways to prevent similar issues in the future, which I haven't heard
> anybody talk about. :)

We hit a similar issue with Venky in 
https://github.com/ceph/ceph/pull/48382. That PR just added two extra 
members, and the old kclient couldn't recognize them and incorrectly 
parsed them as a new snaptrace.

Venky fixed it by checking the peer client's feature bit before 
sending the msgs.

Thanks

- Xiubo

> On Wed, May 17, 2023 at 8:03 AM Gregory Farnum <gfarnum@redhat.com> wrote:
>> On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>> On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>>>> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>>>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote:
>>>>>>
>>>>>> On 5/17/23 18:31, Ilya Dryomov wrote:
>>>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote:
>>>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>>>
>>>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
>>>>>>>> request may still contain a list of 'split_realms', and we need
>>>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>> Cc: Frank Schilder <frans@dtu.dk>
>>>>>>>> Reported-by: Frank Schilder <frans@dtu.dk>
>>>>>>>> URL: https://tracker.ceph.com/issues/61200
>>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>>>>> ---
>>>>>>>>    fs/ceph/snap.c | 3 +++
>>>>>>>>    1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
>>>>>>>> index 0e59e95a96d9..d95dfe16b624 100644
>>>>>>>> --- a/fs/ceph/snap.c
>>>>>>>> +++ b/fs/ceph/snap.c
>>>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
>>>>>>>>                                   continue;
>>>>>>>>                           adjust_snap_realm_parent(mdsc, child, realm->ino);
>>>>>>>>                   }
>>>>>>>> +       } else {
>>>>>>>> +               p += sizeof(u64) * num_split_inos;
>>>>>>>> +               p += sizeof(u64) * num_split_realms;
>>>>>>>>           }
>>>>>>>>
>>>>>>>>           /*
>>>>>>>> --
>>>>>>>> 2.40.1
>>>>>>>>
>>>>>>> Hi Xiubo,
>>>>>>>
>>>>>>> This code appears to be very old -- it goes back to the initial commit
>>>>>>> 963b61eb041e ("ceph: snapshot management") in 2009.  Do you have an
>>>>>>> explanation for why this popped up only now?
>>>>>> As I remembered we hit this before in one cu BZ last year, but I
>>>>>> couldn't remember exactly which one.  But I am not sure whether @Jeff
>>>>>> saw this before I joint ceph team.
>>>>>>
>>>>>>
>>>>>>> Has MDS always been including split_inos and split_realms arrays in
>>>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change?  If it's a recent
>>>>>>> change, I'd argue that this needs to be addressed on the MDS side.
>>>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm'
>>>>>> list just before the commit in 2017:
>>>>>>
>>>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4
>>>>>> Author: Yan, Zheng <zyan@redhat.com>
>>>>>> Date:   Fri Jul 21 21:40:46 2017 +0800
>>>>>>
>>>>>>       mds: send snap related messages centrally during mds recovery
>>>>>>
>>>>>>       sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to
>>>>>>       clients centrally in MDCache::open_snaprealms()
>>>>>>
>>>>>>       Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
>>>>>>
>>>>>> Before this commit it will only send the 'split_realm' list for the
>>>>>> _SPLIT op.
>>>>> It sounds like we have the culprit.  This should be treated as
>>>>> a regression and fixed on the MDS side.  I don't see a justification
>>>>> for putting useless data on the wire.
>>>> I don't really understand this viewpoint. We can treat it as an MDS
>>>> regression if we want, but it's a six-year-old patch so this is in
>>>> nearly every version of server code anybody's running. Why wouldn't we
>>>> fix it on both sides?
>>> Well, if I didn't speak up chances are we wouldn't have identified the
>>> regression in the MDS at all.  People seem to have this perception that
>>> the client is somehow "easier" to fix, assume that the server is always
>>> doing the right thing and default to patching the client.  I'm just
>>> trying to push back on that.
>>>
>>> In this particular case, after understanding the scope of the issue
>>> _and_ getting a committal for the MDS side fix, I approved taking the
>>> kernel client patch in an earlier reply.
>>>
>>>> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote:
>>>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) +
>>>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace
>>>>> will be parsed by kclient too and kclient won't give any warning, but it
>>>>> will corrupted the snaprealm and capsnap info in kclient.
>>>> I'm a bit confused about this patch, but I also don't follow the
>>>> kernel client code much so please forgive my ignorance. The change
>>>> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT
>>>> case, so clearly the kclient *mostly* does the right thing when these
>>> No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT"
>>> branch.
>> Oh I mis-parsed the braces/spacing there.
>>
>> I'm still not getting how the precise size is causing the problem —
>> obviously this isn't an unheard-of category of issue, but the fact
>> that it works until the count matches a magic number is odd. Is that
>> ceph_decode_need macro being called from ceph_update_snap_trace just
>> skipping over the split data somehow? *puzzled*
>> -Greg
>>
>>> Thanks,
>>>
>>>                  Ilya
>>>
Xiubo Li May 18, 2023, 1:24 a.m. UTC | #24
Thanks Frank for your feedback.

On 5/17/23 23:39, Frank Schilder wrote:
> Hi all,
>
> joining in here as the one who is hit pretty badly and also not being able to upgrade ceph any time soon to a version receiving patches.
>
> For these two reasons alone I strongly favour fixing both sides.
>
> Extra reading, feel free to skip.
>
> Additional reasons for fixing both sides are (1) to have more error tolerant code - if one side breaks/regresses the other side still knows what to do and can report back while moving on without a fatal crash and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice.
>
> I personally think that (1) is generally good practice, explicitly handling seemingly unexpected cases increases overall robustness (its a bit like raiding up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or follow-up crash for no real reason.
>
> The main reason we actually discovered this is that under certain rare circumstances it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is very well possible that many if not all users have these invalid snaptrace message on their system, but nothing else happens so they don't report anything.
>
> The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS and then we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their system as often as we do.
>
> Apart from that, its not even sure that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a ceph upgrade - extra reason no. 3.
>
> I hope this was a useful point of view from someone suffering from the condition.
>
> Best regards and thanks for your efforts addressing this!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
diff mbox series

Patch

diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 0e59e95a96d9..d95dfe16b624 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -1114,6 +1114,9 @@  void ceph_handle_snap(struct ceph_mds_client *mdsc,
 				continue;
 			adjust_snap_realm_parent(mdsc, child, realm->ino);
 		}
+	} else {
+		p += sizeof(u64) * num_split_inos;
+		p += sizeof(u64) * num_split_realms;
 	}
 
 	/*