Message ID | 20230517052404.99904-1-xiubli@redhat.com (mailing list archive)
---|---
State | New, archived
Series | ceph: force updating the msg pointer in non-split case
On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > From: Xiubo Li <xiubli@redhat.com> > > When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > request may still contain a list of 'split_realms', and we need > to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > Cc: stable@vger.kernel.org > Cc: Frank Schilder <frans@dtu.dk> > Reported-by: Frank Schilder <frans@dtu.dk> > URL: https://tracker.ceph.com/issues/61200 > Signed-off-by: Xiubo Li <xiubli@redhat.com> > --- > fs/ceph/snap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > index 0e59e95a96d9..d95dfe16b624 100644 > --- a/fs/ceph/snap.c > +++ b/fs/ceph/snap.c > @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > continue; > adjust_snap_realm_parent(mdsc, child, realm->ino); > } > + } else { > + p += sizeof(u64) * num_split_inos; > + p += sizeof(u64) * num_split_realms; > } > > /* > -- > 2.40.1 > Hi Xiubo, This code appears to be very old -- it goes back to the initial commit 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an explanation for why this popped up only now? Has MDS always been including split_inos and split_realms arrays in !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent change, I'd argue that this needs to be addressed on the MDS side. Thanks, Ilya
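The message layout the patch deals with can be illustrated with a small standalone sketch. This is not the actual fs/ceph/snap.c code: the struct below is a simplified stand-in for ceph_mds_snap_head, the op constants are assumed for illustration only, and locate_snap_trace() is a hypothetical helper. It just shows why the cursor has to be advanced past split_inos[] and split_realms[] even for non-split ops before the snap trace is parsed.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CEPH_SNAP_OP_UPDATE 0   /* values assumed for illustration */
#define CEPH_SNAP_OP_SPLIT  3

/* simplified stand-in for ceph_mds_snap_head */
struct snap_head {
	uint32_t op;
	uint64_t split;
	uint32_t num_split_inos;
	uint32_t num_split_realms;
	uint32_t trace_len;
} __attribute__((packed));

/* hypothetical helper: return where the snap trace starts in 'front' */
static const uint8_t *locate_snap_trace(const uint8_t *front)
{
	const struct snap_head *h = (const struct snap_head *)front;
	const uint8_t *p = front + sizeof(*h);

	if (h->op == CEPH_SNAP_OP_SPLIT) {
		/* the real handler consumes these arrays while adjusting
		 * child realms */
		p += sizeof(uint64_t) * h->num_split_inos;
		p += sizeof(uint64_t) * h->num_split_realms;
	} else {
		/* the fix from this patch: skip the arrays anyway so 'p'
		 * lands on the snap trace instead of on the array data */
		p += sizeof(uint64_t) * h->num_split_inos;
		p += sizeof(uint64_t) * h->num_split_realms;
	}
	return p;
}

int main(void)
{
	uint8_t front[256] = { 0 };
	struct snap_head h = { .op = CEPH_SNAP_OP_UPDATE,
			       .num_split_inos = 0,
			       .num_split_realms = 12 };

	memcpy(front, &h, sizeof(h));
	printf("snap trace starts at offset %td\n",
	       locate_snap_trace(front) - front);   /* 24 + 12 * 8 = 120 */
	return 0;
}
```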
On 5/17/23 18:31, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >> From: Xiubo Li <xiubli@redhat.com> >> >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >> request may still contain a list of 'split_realms', and we need >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >> >> Cc: stable@vger.kernel.org >> Cc: Frank Schilder <frans@dtu.dk> >> Reported-by: Frank Schilder <frans@dtu.dk> >> URL: https://tracker.ceph.com/issues/61200 >> Signed-off-by: Xiubo Li <xiubli@redhat.com> >> --- >> fs/ceph/snap.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >> index 0e59e95a96d9..d95dfe16b624 100644 >> --- a/fs/ceph/snap.c >> +++ b/fs/ceph/snap.c >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >> continue; >> adjust_snap_realm_parent(mdsc, child, realm->ino); >> } >> + } else { >> + p += sizeof(u64) * num_split_inos; >> + p += sizeof(u64) * num_split_realms; >> } >> >> /* >> -- >> 2.40.1 >> > Hi Xiubo, > > This code appears to be very old -- it goes back to the initial commit > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > explanation for why this popped up only now? As I remembered we hit this before in one cu BZ last year, but I couldn't remember exactly which one. But I am not sure whether @Jeff saw this before I joint ceph team. > Has MDS always been including split_inos and split_realms arrays in > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > change, I'd argue that this needs to be addressed on the MDS side. While in MDS side for the _UPDATE op it won't send the 'split_realm' list just before the commit in 2017: commit 93e7267757508520dfc22cff1ab20558bd4a44d4 Author: Yan, Zheng <zyan@redhat.com> Date: Fri Jul 21 21:40:46 2017 +0800 mds: send snap related messages centrally during mds recovery sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to clients centrally in MDCache::open_snaprealms() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Before this commit it will only send the 'split_realm' list for the _SPLIT op. The following the snaptrace: [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 00 00 00 00 00 01 00 00 ................ [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 93 ...`. [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is 'CEPH_SNAP_OP_UPDATE' [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is the split_realm number [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are the ino# [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 d4 c7 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 d4 c0 71 38 00 01 00 00 ..q8......q8.... [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 1d c8 71 38 00 01 00 00 .q8......q8.... [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 d6 c0 71 38 00 01 00 00 ..q8......q8.... 
[Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 02 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 01 09 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 fd 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 f9 08 00 00 00 00 00 00 ................ [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 00 00 00 00 42 06 63 61 .9..........B.ca [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d 05 {K]-. And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + extra snap buffer size by coincidence, the above 'corrupted' snaptrace will be parsed by kclient too and kclient won't give any warning, but it will corrupted the snaprealm and capsnap info in kclient. Thanks - Xiubo > Thanks, > > Ilya >
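As a reading aid for the dump above, here is a hedged sketch that decodes the first 24 bytes of the front, assuming the usual ceph_mds_snap_head field order (op, split, num_split_inos, num_split_realms, trace_len, all little-endian); the exact layout should be checked against ceph_fs.h rather than taken from this example.

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t le32(const uint8_t *p)
{
	return p[0] | p[1] << 8 | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(void)
{
	/* first 24 bytes of the 'front' dump above */
	const uint8_t front[] = {
		0x00, 0x00, 0x00, 0x00,                         /* op = 0 (UPDATE) */
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* split ino = 0 */
		0x00, 0x00, 0x00, 0x00,                         /* num_split_inos = 0 */
		0x0c, 0x00, 0x00, 0x00,                         /* num_split_realms = 12 */
		0x88, 0x00, 0x00, 0x00,                         /* trace_len = 136 */
	};

	printf("op=%u num_split_inos=%u num_split_realms=%u trace_len=%u\n",
	       le32(front), le32(front + 12), le32(front + 16), le32(front + 20));
	return 0;
}
```

With num_split_inos = 0 and num_split_realms = 12, the snap trace would start at offset 24 + 12 * 8 = 120, which is consistent with trace_len = 136 for a 256-byte front.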
On 5/17/23 19:04, Xiubo Li wrote: > > On 5/17/23 18:31, Ilya Dryomov wrote: >> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>> From: Xiubo Li <xiubli@redhat.com> >>> >>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>> request may still contain a list of 'split_realms', and we need >>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>> >>> Cc: stable@vger.kernel.org >>> Cc: Frank Schilder <frans@dtu.dk> >>> Reported-by: Frank Schilder <frans@dtu.dk> >>> URL: https://tracker.ceph.com/issues/61200 >>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>> --- >>> fs/ceph/snap.c | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>> index 0e59e95a96d9..d95dfe16b624 100644 >>> --- a/fs/ceph/snap.c >>> +++ b/fs/ceph/snap.c >>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client >>> *mdsc, >>> continue; >>> adjust_snap_realm_parent(mdsc, child, >>> realm->ino); >>> } >>> + } else { >>> + p += sizeof(u64) * num_split_inos; >>> + p += sizeof(u64) * num_split_realms; >>> } >>> >>> /* >>> -- >>> 2.40.1 >>> >> Hi Xiubo, >> >> This code appears to be very old -- it goes back to the initial commit >> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >> explanation for why this popped up only now? > > As I remembered we hit this before in one cu BZ last year, but I > couldn't remember exactly which one. But I am not sure whether @Jeff > saw this before I joint ceph team. > @Venky, Do you remember which one ? As I remembered this is why we fixed the snaptrace issue by blocking all the IOs and at the same time blocklisting the kclient before. Before the kcleint won't dump the corrupted msg and we don't know what was wrong with the msg and also we added code to dump the msg in the above fix. Thanks - Xiubo > >> Has MDS always been including split_inos and split_realms arrays in >> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >> change, I'd argue that this needs to be addressed on the MDS side. > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > list just before the commit in 2017: > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > Author: Yan, Zheng <zyan@redhat.com> > Date: Fri Jul 21 21:40:46 2017 +0800 > > mds: send snap related messages centrally during mds recovery > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > clients centrally in MDCache::open_snaprealms() > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > Before this commit it will only send the 'split_realm' list for the > _SPLIT op. > > > The following the snaptrace: > > [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 > 00 00 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 > 00 00 00 00 00 01 00 00 ................ > [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 > 93 ...`. > [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is > 'CEPH_SNAP_OP_UPDATE' > [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 > d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is > the split_realm number > [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 > d7 c7 71 38 00 01 00 00 ".q8......q8.... 
<<== All the 'q8' are > the ino# > [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 > d4 c7 71 38 00 01 00 00 ..q8......q8.... > [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 > d4 c0 71 38 00 01 00 00 ..q8......q8.... > [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 > 1d c8 71 38 00 01 00 00 .q8......q8.... > [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 > d6 c0 71 38 00 01 00 00 ..q8......q8.... > [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 > 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... > [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 > 01 00 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 > 01 00 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 > 01 00 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 > 02 09 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 > 01 09 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 > fd 08 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 > f9 08 00 00 00 00 00 00 ................ > [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 > 00 00 00 00 42 06 63 61 .9..........B.ca > [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d > 05 {K]-. > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > will be parsed by kclient too and kclient won't give any warning, but > it will corrupted the snaprealm and capsnap info in kclient. > > > Thanks > > - Xiubo > > >> Thanks, >> >> Ilya >>
On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > >> From: Xiubo Li <xiubli@redhat.com> > >> > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > >> request may still contain a list of 'split_realms', and we need > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > >> > >> Cc: stable@vger.kernel.org > >> Cc: Frank Schilder <frans@dtu.dk> > >> Reported-by: Frank Schilder <frans@dtu.dk> > >> URL: https://tracker.ceph.com/issues/61200 > >> Signed-off-by: Xiubo Li <xiubli@redhat.com> > >> --- > >> fs/ceph/snap.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > >> index 0e59e95a96d9..d95dfe16b624 100644 > >> --- a/fs/ceph/snap.c > >> +++ b/fs/ceph/snap.c > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > >> continue; > >> adjust_snap_realm_parent(mdsc, child, realm->ino); > >> } > >> + } else { > >> + p += sizeof(u64) * num_split_inos; > >> + p += sizeof(u64) * num_split_realms; > >> } > >> > >> /* > >> -- > >> 2.40.1 > >> > > Hi Xiubo, > > > > This code appears to be very old -- it goes back to the initial commit > > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > explanation for why this popped up only now? > > As I remembered we hit this before in one cu BZ last year, but I > couldn't remember exactly which one. But I am not sure whether @Jeff > saw this before I joint ceph team. > > > > Has MDS always been including split_inos and split_realms arrays in > > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > change, I'd argue that this needs to be addressed on the MDS side. > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > list just before the commit in 2017: > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > Author: Yan, Zheng <zyan@redhat.com> > Date: Fri Jul 21 21:40:46 2017 +0800 > > mds: send snap related messages centrally during mds recovery > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > clients centrally in MDCache::open_snaprealms() > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > Before this commit it will only send the 'split_realm' list for the > _SPLIT op. It sounds like we have the culprit. This should be treated as a regression and fixed on the MDS side. I don't see a justification for putting useless data on the wire. Thanks, Ilya
On 5/17/23 19:29, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >> >> On 5/17/23 18:31, Ilya Dryomov wrote: >>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>> From: Xiubo Li <xiubli@redhat.com> >>>> >>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>> request may still contain a list of 'split_realms', and we need >>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>> >>>> Cc: stable@vger.kernel.org >>>> Cc: Frank Schilder <frans@dtu.dk> >>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>> URL: https://tracker.ceph.com/issues/61200 >>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>> --- >>>> fs/ceph/snap.c | 3 +++ >>>> 1 file changed, 3 insertions(+) >>>> >>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>> --- a/fs/ceph/snap.c >>>> +++ b/fs/ceph/snap.c >>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>> continue; >>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>> } >>>> + } else { >>>> + p += sizeof(u64) * num_split_inos; >>>> + p += sizeof(u64) * num_split_realms; >>>> } >>>> >>>> /* >>>> -- >>>> 2.40.1 >>>> >>> Hi Xiubo, >>> >>> This code appears to be very old -- it goes back to the initial commit >>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>> explanation for why this popped up only now? >> As I remembered we hit this before in one cu BZ last year, but I >> couldn't remember exactly which one. But I am not sure whether @Jeff >> saw this before I joint ceph team. >> >> >>> Has MDS always been including split_inos and split_realms arrays in >>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>> change, I'd argue that this needs to be addressed on the MDS side. >> While in MDS side for the _UPDATE op it won't send the 'split_realm' >> list just before the commit in 2017: >> >> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >> Author: Yan, Zheng <zyan@redhat.com> >> Date: Fri Jul 21 21:40:46 2017 +0800 >> >> mds: send snap related messages centrally during mds recovery >> >> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >> clients centrally in MDCache::open_snaprealms() >> >> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >> >> Before this commit it will only send the 'split_realm' list for the >> _SPLIT op. > It sounds like we have the culprit. This should be treated as > a regression and fixed on the MDS side. I don't see a justification > for putting useless data on the wire. But we still need this patch to make it to work with the old ceph releases. Thanks - Xiubo > Thanks, > > Ilya >
Hey Xiubo, On Wed, May 17, 2023 at 4:45 PM Xiubo Li <xiubli@redhat.com> wrote: > > > On 5/17/23 19:04, Xiubo Li wrote: > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > >> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > >>> From: Xiubo Li <xiubli@redhat.com> > >>> > >>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > >>> request may still contain a list of 'split_realms', and we need > >>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > >>> > >>> Cc: stable@vger.kernel.org > >>> Cc: Frank Schilder <frans@dtu.dk> > >>> Reported-by: Frank Schilder <frans@dtu.dk> > >>> URL: https://tracker.ceph.com/issues/61200 > >>> Signed-off-by: Xiubo Li <xiubli@redhat.com> > >>> --- > >>> fs/ceph/snap.c | 3 +++ > >>> 1 file changed, 3 insertions(+) > >>> > >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > >>> index 0e59e95a96d9..d95dfe16b624 100644 > >>> --- a/fs/ceph/snap.c > >>> +++ b/fs/ceph/snap.c > >>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client > >>> *mdsc, > >>> continue; > >>> adjust_snap_realm_parent(mdsc, child, > >>> realm->ino); > >>> } > >>> + } else { > >>> + p += sizeof(u64) * num_split_inos; > >>> + p += sizeof(u64) * num_split_realms; > >>> } > >>> > >>> /* > >>> -- > >>> 2.40.1 > >>> > >> Hi Xiubo, > >> > >> This code appears to be very old -- it goes back to the initial commit > >> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > >> explanation for why this popped up only now? > > > > As I remembered we hit this before in one cu BZ last year, but I > > couldn't remember exactly which one. But I am not sure whether @Jeff > > saw this before I joint ceph team. > > > @Venky, > > Do you remember which one ? As I remembered this is why we fixed the > snaptrace issue by blocking all the IOs and at the same time > blocklisting the kclient before. > > Before the kcleint won't dump the corrupted msg and we don't know what > was wrong with the msg and also we added code to dump the msg in the > above fix. The "corrupted" snaptrace issue happened just after the mds asserted hitting the metadata corruption (dentry first corrupted) and it _seemed_ that this corruption somehow triggered a corrupted snaptrace to be sent to the client. > > Thanks > > - Xiubo > > > > >> Has MDS always been including split_inos and split_realms arrays in > >> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > >> change, I'd argue that this needs to be addressed on the MDS side. > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > list just before the commit in 2017: > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > Author: Yan, Zheng <zyan@redhat.com> > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > mds: send snap related messages centrally during mds recovery > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > clients centrally in MDCache::open_snaprealms() > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > Before this commit it will only send the 'split_realm' list for the > > _SPLIT op. > > > > > > The following the snaptrace: > > > > [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 > > 00 00 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 > > 00 00 00 00 00 01 00 00 ................ 
> > [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 > > 93 ...`. > > [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is > > 'CEPH_SNAP_OP_UPDATE' > > [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 > > d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is > > the split_realm number > > [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 > > d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are > > the ino# > > [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 > > d4 c7 71 38 00 01 00 00 ..q8......q8.... > > [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 > > d4 c0 71 38 00 01 00 00 ..q8......q8.... > > [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 > > 1d c8 71 38 00 01 00 00 .q8......q8.... > > [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 > > d6 c0 71 38 00 01 00 00 ..q8......q8.... > > [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 > > 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... > > [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 > > 01 00 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 > > 01 00 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 > > 01 00 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 > > 02 09 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 > > 01 09 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 > > fd 08 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 > > f9 08 00 00 00 00 00 00 ................ > > [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 > > 00 00 00 00 42 06 63 61 .9..........B.ca > > [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d > > 05 {K]-. > > > > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > > will be parsed by kclient too and kclient won't give any warning, but > > it will corrupted the snaprealm and capsnap info in kclient. > > > > > > Thanks > > > > - Xiubo > > > > > >> Thanks, > >> > >> Ilya > >> >
On Wed, May 17, 2023 at 6:18 PM Venky Shankar <vshankar@redhat.com> wrote: > > Hey Xiubo, > > On Wed, May 17, 2023 at 4:45 PM Xiubo Li <xiubli@redhat.com> wrote: > > > > > > On 5/17/23 19:04, Xiubo Li wrote: > > > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > >> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > >>> From: Xiubo Li <xiubli@redhat.com> > > >>> > > >>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > > >>> request may still contain a list of 'split_realms', and we need > > >>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > >>> > > >>> Cc: stable@vger.kernel.org > > >>> Cc: Frank Schilder <frans@dtu.dk> > > >>> Reported-by: Frank Schilder <frans@dtu.dk> > > >>> URL: https://tracker.ceph.com/issues/61200 > > >>> Signed-off-by: Xiubo Li <xiubli@redhat.com> > > >>> --- > > >>> fs/ceph/snap.c | 3 +++ > > >>> 1 file changed, 3 insertions(+) > > >>> > > >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > > >>> index 0e59e95a96d9..d95dfe16b624 100644 > > >>> --- a/fs/ceph/snap.c > > >>> +++ b/fs/ceph/snap.c > > >>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client > > >>> *mdsc, > > >>> continue; > > >>> adjust_snap_realm_parent(mdsc, child, > > >>> realm->ino); > > >>> } > > >>> + } else { > > >>> + p += sizeof(u64) * num_split_inos; > > >>> + p += sizeof(u64) * num_split_realms; > > >>> } > > >>> > > >>> /* > > >>> -- > > >>> 2.40.1 > > >>> > > >> Hi Xiubo, > > >> > > >> This code appears to be very old -- it goes back to the initial commit > > >> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > >> explanation for why this popped up only now? > > > > > > As I remembered we hit this before in one cu BZ last year, but I > > > couldn't remember exactly which one. But I am not sure whether @Jeff > > > saw this before I joint ceph team. > > > > > @Venky, > > > > Do you remember which one ? As I remembered this is why we fixed the > > snaptrace issue by blocking all the IOs and at the same time > > blocklisting the kclient before. > > > > Before the kcleint won't dump the corrupted msg and we don't know what > > was wrong with the msg and also we added code to dump the msg in the > > above fix. > > The "corrupted" snaptrace issue happened just after the mds asserted > hitting the metadata corruption (dentry first corrupted) and it > _seemed_ that this corruption somehow triggered a corrupted snaptrace > to be sent to the client. [sent message a bit early - cotd...] But I think you found the issue - the message dump did help and its not related to the dentry first corruption. > > > > > Thanks > > > > - Xiubo > > > > > > > >> Has MDS always been including split_inos and split_realms arrays in > > >> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > >> change, I'd argue that this needs to be addressed on the MDS side. > > > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > > list just before the commit in 2017: > > > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > > Author: Yan, Zheng <zyan@redhat.com> > > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > > > mds: send snap related messages centrally during mds recovery > > > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > > clients centrally in MDCache::open_snaprealms() > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > > > Before this commit it will only send the 'split_realm' list for the > > > _SPLIT op. 
> > > > > > > > > The following the snaptrace: > > > > > > [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 > > > 00 00 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 > > > 00 00 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 > > > 00 00 00 00 00 01 00 00 ................ > > > [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 > > > 93 ...`. > > > [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 > > > 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is > > > 'CEPH_SNAP_OP_UPDATE' > > > [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 > > > d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is > > > the split_realm number > > > [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 > > > d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are > > > the ino# > > > [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 > > > d4 c7 71 38 00 01 00 00 ..q8......q8.... > > > [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 > > > d4 c0 71 38 00 01 00 00 ..q8......q8.... > > > [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 > > > 1d c8 71 38 00 01 00 00 .q8......q8.... > > > [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 > > > d6 c0 71 38 00 01 00 00 ..q8......q8.... > > > [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 > > > 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... > > > [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 > > > 01 00 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 > > > 01 00 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 > > > 01 00 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 > > > 00 00 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 > > > 02 09 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 > > > 01 09 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 > > > fd 08 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 > > > f9 08 00 00 00 00 00 00 ................ > > > [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 > > > 00 00 00 00 42 06 63 61 .9..........B.ca > > > [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d > > > 05 {K]-. > > > > > > > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > > > will be parsed by kclient too and kclient won't give any warning, but > > > it will corrupted the snaprealm and capsnap info in kclient. > > > > > > > > > Thanks > > > > > > - Xiubo > > > > > > > > >> Thanks, > > >> > > >> Ilya > > >> > > > > > -- > Cheers, > Venky
On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote: > > > On 5/17/23 19:29, Ilya Dryomov wrote: > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > >> > >> On 5/17/23 18:31, Ilya Dryomov wrote: > >>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > >>>> From: Xiubo Li <xiubli@redhat.com> > >>>> > >>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > >>>> request may still contain a list of 'split_realms', and we need > >>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > >>>> > >>>> Cc: stable@vger.kernel.org > >>>> Cc: Frank Schilder <frans@dtu.dk> > >>>> Reported-by: Frank Schilder <frans@dtu.dk> > >>>> URL: https://tracker.ceph.com/issues/61200 > >>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> > >>>> --- > >>>> fs/ceph/snap.c | 3 +++ > >>>> 1 file changed, 3 insertions(+) > >>>> > >>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > >>>> index 0e59e95a96d9..d95dfe16b624 100644 > >>>> --- a/fs/ceph/snap.c > >>>> +++ b/fs/ceph/snap.c > >>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > >>>> continue; > >>>> adjust_snap_realm_parent(mdsc, child, realm->ino); > >>>> } > >>>> + } else { > >>>> + p += sizeof(u64) * num_split_inos; > >>>> + p += sizeof(u64) * num_split_realms; > >>>> } > >>>> > >>>> /* > >>>> -- > >>>> 2.40.1 > >>>> > >>> Hi Xiubo, > >>> > >>> This code appears to be very old -- it goes back to the initial commit > >>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > >>> explanation for why this popped up only now? > >> As I remembered we hit this before in one cu BZ last year, but I > >> couldn't remember exactly which one. But I am not sure whether @Jeff > >> saw this before I joint ceph team. > >> > >> > >>> Has MDS always been including split_inos and split_realms arrays in > >>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > >>> change, I'd argue that this needs to be addressed on the MDS side. > >> While in MDS side for the _UPDATE op it won't send the 'split_realm' > >> list just before the commit in 2017: > >> > >> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > >> Author: Yan, Zheng <zyan@redhat.com> > >> Date: Fri Jul 21 21:40:46 2017 +0800 > >> > >> mds: send snap related messages centrally during mds recovery > >> > >> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > >> clients centrally in MDCache::open_snaprealms() > >> > >> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > >> > >> Before this commit it will only send the 'split_realm' list for the > >> _SPLIT op. > > It sounds like we have the culprit. This should be treated as > > a regression and fixed on the MDS side. I don't see a justification > > for putting useless data on the wire. > > But we still need this patch to make it to work with the old ceph releases. This is debatable: - given that no one noticed this for so long, the likelihood of MDS sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and split_realms arrays is very low - MDS side fix would be backported to supported Ceph releases - people who are running unsupported Ceph releases (i.e. aren't updating) are unlikely to be diligently updating their kernel clients Thanks, Ilya
On 5/17/23 20:53, Venky Shankar wrote: > On Wed, May 17, 2023 at 6:18 PM Venky Shankar <vshankar@redhat.com> wrote: >> Hey Xiubo, >> >> On Wed, May 17, 2023 at 4:45 PM Xiubo Li <xiubli@redhat.com> wrote: >>> >>> On 5/17/23 19:04, Xiubo Li wrote: >>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>> >>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>> request may still contain a list of 'split_realms', and we need >>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>> >>>>>> Cc: stable@vger.kernel.org >>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>> --- >>>>>> fs/ceph/snap.c | 3 +++ >>>>>> 1 file changed, 3 insertions(+) >>>>>> >>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>> --- a/fs/ceph/snap.c >>>>>> +++ b/fs/ceph/snap.c >>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client >>>>>> *mdsc, >>>>>> continue; >>>>>> adjust_snap_realm_parent(mdsc, child, >>>>>> realm->ino); >>>>>> } >>>>>> + } else { >>>>>> + p += sizeof(u64) * num_split_inos; >>>>>> + p += sizeof(u64) * num_split_realms; >>>>>> } >>>>>> >>>>>> /* >>>>>> -- >>>>>> 2.40.1 >>>>>> >>>>> Hi Xiubo, >>>>> >>>>> This code appears to be very old -- it goes back to the initial commit >>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>> explanation for why this popped up only now? >>>> As I remembered we hit this before in one cu BZ last year, but I >>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>> saw this before I joint ceph team. >>>> >>> @Venky, >>> >>> Do you remember which one ? As I remembered this is why we fixed the >>> snaptrace issue by blocking all the IOs and at the same time >>> blocklisting the kclient before. >>> >>> Before the kcleint won't dump the corrupted msg and we don't know what >>> was wrong with the msg and also we added code to dump the msg in the >>> above fix. >> The "corrupted" snaptrace issue happened just after the mds asserted >> hitting the metadata corruption (dentry first corrupted) and it >> _seemed_ that this corruption somehow triggered a corrupted snaptrace >> to be sent to the client. > [sent message a bit early - cotd...] > > But I think you found the issue - the message dump did help and its > not related to the dentry first corruption. Yeah, this one is not related to dentry first corruption. @Ilya I have created one ceph PR to fix it in MDS side https://tracker.ceph.com/issues/61217 and https://github.com/ceph/ceph/pull/51536. Let's keep this kclient patch to make to be compatible with the old cephs. At least users could only update the kclient node to fix this issue. Thanks - Xiubo >>> Thanks >>> >>> - Xiubo >>> >>>>> Has MDS always been including split_inos and split_realms arrays in >>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>> change, I'd argue that this needs to be addressed on the MDS side. 
>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>> list just before the commit in 2017: >>>> >>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>> Author: Yan, Zheng <zyan@redhat.com> >>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>> >>>> mds: send snap related messages centrally during mds recovery >>>> >>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>> clients centrally in MDCache::open_snaprealms() >>>> >>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>> >>>> Before this commit it will only send the 'split_realm' list for the >>>> _SPLIT op. >>>> >>>> >>>> The following the snaptrace: >>>> >>>> [Wed May 10 16:03:06 2023] header: 00000000: 05 00 00 00 00 00 00 00 >>>> 00 00 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] header: 00000010: 12 03 7f 00 01 00 00 01 >>>> 00 00 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] header: 00000020: 00 00 00 00 02 01 00 00 >>>> 00 00 00 00 00 01 00 00 ................ >>>> [Wed May 10 16:03:06 2023] header: 00000030: 00 98 0d 60 >>>> 93 ...`. >>>> [Wed May 10 16:03:06 2023] front: 00000000: 00 00 00 00 00 00 00 00 >>>> 00 00 00 00 00 00 00 00 ................ <<== The op is 0, which is >>>> 'CEPH_SNAP_OP_UPDATE' >>>> [Wed May 10 16:03:06 2023] front: 00000010: 0c 00 00 00 88 00 00 00 >>>> d1 c0 71 38 00 01 00 00 ..........q8.... <<== The '0c' is >>>> the split_realm number >>>> [Wed May 10 16:03:06 2023] front: 00000020: 22 c8 71 38 00 01 00 00 >>>> d7 c7 71 38 00 01 00 00 ".q8......q8.... <<== All the 'q8' are >>>> the ino# >>>> [Wed May 10 16:03:06 2023] front: 00000030: d9 c7 71 38 00 01 00 00 >>>> d4 c7 71 38 00 01 00 00 ..q8......q8.... >>>> [Wed May 10 16:03:06 2023] front: 00000040: f1 c0 71 38 00 01 00 00 >>>> d4 c0 71 38 00 01 00 00 ..q8......q8.... >>>> [Wed May 10 16:03:06 2023] front: 00000050: 20 c8 71 38 00 01 00 00 >>>> 1d c8 71 38 00 01 00 00 .q8......q8.... >>>> [Wed May 10 16:03:06 2023] front: 00000060: ec c0 71 38 00 01 00 00 >>>> d6 c0 71 38 00 01 00 00 ..q8......q8.... >>>> [Wed May 10 16:03:06 2023] front: 00000070: ef c0 71 38 00 01 00 00 >>>> 6a 11 2d 1a 00 01 00 00 ..q8....j.-..... >>>> [Wed May 10 16:03:06 2023] front: 00000080: 01 00 00 00 00 00 00 00 >>>> 01 00 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 00000090: ee 01 00 00 00 00 00 00 >>>> 01 00 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 000000a0: 00 00 00 00 00 00 00 00 >>>> 01 00 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 000000b0: 01 09 00 00 00 00 00 00 >>>> 00 00 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 000000c0: 01 00 00 00 00 00 00 00 >>>> 02 09 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 000000d0: 05 00 00 00 00 00 00 00 >>>> 01 09 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 000000e0: ff 08 00 00 00 00 00 00 >>>> fd 08 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] front: 000000f0: fb 08 00 00 00 00 00 00 >>>> f9 08 00 00 00 00 00 00 ................ >>>> [Wed May 10 16:03:06 2023] footer: 00000000: ca 39 06 07 00 00 00 00 >>>> 00 00 00 00 42 06 63 61 .9..........B.ca >>>> [Wed May 10 16:03:06 2023] footer: 00000010: 7b 4b 5d 2d >>>> 05 {K]-. 
>>>> >>>> >>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + >>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace >>>> will be parsed by kclient too and kclient won't give any warning, but >>>> it will corrupted the snaprealm and capsnap info in kclient. >>>> >>>> >>>> Thanks >>>> >>>> - Xiubo >>>> >>>> >>>>> Thanks, >>>>> >>>>> Ilya >>>>> >> >> -- >> Cheers, >> Venky > >
On 5/17/23 21:11, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote: >> >> On 5/17/23 19:29, Ilya Dryomov wrote: >>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>> >>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>> request may still contain a list of 'split_realms', and we need >>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>> >>>>>> Cc: stable@vger.kernel.org >>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>> --- >>>>>> fs/ceph/snap.c | 3 +++ >>>>>> 1 file changed, 3 insertions(+) >>>>>> >>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>> --- a/fs/ceph/snap.c >>>>>> +++ b/fs/ceph/snap.c >>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>>>> continue; >>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>>>> } >>>>>> + } else { >>>>>> + p += sizeof(u64) * num_split_inos; >>>>>> + p += sizeof(u64) * num_split_realms; >>>>>> } >>>>>> >>>>>> /* >>>>>> -- >>>>>> 2.40.1 >>>>>> >>>>> Hi Xiubo, >>>>> >>>>> This code appears to be very old -- it goes back to the initial commit >>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>> explanation for why this popped up only now? >>>> As I remembered we hit this before in one cu BZ last year, but I >>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>> saw this before I joint ceph team. >>>> >>>> >>>>> Has MDS always been including split_inos and split_realms arrays in >>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>> change, I'd argue that this needs to be addressed on the MDS side. >>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>> list just before the commit in 2017: >>>> >>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>> Author: Yan, Zheng <zyan@redhat.com> >>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>> >>>> mds: send snap related messages centrally during mds recovery >>>> >>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>> clients centrally in MDCache::open_snaprealms() >>>> >>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>> >>>> Before this commit it will only send the 'split_realm' list for the >>>> _SPLIT op. >>> It sounds like we have the culprit. This should be treated as >>> a regression and fixed on the MDS side. I don't see a justification >>> for putting useless data on the wire. >> But we still need this patch to make it to work with the old ceph releases. > This is debatable: > > - given that no one noticed this for so long, the likelihood of MDS > sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and > split_realms arrays is very low > > - MDS side fix would be backported to supported Ceph releases > > - people who are running unsupported Ceph releases (i.e. aren't > updating) are unlikely to be diligently updating their kernel clients Yeah. While IMO usually upgrading the kernel will be safer and easier than upgrading the whole ceph cluster, and upgrading the ceph cluster may cause the fs metadatas corruption issue, which we have hit many time from CUs and ceph-user mail list. 
I'll leave this decision to you. If it still doesn't make sense, I will drop this patch. Thanks - Xiubo > Thanks, > > Ilya >
On 5/17/23 21:11, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote: >> >> On 5/17/23 19:29, Ilya Dryomov wrote: >>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>> >>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>> request may still contain a list of 'split_realms', and we need >>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>> >>>>>> Cc: stable@vger.kernel.org >>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>> --- >>>>>> fs/ceph/snap.c | 3 +++ >>>>>> 1 file changed, 3 insertions(+) >>>>>> >>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>> --- a/fs/ceph/snap.c >>>>>> +++ b/fs/ceph/snap.c >>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>>>> continue; >>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>>>> } >>>>>> + } else { >>>>>> + p += sizeof(u64) * num_split_inos; >>>>>> + p += sizeof(u64) * num_split_realms; >>>>>> } >>>>>> >>>>>> /* >>>>>> -- >>>>>> 2.40.1 >>>>>> >>>>> Hi Xiubo, >>>>> >>>>> This code appears to be very old -- it goes back to the initial commit >>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>> explanation for why this popped up only now? >>>> As I remembered we hit this before in one cu BZ last year, but I >>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>> saw this before I joint ceph team. >>>> >>>> >>>>> Has MDS always been including split_inos and split_realms arrays in >>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>> change, I'd argue that this needs to be addressed on the MDS side. >>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>> list just before the commit in 2017: >>>> >>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>> Author: Yan, Zheng <zyan@redhat.com> >>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>> >>>> mds: send snap related messages centrally during mds recovery >>>> >>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>> clients centrally in MDCache::open_snaprealms() >>>> >>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>> >>>> Before this commit it will only send the 'split_realm' list for the >>>> _SPLIT op. >>> It sounds like we have the culprit. This should be treated as >>> a regression and fixed on the MDS side. I don't see a justification >>> for putting useless data on the wire. >> But we still need this patch to make it to work with the old ceph releases. > This is debatable: > > - given that no one noticed this for so long, the likelihood of MDS > sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and > split_realms arrays is very low > > - MDS side fix would be backported to supported Ceph releases > > - people who are running unsupported Ceph releases (i.e. 
aren't > updating) are unlikely to be diligently updating their kernel clients I just searched the ceph tracker and found another 3 trackers with the same issue: https://tracker.ceph.com/issues/57817 https://tracker.ceph.com/issues/57703 https://tracker.ceph.com/issues/57686 So, counting this occurrence and the previous CU case: https://www.spinics.net/lists/ceph-users/msg77106.html I have seen this at least 5 times. All of them were reproduced when doing MDS failover, and the root cause is on the MDS side. Thanks - Xiubo > Thanks, > > Ilya >
On Wed, May 17, 2023 at 3:33 PM Xiubo Li <xiubli@redhat.com> wrote: > > > On 5/17/23 21:11, Ilya Dryomov wrote: > > On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote: > >> > >> On 5/17/23 19:29, Ilya Dryomov wrote: > >>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > >>>> On 5/17/23 18:31, Ilya Dryomov wrote: > >>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > >>>>>> From: Xiubo Li <xiubli@redhat.com> > >>>>>> > >>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > >>>>>> request may still contain a list of 'split_realms', and we need > >>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > >>>>>> > >>>>>> Cc: stable@vger.kernel.org > >>>>>> Cc: Frank Schilder <frans@dtu.dk> > >>>>>> Reported-by: Frank Schilder <frans@dtu.dk> > >>>>>> URL: https://tracker.ceph.com/issues/61200 > >>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> > >>>>>> --- > >>>>>> fs/ceph/snap.c | 3 +++ > >>>>>> 1 file changed, 3 insertions(+) > >>>>>> > >>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > >>>>>> index 0e59e95a96d9..d95dfe16b624 100644 > >>>>>> --- a/fs/ceph/snap.c > >>>>>> +++ b/fs/ceph/snap.c > >>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > >>>>>> continue; > >>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); > >>>>>> } > >>>>>> + } else { > >>>>>> + p += sizeof(u64) * num_split_inos; > >>>>>> + p += sizeof(u64) * num_split_realms; > >>>>>> } > >>>>>> > >>>>>> /* > >>>>>> -- > >>>>>> 2.40.1 > >>>>>> > >>>>> Hi Xiubo, > >>>>> > >>>>> This code appears to be very old -- it goes back to the initial commit > >>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > >>>>> explanation for why this popped up only now? > >>>> As I remembered we hit this before in one cu BZ last year, but I > >>>> couldn't remember exactly which one. But I am not sure whether @Jeff > >>>> saw this before I joint ceph team. > >>>> > >>>> > >>>>> Has MDS always been including split_inos and split_realms arrays in > >>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > >>>>> change, I'd argue that this needs to be addressed on the MDS side. > >>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' > >>>> list just before the commit in 2017: > >>>> > >>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > >>>> Author: Yan, Zheng <zyan@redhat.com> > >>>> Date: Fri Jul 21 21:40:46 2017 +0800 > >>>> > >>>> mds: send snap related messages centrally during mds recovery > >>>> > >>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > >>>> clients centrally in MDCache::open_snaprealms() > >>>> > >>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > >>>> > >>>> Before this commit it will only send the 'split_realm' list for the > >>>> _SPLIT op. > >>> It sounds like we have the culprit. This should be treated as > >>> a regression and fixed on the MDS side. I don't see a justification > >>> for putting useless data on the wire. > >> But we still need this patch to make it to work with the old ceph releases. > > This is debatable: > > > > - given that no one noticed this for so long, the likelihood of MDS > > sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and > > split_realms arrays is very low > > > > - MDS side fix would be backported to supported Ceph releases > > > > - people who are running unsupported Ceph releases (i.e. 
aren't > > updating) are unlikely to be diligently updating their kernel clients > > Just searched the ceph tracker and I found another 3 trackers have the > same issue: > > https://tracker.ceph.com/issues/57817 > https://tracker.ceph.com/issues/57703 > https://tracker.ceph.com/issues/57686 > > So plusing this time and the previous CU case: > > https://www.spinics.net/lists/ceph-users/msg77106.html > > I have seen at least 5 times. > > All this are reproduced when doing MDS failover, and this is the root > cause in MDS side. OK, given that the fixup in the kernel client is small, it seems justified. But, please, add a comment in the new else branch saying that it's there only to work around a bug in the MDS. Thanks, Ilya
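One possible shape for that else branch with the requested comment is sketched below. This is only an illustration, not the final patch; the wording, and whether to cite the MDS-side tracker https://tracker.ceph.com/issues/61217 mentioned earlier in the thread, is up to the patch author.

```c
	} else {
		/*
		 * Workaround for an MDS bug: non-split ops may still carry
		 * the split_inos[] and split_realms[] arrays on the wire
		 * (see https://tracker.ceph.com/issues/61217), so skip them
		 * before parsing the snap trace.
		 */
		p += sizeof(u64) * num_split_inos;
		p += sizeof(u64) * num_split_realms;
	}
```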
On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > > > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > >> From: Xiubo Li <xiubli@redhat.com> > > >> > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > > >> request may still contain a list of 'split_realms', and we need > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > >> > > >> Cc: stable@vger.kernel.org > > >> Cc: Frank Schilder <frans@dtu.dk> > > >> Reported-by: Frank Schilder <frans@dtu.dk> > > >> URL: https://tracker.ceph.com/issues/61200 > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com> > > >> --- > > >> fs/ceph/snap.c | 3 +++ > > >> 1 file changed, 3 insertions(+) > > >> > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > > >> index 0e59e95a96d9..d95dfe16b624 100644 > > >> --- a/fs/ceph/snap.c > > >> +++ b/fs/ceph/snap.c > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > > >> continue; > > >> adjust_snap_realm_parent(mdsc, child, realm->ino); > > >> } > > >> + } else { > > >> + p += sizeof(u64) * num_split_inos; > > >> + p += sizeof(u64) * num_split_realms; > > >> } > > >> > > >> /* > > >> -- > > >> 2.40.1 > > >> > > > Hi Xiubo, > > > > > > This code appears to be very old -- it goes back to the initial commit > > > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > > explanation for why this popped up only now? > > > > As I remembered we hit this before in one cu BZ last year, but I > > couldn't remember exactly which one. But I am not sure whether @Jeff > > saw this before I joint ceph team. > > > > > > > Has MDS always been including split_inos and split_realms arrays in > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > > change, I'd argue that this needs to be addressed on the MDS side. > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > list just before the commit in 2017: > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > Author: Yan, Zheng <zyan@redhat.com> > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > mds: send snap related messages centrally during mds recovery > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > clients centrally in MDCache::open_snaprealms() > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > Before this commit it will only send the 'split_realm' list for the > > _SPLIT op. > > It sounds like we have the culprit. This should be treated as > a regression and fixed on the MDS side. I don't see a justification > for putting useless data on the wire. I don't really understand this viewpoint. We can treat it as an MDS regression if we want, but it's a six-year-old patch so this is in nearly every version of server code anybody's running. Why wouldn't we fix it on both sides? On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > will be parsed by kclient too and kclient won't give any warning, but it > will corrupted the snaprealm and capsnap info in kclient. I'm a bit confused about this patch, but I also don't follow the kernel client code much so please forgive my ignorance. 
The change you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT case, so clearly the kclient *mostly* does the right thing when these "unexpected" SPLIT ops show up. (Which leads me to assume Zheng checked that the referenced server patch would work, and he was mostly right.) So what logic makes it so "split_realm == sizeof(ceph_mds_snap_realm) + <extra snap buffer size>" is the condition which leads to breakage, and allows everything else to work? -Greg
On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > > > > > > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > > >> From: Xiubo Li <xiubli@redhat.com> > > > >> > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > > > >> request may still contain a list of 'split_realms', and we need > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > > >> > > > >> Cc: stable@vger.kernel.org > > > >> Cc: Frank Schilder <frans@dtu.dk> > > > >> Reported-by: Frank Schilder <frans@dtu.dk> > > > >> URL: https://tracker.ceph.com/issues/61200 > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com> > > > >> --- > > > >> fs/ceph/snap.c | 3 +++ > > > >> 1 file changed, 3 insertions(+) > > > >> > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > > > >> index 0e59e95a96d9..d95dfe16b624 100644 > > > >> --- a/fs/ceph/snap.c > > > >> +++ b/fs/ceph/snap.c > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > > > >> continue; > > > >> adjust_snap_realm_parent(mdsc, child, realm->ino); > > > >> } > > > >> + } else { > > > >> + p += sizeof(u64) * num_split_inos; > > > >> + p += sizeof(u64) * num_split_realms; > > > >> } > > > >> > > > >> /* > > > >> -- > > > >> 2.40.1 > > > >> > > > > Hi Xiubo, > > > > > > > > This code appears to be very old -- it goes back to the initial commit > > > > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > > > explanation for why this popped up only now? > > > > > > As I remembered we hit this before in one cu BZ last year, but I > > > couldn't remember exactly which one. But I am not sure whether @Jeff > > > saw this before I joint ceph team. > > > > > > > > > > Has MDS always been including split_inos and split_realms arrays in > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > > > change, I'd argue that this needs to be addressed on the MDS side. > > > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > > list just before the commit in 2017: > > > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > > Author: Yan, Zheng <zyan@redhat.com> > > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > > > mds: send snap related messages centrally during mds recovery > > > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > > clients centrally in MDCache::open_snaprealms() > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > > > Before this commit it will only send the 'split_realm' list for the > > > _SPLIT op. > > > > It sounds like we have the culprit. This should be treated as > > a regression and fixed on the MDS side. I don't see a justification > > for putting useless data on the wire. > > I don't really understand this viewpoint. We can treat it as an MDS > regression if we want, but it's a six-year-old patch so this is in > nearly every version of server code anybody's running. Why wouldn't we > fix it on both sides? Well, if I didn't speak up chances are we wouldn't have identified the regression in the MDS at all. People seem to have this perception that the client is somehow "easier" to fix, assume that the server is always doing the right thing and default to patching the client. I'm just trying to push back on that. 
In this particular case, after understanding the scope of the issue _and_ getting a committal for the MDS side fix, I approved taking the kernel client patch in an earlier reply. > > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > > will be parsed by kclient too and kclient won't give any warning, but it > > will corrupted the snaprealm and capsnap info in kclient. > > I'm a bit confused about this patch, but I also don't follow the > kernel client code much so please forgive my ignorance. The change > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT > case, so clearly the kclient *mostly* does the right thing when these No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" branch. Thanks, Ilya
On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > > > > > > > > > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > > > >> From: Xiubo Li <xiubli@redhat.com> > > > > >> > > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > > > > >> request may still contain a list of 'split_realms', and we need > > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > > > >> > > > > >> Cc: stable@vger.kernel.org > > > > >> Cc: Frank Schilder <frans@dtu.dk> > > > > >> Reported-by: Frank Schilder <frans@dtu.dk> > > > > >> URL: https://tracker.ceph.com/issues/61200 > > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com> > > > > >> --- > > > > >> fs/ceph/snap.c | 3 +++ > > > > >> 1 file changed, 3 insertions(+) > > > > >> > > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > > > > >> index 0e59e95a96d9..d95dfe16b624 100644 > > > > >> --- a/fs/ceph/snap.c > > > > >> +++ b/fs/ceph/snap.c > > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > > > > >> continue; > > > > >> adjust_snap_realm_parent(mdsc, child, realm->ino); > > > > >> } > > > > >> + } else { > > > > >> + p += sizeof(u64) * num_split_inos; > > > > >> + p += sizeof(u64) * num_split_realms; > > > > >> } > > > > >> > > > > >> /* > > > > >> -- > > > > >> 2.40.1 > > > > >> > > > > > Hi Xiubo, > > > > > > > > > > This code appears to be very old -- it goes back to the initial commit > > > > > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > > > > explanation for why this popped up only now? > > > > > > > > As I remembered we hit this before in one cu BZ last year, but I > > > > couldn't remember exactly which one. But I am not sure whether @Jeff > > > > saw this before I joint ceph team. > > > > > > > > > > > > > Has MDS always been including split_inos and split_realms arrays in > > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > > > > change, I'd argue that this needs to be addressed on the MDS side. > > > > > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > > > list just before the commit in 2017: > > > > > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > > > Author: Yan, Zheng <zyan@redhat.com> > > > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > > > > > mds: send snap related messages centrally during mds recovery > > > > > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > > > clients centrally in MDCache::open_snaprealms() > > > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > > > > > Before this commit it will only send the 'split_realm' list for the > > > > _SPLIT op. > > > > > > It sounds like we have the culprit. This should be treated as > > > a regression and fixed on the MDS side. I don't see a justification > > > for putting useless data on the wire. > > > > I don't really understand this viewpoint. We can treat it as an MDS > > regression if we want, but it's a six-year-old patch so this is in > > nearly every version of server code anybody's running. Why wouldn't we > > fix it on both sides? > > Well, if I didn't speak up chances are we wouldn't have identified the > regression in the MDS at all. 
People seem to have this perception that > the client is somehow "easier" to fix, assume that the server is always > doing the right thing and default to patching the client. I'm just > trying to push back on that. > > In this particular case, after understanding the scope of the issue > _and_ getting a committal for the MDS side fix, I approved taking the > kernel client patch in an earlier reply. > > > > > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > > > will be parsed by kclient too and kclient won't give any warning, but it > > > will corrupted the snaprealm and capsnap info in kclient. > > > > I'm a bit confused about this patch, but I also don't follow the > > kernel client code much so please forgive my ignorance. The change > > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT > > case, so clearly the kclient *mostly* does the right thing when these > > No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" > branch. Oh I mis-parsed the braces/spacing there. I'm still not getting how the precise size is causing the problem — obviously this isn't an unheard-of category of issue, but the fact that it works until the count matches a magic number is odd. Is that ceph_decode_need macro being called from ceph_update_snap_trace just skipping over the split data somehow? *puzzled* -Greg > > Thanks, > > Ilya >
Just to be clear, I'd like the details here so we can see if there are ways to prevent similar issues in the future, which I haven't heard anybody talk about. :) On Wed, May 17, 2023 at 8:03 AM Gregory Farnum <gfarnum@redhat.com> wrote: > > On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > > > > > > > > > > > > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > > > > >> From: Xiubo Li <xiubli@redhat.com> > > > > > >> > > > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > > > > > >> request may still contain a list of 'split_realms', and we need > > > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > > > > >> > > > > > >> Cc: stable@vger.kernel.org > > > > > >> Cc: Frank Schilder <frans@dtu.dk> > > > > > >> Reported-by: Frank Schilder <frans@dtu.dk> > > > > > >> URL: https://tracker.ceph.com/issues/61200 > > > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com> > > > > > >> --- > > > > > >> fs/ceph/snap.c | 3 +++ > > > > > >> 1 file changed, 3 insertions(+) > > > > > >> > > > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > > > > > >> index 0e59e95a96d9..d95dfe16b624 100644 > > > > > >> --- a/fs/ceph/snap.c > > > > > >> +++ b/fs/ceph/snap.c > > > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > > > > > >> continue; > > > > > >> adjust_snap_realm_parent(mdsc, child, realm->ino); > > > > > >> } > > > > > >> + } else { > > > > > >> + p += sizeof(u64) * num_split_inos; > > > > > >> + p += sizeof(u64) * num_split_realms; > > > > > >> } > > > > > >> > > > > > >> /* > > > > > >> -- > > > > > >> 2.40.1 > > > > > >> > > > > > > Hi Xiubo, > > > > > > > > > > > > This code appears to be very old -- it goes back to the initial commit > > > > > > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > > > > > explanation for why this popped up only now? > > > > > > > > > > As I remembered we hit this before in one cu BZ last year, but I > > > > > couldn't remember exactly which one. But I am not sure whether @Jeff > > > > > saw this before I joint ceph team. > > > > > > > > > > > > > > > > Has MDS always been including split_inos and split_realms arrays in > > > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > > > > > change, I'd argue that this needs to be addressed on the MDS side. > > > > > > > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > > > > list just before the commit in 2017: > > > > > > > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > > > > Author: Yan, Zheng <zyan@redhat.com> > > > > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > > > > > > > mds: send snap related messages centrally during mds recovery > > > > > > > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > > > > clients centrally in MDCache::open_snaprealms() > > > > > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > > > > > > > Before this commit it will only send the 'split_realm' list for the > > > > > _SPLIT op. > > > > > > > > It sounds like we have the culprit. This should be treated as > > > > a regression and fixed on the MDS side. 
I don't see a justification > > > > for putting useless data on the wire. > > > > > > I don't really understand this viewpoint. We can treat it as an MDS > > > regression if we want, but it's a six-year-old patch so this is in > > > nearly every version of server code anybody's running. Why wouldn't we > > > fix it on both sides? > > > > Well, if I didn't speak up chances are we wouldn't have identified the > > regression in the MDS at all. People seem to have this perception that > > the client is somehow "easier" to fix, assume that the server is always > > doing the right thing and default to patching the client. I'm just > > trying to push back on that. > > > > In this particular case, after understanding the scope of the issue > > _and_ getting a committal for the MDS side fix, I approved taking the > > kernel client patch in an earlier reply. > > > > > > > > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: > > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > > > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > > > > will be parsed by kclient too and kclient won't give any warning, but it > > > > will corrupted the snaprealm and capsnap info in kclient. > > > > > > I'm a bit confused about this patch, but I also don't follow the > > > kernel client code much so please forgive my ignorance. The change > > > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT > > > case, so clearly the kclient *mostly* does the right thing when these > > > > No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" > > branch. > > Oh I mis-parsed the braces/spacing there. > > I'm still not getting how the precise size is causing the problem — > obviously this isn't an unheard-of category of issue, but the fact > that it works until the count matches a magic number is odd. Is that > ceph_decode_need macro being called from ceph_update_snap_trace just > skipping over the split data somehow? *puzzled* > -Greg > > > > > Thanks, > > > > Ilya > >
Hi all, joining in here as the one who is hit pretty badly and is also not able to upgrade ceph any time soon to a version receiving patches. For these two reasons alone I strongly favour fixing both sides. Extra reading, feel free to skip. Additional reasons for fixing both sides are (1) to have more error tolerant code - if one side breaks/regresses the other side still knows what to do and can report back while moving on without a fatal crash and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice. I personally think that (1) is generally good practice; explicitly handling seemingly unexpected cases increases overall robustness (it's a bit like raiding up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or follow-up crash for no real reason. The main reason we actually discovered this is that under certain rare circumstances it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is quite possible that many if not all users have these invalid snaptrace messages on their system, but nothing else happens so they don't report anything. The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS, and then we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their system as often as we do. Apart from that, it's not even certain that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a ceph upgrade - extra reason no. 3. I hope this was a useful point of view from someone suffering from the condition. Best regards and thanks for your efforts addressing this! ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14
On Wed, May 17, 2023 at 5:03 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: > > > > > > On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: > > > > > > > > On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: > > > > > > > > > > > > > > > On 5/17/23 18:31, Ilya Dryomov wrote: > > > > > > On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: > > > > > >> From: Xiubo Li <xiubli@redhat.com> > > > > > >> > > > > > >> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the > > > > > >> request may still contain a list of 'split_realms', and we need > > > > > >> to skip it anyway. Or it will be parsed as a corrupt snaptrace. > > > > > >> > > > > > >> Cc: stable@vger.kernel.org > > > > > >> Cc: Frank Schilder <frans@dtu.dk> > > > > > >> Reported-by: Frank Schilder <frans@dtu.dk> > > > > > >> URL: https://tracker.ceph.com/issues/61200 > > > > > >> Signed-off-by: Xiubo Li <xiubli@redhat.com> > > > > > >> --- > > > > > >> fs/ceph/snap.c | 3 +++ > > > > > >> 1 file changed, 3 insertions(+) > > > > > >> > > > > > >> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c > > > > > >> index 0e59e95a96d9..d95dfe16b624 100644 > > > > > >> --- a/fs/ceph/snap.c > > > > > >> +++ b/fs/ceph/snap.c > > > > > >> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, > > > > > >> continue; > > > > > >> adjust_snap_realm_parent(mdsc, child, realm->ino); > > > > > >> } > > > > > >> + } else { > > > > > >> + p += sizeof(u64) * num_split_inos; > > > > > >> + p += sizeof(u64) * num_split_realms; > > > > > >> } > > > > > >> > > > > > >> /* > > > > > >> -- > > > > > >> 2.40.1 > > > > > >> > > > > > > Hi Xiubo, > > > > > > > > > > > > This code appears to be very old -- it goes back to the initial commit > > > > > > 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an > > > > > > explanation for why this popped up only now? > > > > > > > > > > As I remembered we hit this before in one cu BZ last year, but I > > > > > couldn't remember exactly which one. But I am not sure whether @Jeff > > > > > saw this before I joint ceph team. > > > > > > > > > > > > > > > > Has MDS always been including split_inos and split_realms arrays in > > > > > > !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent > > > > > > change, I'd argue that this needs to be addressed on the MDS side. > > > > > > > > > > While in MDS side for the _UPDATE op it won't send the 'split_realm' > > > > > list just before the commit in 2017: > > > > > > > > > > commit 93e7267757508520dfc22cff1ab20558bd4a44d4 > > > > > Author: Yan, Zheng <zyan@redhat.com> > > > > > Date: Fri Jul 21 21:40:46 2017 +0800 > > > > > > > > > > mds: send snap related messages centrally during mds recovery > > > > > > > > > > sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to > > > > > clients centrally in MDCache::open_snaprealms() > > > > > > > > > > Signed-off-by: "Yan, Zheng" <zyan@redhat.com> > > > > > > > > > > Before this commit it will only send the 'split_realm' list for the > > > > > _SPLIT op. > > > > > > > > It sounds like we have the culprit. This should be treated as > > > > a regression and fixed on the MDS side. I don't see a justification > > > > for putting useless data on the wire. > > > > > > I don't really understand this viewpoint. 
We can treat it as an MDS > > > regression if we want, but it's a six-year-old patch so this is in > > > nearly every version of server code anybody's running. Why wouldn't we > > > fix it on both sides? > > > > Well, if I didn't speak up chances are we wouldn't have identified the > > regression in the MDS at all. People seem to have this perception that > > the client is somehow "easier" to fix, assume that the server is always > > doing the right thing and default to patching the client. I'm just > > trying to push back on that. > > > > In this particular case, after understanding the scope of the issue > > _and_ getting a committal for the MDS side fix, I approved taking the > > kernel client patch in an earlier reply. > > > > > > > > On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: > > > > And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + > > > > extra snap buffer size by coincidence, the above 'corrupted' snaptrace > > > > will be parsed by kclient too and kclient won't give any warning, but it > > > > will corrupted the snaprealm and capsnap info in kclient. > > > > > > I'm a bit confused about this patch, but I also don't follow the > > > kernel client code much so please forgive my ignorance. The change > > > you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT > > > case, so clearly the kclient *mostly* does the right thing when these > > > > No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" > > branch. > > Oh I mis-parsed the braces/spacing there. > > I'm still not getting how the precise size is causing the problem — > obviously this isn't an unheard-of category of issue, but the fact > that it works until the count matches a magic number is odd. Is that > ceph_decode_need macro being called from ceph_update_snap_trace just > skipping over the split data somehow? *puzzled* I'm not sure what count or magic number you are referring to. AFAIU the issue is that ceph_update_snap_trace() expects to be called with "p" pointing at ceph_mds_snap_realm encoding. In the case that is being fixed it gets called with "p" pointing at "split_inos" array (and "split_realms" array following that). The exact check that throws it off (whether ceph_decode_need() or something else) is irrelevant -- we are already toasted by then. In the worst case no check could fire and ceph_update_snap_trace() could interpret the bogus array data as a valid ceph_mds_snap_realm with who knows what consequences... Thanks, Ilya
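To make that failure mode concrete, here is a small user-space toy (not the fs/ceph/snap.c code; struct fake_realm, the field sizes and the values are invented for illustration) showing how the same bytes decode as garbage when "p" is left pointing at the split arrays, and correctly once they are skipped the way the patch does:

/* toy_skip.c -- NOT kernel code: a user-space toy that mimics the layout
 * Ilya describes.  "fake_realm" and the sizes are invented. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct fake_realm {		/* stand-in for ceph_mds_snap_realm */
	uint64_t ino;
	uint64_t seq;
};

int main(void)
{
	uint32_t num_split_inos = 2, num_split_realms = 1;
	uint64_t split[3] = { 0x10000000001, 0x10000000002, 0x10000000003 };
	struct fake_realm realm = { .ino = 0x1, .seq = 42 };
	unsigned char msg[sizeof(split) + sizeof(realm)];
	unsigned char *p = msg;
	struct fake_realm out;

	/* payload layout in the non-split message: split_inos[],
	 * split_realms[], then the snap trace (here one fake realm) */
	memcpy(msg, split, sizeof(split));
	memcpy(msg + sizeof(split), &realm, sizeof(realm));

	/* without the skip: the split arrays are decoded as a "realm" */
	memcpy(&out, p, sizeof(out));
	printf("no skip: ino=%#llx seq=%llu  <- garbage\n",
	       (unsigned long long)out.ino, (unsigned long long)out.seq);

	/* with the skip from the patch: p lands on the real record */
	p += sizeof(uint64_t) * num_split_inos;
	p += sizeof(uint64_t) * num_split_realms;
	memcpy(&out, p, sizeof(out));
	printf("skipped: ino=%#llx seq=%llu\n",
	       (unsigned long long)out.ino, (unsigned long long)out.seq);
	return 0;
}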
On 5/17/23 21:56, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 3:33 PM Xiubo Li <xiubli@redhat.com> wrote: >> >> On 5/17/23 21:11, Ilya Dryomov wrote: >>> On Wed, May 17, 2023 at 2:46 PM Xiubo Li <xiubli@redhat.com> wrote: >>>> On 5/17/23 19:29, Ilya Dryomov wrote: >>>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >>>>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>>>> >>>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>>>> request may still contain a list of 'split_realms', and we need >>>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>>>> >>>>>>>> Cc: stable@vger.kernel.org >>>>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>>>> --- >>>>>>>> fs/ceph/snap.c | 3 +++ >>>>>>>> 1 file changed, 3 insertions(+) >>>>>>>> >>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>>>> --- a/fs/ceph/snap.c >>>>>>>> +++ b/fs/ceph/snap.c >>>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>>>>>> continue; >>>>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>>>>>> } >>>>>>>> + } else { >>>>>>>> + p += sizeof(u64) * num_split_inos; >>>>>>>> + p += sizeof(u64) * num_split_realms; >>>>>>>> } >>>>>>>> >>>>>>>> /* >>>>>>>> -- >>>>>>>> 2.40.1 >>>>>>>> >>>>>>> Hi Xiubo, >>>>>>> >>>>>>> This code appears to be very old -- it goes back to the initial commit >>>>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>>>> explanation for why this popped up only now? >>>>>> As I remembered we hit this before in one cu BZ last year, but I >>>>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>>>> saw this before I joint ceph team. >>>>>> >>>>>> >>>>>>> Has MDS always been including split_inos and split_realms arrays in >>>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>>>> change, I'd argue that this needs to be addressed on the MDS side. >>>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>>>> list just before the commit in 2017: >>>>>> >>>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>>>> Author: Yan, Zheng <zyan@redhat.com> >>>>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>>>> >>>>>> mds: send snap related messages centrally during mds recovery >>>>>> >>>>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>>>> clients centrally in MDCache::open_snaprealms() >>>>>> >>>>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>>>> >>>>>> Before this commit it will only send the 'split_realm' list for the >>>>>> _SPLIT op. >>>>> It sounds like we have the culprit. This should be treated as >>>>> a regression and fixed on the MDS side. I don't see a justification >>>>> for putting useless data on the wire. >>>> But we still need this patch to make it to work with the old ceph releases. >>> This is debatable: >>> >>> - given that no one noticed this for so long, the likelihood of MDS >>> sending a CEPH_SNAP_OP_UPDATE message with bogus split_inos and >>> split_realms arrays is very low >>> >>> - MDS side fix would be backported to supported Ceph releases >>> >>> - people who are running unsupported Ceph releases (i.e. 
aren't >>> updating) are unlikely to be diligently updating their kernel clients >> Just searched the ceph tracker and I found another 3 trackers have the >> same issue: >> >> https://tracker.ceph.com/issues/57817 >> https://tracker.ceph.com/issues/57703 >> https://tracker.ceph.com/issues/57686 >> >> So plusing this time and the previous CU case: >> >> https://www.spinics.net/lists/ceph-users/msg77106.html >> >> I have seen at least 5 times. >> >> All this are reproduced when doing MDS failover, and this is the root >> cause in MDS side. > OK, given that the fixup in the kernel client is small, it seems > justified. But, please, add a comment in the new else branch saying > that it's there only to work around a bug in the MDS. Sure. Will update it in next version. Thanks - Xiubo > Thanks, > > Ilya >
On 5/17/23 22:27, Ilya Dryomov wrote: > On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: >> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: >>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >>>> >>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>> >>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>> request may still contain a list of 'split_realms', and we need >>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>> >>>>>> Cc: stable@vger.kernel.org >>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>> --- >>>>>> fs/ceph/snap.c | 3 +++ >>>>>> 1 file changed, 3 insertions(+) >>>>>> >>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>> --- a/fs/ceph/snap.c >>>>>> +++ b/fs/ceph/snap.c >>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>>>> continue; >>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>>>> } >>>>>> + } else { >>>>>> + p += sizeof(u64) * num_split_inos; >>>>>> + p += sizeof(u64) * num_split_realms; >>>>>> } >>>>>> >>>>>> /* >>>>>> -- >>>>>> 2.40.1 >>>>>> >>>>> Hi Xiubo, >>>>> >>>>> This code appears to be very old -- it goes back to the initial commit >>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>> explanation for why this popped up only now? >>>> As I remembered we hit this before in one cu BZ last year, but I >>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>> saw this before I joint ceph team. >>>> >>>> >>>>> Has MDS always been including split_inos and split_realms arrays in >>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>> change, I'd argue that this needs to be addressed on the MDS side. >>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>> list just before the commit in 2017: >>>> >>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>> Author: Yan, Zheng <zyan@redhat.com> >>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>> >>>> mds: send snap related messages centrally during mds recovery >>>> >>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>> clients centrally in MDCache::open_snaprealms() >>>> >>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>> >>>> Before this commit it will only send the 'split_realm' list for the >>>> _SPLIT op. >>> It sounds like we have the culprit. This should be treated as >>> a regression and fixed on the MDS side. I don't see a justification >>> for putting useless data on the wire. >> I don't really understand this viewpoint. We can treat it as an MDS >> regression if we want, but it's a six-year-old patch so this is in >> nearly every version of server code anybody's running. Why wouldn't we >> fix it on both sides? > Well, if I didn't speak up chances are we wouldn't have identified the > regression in the MDS at all. People seem to have this perception that > the client is somehow "easier" to fix, assume that the server is always > doing the right thing and default to patching the client. I'm just > trying to push back on that. 
> > In this particular case, after understanding the scope of the issue > _and_ getting a committal for the MDS side fix, I approved taking the > kernel client patch in an earlier reply. > >> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: >>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + >>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace >>> will be parsed by kclient too and kclient won't give any warning, but it >>> will corrupted the snaprealm and capsnap info in kclient. >> I'm a bit confused about this patch, but I also don't follow the >> kernel client code much so please forgive my ignorance. The change >> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT >> case, so clearly the kclient *mostly* does the right thing when these > No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" > branch. Yeah, correct. The problem is that not all of the UPDATE cases in the MDS encode and pass the 'split_realms' list to clients via the MClientSnap requests. Reading Zheng's commit, I think he just copied that code from the old SPLIT case into one UPDATE case and didn't notice it would introduce this issue in the kclient. Sending 'split_realms' is unnecessary and useless on the client side for the UPDATE case, so we should fix it on the MDS side too. Thanks - Xiubo > Thanks, > > Ilya >
On 5/17/23 23:03, Gregory Farnum wrote: > On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote: >> On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: >>> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: >>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >>>>> >>>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>>> >>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>>> request may still contain a list of 'split_realms', and we need >>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>>> >>>>>>> Cc: stable@vger.kernel.org >>>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>>> --- >>>>>>> fs/ceph/snap.c | 3 +++ >>>>>>> 1 file changed, 3 insertions(+) >>>>>>> >>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>>> --- a/fs/ceph/snap.c >>>>>>> +++ b/fs/ceph/snap.c >>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>>>>> continue; >>>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>>>>> } >>>>>>> + } else { >>>>>>> + p += sizeof(u64) * num_split_inos; >>>>>>> + p += sizeof(u64) * num_split_realms; >>>>>>> } >>>>>>> >>>>>>> /* >>>>>>> -- >>>>>>> 2.40.1 >>>>>>> >>>>>> Hi Xiubo, >>>>>> >>>>>> This code appears to be very old -- it goes back to the initial commit >>>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>>> explanation for why this popped up only now? >>>>> As I remembered we hit this before in one cu BZ last year, but I >>>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>>> saw this before I joint ceph team. >>>>> >>>>> >>>>>> Has MDS always been including split_inos and split_realms arrays in >>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>>> change, I'd argue that this needs to be addressed on the MDS side. >>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>>> list just before the commit in 2017: >>>>> >>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>>> Author: Yan, Zheng <zyan@redhat.com> >>>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>>> >>>>> mds: send snap related messages centrally during mds recovery >>>>> >>>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>>> clients centrally in MDCache::open_snaprealms() >>>>> >>>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>>> >>>>> Before this commit it will only send the 'split_realm' list for the >>>>> _SPLIT op. >>>> It sounds like we have the culprit. This should be treated as >>>> a regression and fixed on the MDS side. I don't see a justification >>>> for putting useless data on the wire. >>> I don't really understand this viewpoint. We can treat it as an MDS >>> regression if we want, but it's a six-year-old patch so this is in >>> nearly every version of server code anybody's running. Why wouldn't we >>> fix it on both sides? >> Well, if I didn't speak up chances are we wouldn't have identified the >> regression in the MDS at all. People seem to have this perception that >> the client is somehow "easier" to fix, assume that the server is always >> doing the right thing and default to patching the client. 
I'm just >> trying to push back on that. >> >> In this particular case, after understanding the scope of the issue >> _and_ getting a committal for the MDS side fix, I approved taking the >> kernel client patch in an earlier reply. >> >>> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: >>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + >>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace >>>> will be parsed by kclient too and kclient won't give any warning, but it >>>> will corrupted the snaprealm and capsnap info in kclient. >>> I'm a bit confused about this patch, but I also don't follow the >>> kernel client code much so please forgive my ignorance. The change >>> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT >>> case, so clearly the kclient *mostly* does the right thing when these >> No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" >> branch. > Oh I mis-parsed the braces/spacing there. > > I'm still not getting how the precise size is causing the problem — > obviously this isn't an unheard-of category of issue, but the fact > that it works until the count matches a magic number is odd. Is that > ceph_decode_need macro being called from ceph_update_snap_trace just > skipping over the split data somehow? *puzzled* Yeah, it's called and this is where the corrupted snaptrace is reported. ceph_update_snap_trace() parses the snaptraces in a loop, and if there are still extra bytes left it will try to parse them as another snaptrace; but those extra bytes don't contain enough data for a valid record, so it just reports the message as corrupt. Thanks - Xiubo > -Greg > >> Thanks, >> >> Ilya >>
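A hedged sketch of that loop behaviour; the record size and helper below are made up, but it shows why leftover split-array bytes of the "wrong" size get reported as a corrupt snaptrace, while leftovers that happen to line up with a record size would be decoded silently (the worse case Xiubo warned about earlier):

/* decode_loop.c -- illustration only; not ceph_update_snap_trace(). */
#include <stdio.h>

#define REALM_RECORD_SIZE 16	/* stand-in for the fixed realm header */

static void decode_trace(int bytes_left)
{
	while (bytes_left > 0) {
		if (bytes_left < REALM_RECORD_SIZE) {
			/* mirrors ceph_decode_need() running out of data */
			printf("  %d bytes left -> corrupt snaptrace\n",
			       bytes_left);
			return;
		}
		bytes_left -= REALM_RECORD_SIZE;	/* consume one record */
	}
	printf("  decoded cleanly (any stray records went unnoticed)\n");
}

int main(void)
{
	printf("trace plus 12 stray bytes:\n");
	decode_trace(2 * REALM_RECORD_SIZE + 12);	/* flagged as corrupt */

	printf("trace plus 32 stray bytes:\n");
	decode_trace(2 * REALM_RECORD_SIZE + 32);	/* silently misparsed */
	return 0;
}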
On 5/17/23 23:04, Gregory Farnum wrote: > Just to be clear, I'd like the details here so we can see if there are > ways to prevent similar issues in the future, which I haven't heard > anybody talk about. :) We have hit a similar issue with Venky in https://github.com/ceph/ceph/pull/48382. This PR just added two extra members and the old kclient couldn't recognize them and also just incorrectly parsed them as a new snaptrace. Venky has fixed it by checking the peer client's feature bit before sending the msgs. Thanks - Xiubo > On Wed, May 17, 2023 at 8:03 AM Gregory Farnum <gfarnum@redhat.com> wrote: >> On Wed, May 17, 2023 at 7:27 AM Ilya Dryomov <idryomov@gmail.com> wrote: >>> On Wed, May 17, 2023 at 3:59 PM Gregory Farnum <gfarnum@redhat.com> wrote: >>>> On Wed, May 17, 2023 at 4:33 AM Ilya Dryomov <idryomov@gmail.com> wrote: >>>>> On Wed, May 17, 2023 at 1:04 PM Xiubo Li <xiubli@redhat.com> wrote: >>>>>> >>>>>> On 5/17/23 18:31, Ilya Dryomov wrote: >>>>>>> On Wed, May 17, 2023 at 7:24 AM <xiubli@redhat.com> wrote: >>>>>>>> From: Xiubo Li <xiubli@redhat.com> >>>>>>>> >>>>>>>> When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the >>>>>>>> request may still contain a list of 'split_realms', and we need >>>>>>>> to skip it anyway. Or it will be parsed as a corrupt snaptrace. >>>>>>>> >>>>>>>> Cc: stable@vger.kernel.org >>>>>>>> Cc: Frank Schilder <frans@dtu.dk> >>>>>>>> Reported-by: Frank Schilder <frans@dtu.dk> >>>>>>>> URL: https://tracker.ceph.com/issues/61200 >>>>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com> >>>>>>>> --- >>>>>>>> fs/ceph/snap.c | 3 +++ >>>>>>>> 1 file changed, 3 insertions(+) >>>>>>>> >>>>>>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c >>>>>>>> index 0e59e95a96d9..d95dfe16b624 100644 >>>>>>>> --- a/fs/ceph/snap.c >>>>>>>> +++ b/fs/ceph/snap.c >>>>>>>> @@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc, >>>>>>>> continue; >>>>>>>> adjust_snap_realm_parent(mdsc, child, realm->ino); >>>>>>>> } >>>>>>>> + } else { >>>>>>>> + p += sizeof(u64) * num_split_inos; >>>>>>>> + p += sizeof(u64) * num_split_realms; >>>>>>>> } >>>>>>>> >>>>>>>> /* >>>>>>>> -- >>>>>>>> 2.40.1 >>>>>>>> >>>>>>> Hi Xiubo, >>>>>>> >>>>>>> This code appears to be very old -- it goes back to the initial commit >>>>>>> 963b61eb041e ("ceph: snapshot management") in 2009. Do you have an >>>>>>> explanation for why this popped up only now? >>>>>> As I remembered we hit this before in one cu BZ last year, but I >>>>>> couldn't remember exactly which one. But I am not sure whether @Jeff >>>>>> saw this before I joint ceph team. >>>>>> >>>>>> >>>>>>> Has MDS always been including split_inos and split_realms arrays in >>>>>>> !CEPH_SNAP_OP_SPLIT case or is this a recent change? If it's a recent >>>>>>> change, I'd argue that this needs to be addressed on the MDS side. >>>>>> While in MDS side for the _UPDATE op it won't send the 'split_realm' >>>>>> list just before the commit in 2017: >>>>>> >>>>>> commit 93e7267757508520dfc22cff1ab20558bd4a44d4 >>>>>> Author: Yan, Zheng <zyan@redhat.com> >>>>>> Date: Fri Jul 21 21:40:46 2017 +0800 >>>>>> >>>>>> mds: send snap related messages centrally during mds recovery >>>>>> >>>>>> sending CEPH_SNAP_OP_SPLIT and CEPH_SNAP_OP_UPDATE messages to >>>>>> clients centrally in MDCache::open_snaprealms() >>>>>> >>>>>> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> >>>>>> >>>>>> Before this commit it will only send the 'split_realm' list for the >>>>>> _SPLIT op. >>>>> It sounds like we have the culprit. 
This should be treated as >>>>> a regression and fixed on the MDS side. I don't see a justification >>>>> for putting useless data on the wire. >>>> I don't really understand this viewpoint. We can treat it as an MDS >>>> regression if we want, but it's a six-year-old patch so this is in >>>> nearly every version of server code anybody's running. Why wouldn't we >>>> fix it on both sides? >>> Well, if I didn't speak up chances are we wouldn't have identified the >>> regression in the MDS at all. People seem to have this perception that >>> the client is somehow "easier" to fix, assume that the server is always >>> doing the right thing and default to patching the client. I'm just >>> trying to push back on that. >>> >>> In this particular case, after understanding the scope of the issue >>> _and_ getting a committal for the MDS side fix, I approved taking the >>> kernel client patch in an earlier reply. >>> >>>> On Wed, May 17, 2023 at 4:07 AM Xiubo Li <xiubli@redhat.com> wrote: >>>>> And if the split_realm number equals to sizeof(ceph_mds_snap_realm) + >>>>> extra snap buffer size by coincidence, the above 'corrupted' snaptrace >>>>> will be parsed by kclient too and kclient won't give any warning, but it >>>>> will corrupted the snaprealm and capsnap info in kclient. >>>> I'm a bit confused about this patch, but I also don't follow the >>>> kernel client code much so please forgive my ignorance. The change >>>> you've made is still only invoked inside of the CEPH_SNAP_OP_SPLIT >>>> case, so clearly the kclient *mostly* does the right thing when these >>> No, it's invoked outside: it introduces a "op != CEPH_SNAP_OP_SPLIT" >>> branch. >> Oh I mis-parsed the braces/spacing there. >> >> I'm still not getting how the precise size is causing the problem — >> obviously this isn't an unheard-of category of issue, but the fact >> that it works until the count matches a magic number is odd. Is that >> ceph_decode_need macro being called from ceph_update_snap_trace just >> skipping over the split data somehow? *puzzled* >> -Greg >> >>> Thanks, >>> >>> Ilya >>>
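The general shape of that kind of fix, sketched very loosely: only encode new wire fields once the peer has advertised support for them. The feature bit, struct and function below are invented for illustration and are not the real Ceph or kernel APIs.

/* feature_gate.c -- generic sketch of the gating pattern Xiubo mentions. */
#include <stdint.h>
#include <stdio.h>

#define FEATURE_NEW_SNAPREALM_FIELDS (1ULL << 12)	/* made-up bit */

struct peer_session {
	uint64_t features;	/* feature bits the peer advertised */
};

static void encode_snap_message(const struct peer_session *peer)
{
	printf("encode: legacy snaprealm fields\n");

	/* the extra members only go on the wire when the peer can parse
	 * them; older clients keep receiving the old layout */
	if (peer->features & FEATURE_NEW_SNAPREALM_FIELDS)
		printf("encode: two extra members\n");
}

int main(void)
{
	struct peer_session old_client = { .features = 0 };
	struct peer_session new_client = {
		.features = FEATURE_NEW_SNAPREALM_FIELDS,
	};

	encode_snap_message(&old_client);
	encode_snap_message(&new_client);
	return 0;
}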
Thanks Frank for your feedback. On 5/17/23 23:39, Frank Schilder wrote: > Hi all, > > joining in here as the one who is hit pretty badly and also not being able to upgrade ceph any time soon to a version receiving patches. > > For these two reasons alone I strongly favour fixing both sides. > > Extra reading, feel free to skip. > > Additional reasons for fixing both sides are (1) to have more error tolerant code - if one side breaks/regresses the other side still knows what to do and can report back while moving on without a fatal crash and (2) to help users of old clusters who are affected without noticing yet. Every now and then one should afford to be nice. > > I personally think that (1) is generally good practice, explicitly handling seemingly unexpected cases increases overall robustness (its a bit like raiding up code to catch code rot) and will highlight otherwise unnoticed issues early in testing. It is not the first time our installation was hit by an unnecessarily missing catch-all clause that triggered an assert or follow-up crash for no real reason. > > The main reason we actually discovered this is that under certain rare circumstances it makes a server with a kclient mount freeze. There is some kind of follow-up condition that is triggered only under heavy load and almost certainly only at a time when a snapshot is taken. Hence, it is very well possible that many if not all users have these invalid snaptrace message on their system, but nothing else happens so they don't report anything. > > The hallmark in our case is a hanging client caps recall that eventually leads to a spontaneous restart of the affected MDS and then we end up with either a frozen server or a stale file handle at the ceph mount point. Others might not encounter these conditions simultaneously on their system as often as we do. > > Apart from that, its not even sure that this is the core issue causing all the trouble on our system. Having the kclient fixed would allow us to verify that we don't have yet another problem that should be looked at before considering a ceph upgrade - extra reason no. 3. > > I hope this was a useful point of view from someone suffering from the condition. > > Best regards and thanks for your efforts addressing this! > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > >
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 0e59e95a96d9..d95dfe16b624 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -1114,6 +1114,9 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 				continue;
 			adjust_snap_realm_parent(mdsc, child, realm->ino);
 		}
+	} else {
+		p += sizeof(u64) * num_split_inos;
+		p += sizeof(u64) * num_split_realms;
 	}
 
 	/*
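Ilya asked for a comment in the new else branch; the follow-up hunk might look roughly like the sketch below. The comment wording is a guess, not the v2 that was actually posted.

+	} else {
+		/*
+		 * The split_inos/split_realms arrays are only meaningful
+		 * for CEPH_SNAP_OP_SPLIT, but an unfixed MDS may include
+		 * them anyway; skip them so the snap trace that follows
+		 * is decoded from the right offset.  This branch exists
+		 * only to work around that MDS bug.
+		 */
+		p += sizeof(u64) * num_split_inos;
+		p += sizeof(u64) * num_split_realms;
 	}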