Message ID | 20191212173159.35013-1-jlayton@kernel.org (mailing list archive)
---|---
State | New, archived
Series | [RFC] ceph: guard against __ceph_remove_cap races
On Thu, Dec 12, 2019 at 12:31:59PM -0500, Jeff Layton wrote:
> I believe it's possible that we could end up with racing calls to
> __ceph_remove_cap for the same cap. If that happens, the cap->ci
> pointer will be zeroed out and we can hit a NULL pointer dereference.
>
> Once we acquire the s_cap_lock, check for the ci pointer being NULL,
> and just return without doing anything if it is.
>
> URL: https://tracker.ceph.com/issues/43272
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/caps.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> This is the only scenario that made sense to me in light of Ilya's
> analysis on the tracker above. I could be off here though -- the locking
> around this code is horrifically complex, and I could be missing what
> should guard against this scenario.
>
> Thoughts?

This patch _seems_ to make sense. But, as you said, the locking code is
incredibly complex. I tried to understand whether __send_cap() could have
a similar race by accessing cap->ci without s_cap_lock, but I couldn't
reach a conclusion :-/

Cheers,
--
Luís
On Fri, Dec 13, 2019 at 1:32 AM Jeff Layton <jlayton@kernel.org> wrote:
> I believe it's possible that we could end up with racing calls to
> __ceph_remove_cap for the same cap. If that happens, the cap->ci
> pointer will be zeroed out and we can hit a NULL pointer dereference.
>
> Once we acquire the s_cap_lock, check for the ci pointer being NULL,
> and just return without doing anything if it is.
>
> URL: https://tracker.ceph.com/issues/43272
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/caps.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> This is the only scenario that made sense to me in light of Ilya's
> analysis on the tracker above. I could be off here though -- the locking
> around this code is horrifically complex, and I could be missing what
> should guard against this scenario.
>

I think the simpler fix is, in trim_caps_cb, to check whether cap->ci is
non-NULL before calling __ceph_remove_cap(). This should work because
__ceph_remove_cap() is always called with i_ceph_lock held.
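For concreteness, this is roughly the guard being suggested -- a minimal
sketch assuming trim_caps_cb()'s usual callback shape, with the function's
real trimming logic elided, not the actual patch:

/* Sketch of the suggested guard (assumes trim_caps_cb()'s usual shape;
 * the real wanted/issued accounting that decides whether to trim is
 * elided here). */
static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
{
	struct ceph_inode_info *ci = ceph_inode(inode);

	spin_lock(&ci->i_ceph_lock);
	if (!cap->ci) {
		/* lost a race with a competing removal; nothing to do */
		spin_unlock(&ci->i_ceph_lock);
		return 0;
	}
	/* ... decide whether this cap should be trimmed ... */
	__ceph_remove_cap(cap, false);
	spin_unlock(&ci->i_ceph_lock);
	return 0;
}

The check is only meaningful because i_ceph_lock is held across both the
test and the removal; if the lock were dropped and retaken in between, the
window would reopen.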
On Sat, 2019-12-14 at 10:46 +0800, Yan, Zheng wrote:
> On Fri, Dec 13, 2019 at 1:32 AM Jeff Layton <jlayton@kernel.org> wrote:
> > I believe it's possible that we could end up with racing calls to
> > __ceph_remove_cap for the same cap. If that happens, the cap->ci
> > pointer will be zeroed out and we can hit a NULL pointer dereference.
> >
> > Once we acquire the s_cap_lock, check for the ci pointer being NULL,
> > and just return without doing anything if it is.
> >
> > URL: https://tracker.ceph.com/issues/43272
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/caps.c | 21 ++++++++++++++++-----
> >  1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > This is the only scenario that made sense to me in light of Ilya's
> > analysis on the tracker above. I could be off here though -- the locking
> > around this code is horrifically complex, and I could be missing what
> > should guard against this scenario.
> >
>
> I think the simpler fix is, in trim_caps_cb, to check whether cap->ci is
> non-NULL before calling __ceph_remove_cap(). This should work because
> __ceph_remove_cap() is always called with i_ceph_lock held.
>

Is that sufficient though? The stack trace in the bug shows it being
called by ceph_trim_caps, but I think we could hit the same problem with
other __ceph_remove_cap callers, if they happen to race in at the right
time.
On 2019/12/13 1:31, Jeff Layton wrote:
> I believe it's possible that we could end up with racing calls to
> __ceph_remove_cap for the same cap. If that happens, the cap->ci
> pointer will be zeroed out and we can hit a NULL pointer dereference.
>
> Once we acquire the s_cap_lock, check for the ci pointer being NULL,
> and just return without doing anything if it is.
>
> URL: https://tracker.ceph.com/issues/43272
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/caps.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> This is the only scenario that made sense to me in light of Ilya's
> analysis on the tracker above. I could be off here though -- the locking
> around this code is horrifically complex, and I could be missing what
> should guard against this scenario.

I checked the downstream 3.10.0-862.14.4 code, and it seems we could hit
this when trim_caps_cb() runs while the inode is being destroyed.

All the __ceph_remove_cap() calls are protected by the ci->i_ceph_lock,
except when destroying the inode.

And upstream doesn't seem to have this problem now.

BRs
On Mon, 2019-12-16 at 09:57 +0800, Xiubo Li wrote:
> On 2019/12/13 1:31, Jeff Layton wrote:
> > I believe it's possible that we could end up with racing calls to
> > __ceph_remove_cap for the same cap. If that happens, the cap->ci
> > pointer will be zeroed out and we can hit a NULL pointer dereference.
> >
> > Once we acquire the s_cap_lock, check for the ci pointer being NULL,
> > and just return without doing anything if it is.
> >
> > URL: https://tracker.ceph.com/issues/43272
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/caps.c | 21 ++++++++++++++++-----
> >  1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > This is the only scenario that made sense to me in light of Ilya's
> > analysis on the tracker above. I could be off here though -- the locking
> > around this code is horrifically complex, and I could be missing what
> > should guard against this scenario.
>
> I checked the downstream 3.10.0-862.14.4 code, and it seems we could hit
> this when trim_caps_cb() runs while the inode is being destroyed.
>

Yes, RHEL7 kernels clearly have this race. We can probably pull in
d6e47819721ae2d9 to fix it there.

> All the __ceph_remove_cap() calls are protected by the ci->i_ceph_lock,
> except when destroying the inode.
>
> And upstream doesn't seem to have this problem now.
>

The only way the i_ceph_lock helps this is if you don't drop it after
you've found the cap in the inode's rbtree.

The only callers that don't hold the i_ceph_lock continuously are the
ceph_iterate_session_caps callbacks. Those however are iterating over
the session->s_caps list, so they shouldn't have a problem there either.

So I agree -- upstream doesn't have this problem and we can drop this
patch. We'll just have to focus on fixing this in RHEL7 instead.

Longer term, I think we need to consider a substantial overhaul of the
cap handling code. The locking in here is much more complex than is
warranted for what this code does. I'll probably start looking at that
once the dust settles on some other efforts.
On 2019/12/16 21:35, Jeff Layton wrote:
> On Mon, 2019-12-16 at 09:57 +0800, Xiubo Li wrote:
>> On 2019/12/13 1:31, Jeff Layton wrote:
>>> I believe it's possible that we could end up with racing calls to
>>> __ceph_remove_cap for the same cap. If that happens, the cap->ci
>>> pointer will be zeroed out and we can hit a NULL pointer dereference.
>>>
>>> Once we acquire the s_cap_lock, check for the ci pointer being NULL,
>>> and just return without doing anything if it is.
>>>
>>> URL: https://tracker.ceph.com/issues/43272
>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>> ---
>>>  fs/ceph/caps.c | 21 ++++++++++++++++-----
>>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> This is the only scenario that made sense to me in light of Ilya's
>>> analysis on the tracker above. I could be off here though -- the locking
>>> around this code is horrifically complex, and I could be missing what
>>> should guard against this scenario.
>> I checked the downstream 3.10.0-862.14.4 code, and it seems we could hit
>> this when trim_caps_cb() runs while the inode is being destroyed.
>>
> Yes, RHEL7 kernels clearly have this race. We can probably pull in
> d6e47819721ae2d9 to fix it there.

Yeah, it is.

>> All the __ceph_remove_cap() calls are protected by the ci->i_ceph_lock,
>> except when destroying the inode.
>>
>> And upstream doesn't seem to have this problem now.
>>
> The only way the i_ceph_lock helps this is if you don't drop it after
> you've found the cap in the inode's rbtree.
>
> The only callers that don't hold the i_ceph_lock continuously are the
> ceph_iterate_session_caps callbacks. Those however are iterating over
> the session->s_caps list, so they shouldn't have a problem there either.
>
> So I agree -- upstream doesn't have this problem and we can drop this
> patch. We'll just have to focus on fixing this in RHEL7 instead.
>
> Longer term, I think we need to consider a substantial overhaul of the
> cap handling code. The locking in here is much more complex than is
> warranted for what this code does. I'll probably start looking at that
> once the dust settles on some other efforts.

Yeah, we should make sure that ci is not released while we are still
accessing it. From this case it seems that could happen; maybe
iget/iput could help.

BRs
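As an illustration of the iget/iput idea -- a hypothetical sketch, not
code from this thread -- the caller would pin the inode before touching
the cap, so that ci cannot be freed underneath it:

/* Hypothetical sketch of pinning the inode while touching one of its
 * caps. igrab() returns NULL if the inode is already being evicted,
 * in which case there is nothing safe left to do with ci. */
struct inode *inode = igrab(&ci->vfs_inode);
if (!inode)
	return;		/* inode is on its way out */

spin_lock(&ci->i_ceph_lock);
/* ... operate on cap/ci ... */
spin_unlock(&ci->i_ceph_lock);

iput(inode);

This only helps the lifetime half of the problem, of course; it doesn't
by itself order two racing __ceph_remove_cap() calls.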
Jeff Layton <jlayton@kernel.org> writes:

> On Sat, 2019-12-14 at 10:46 +0800, Yan, Zheng wrote:
>> On Fri, Dec 13, 2019 at 1:32 AM Jeff Layton <jlayton@kernel.org> wrote:
>> > I believe it's possible that we could end up with racing calls to
>> > __ceph_remove_cap for the same cap. If that happens, the cap->ci
>> > pointer will be zeroed out and we can hit a NULL pointer dereference.
>> >
>> > Once we acquire the s_cap_lock, check for the ci pointer being NULL,
>> > and just return without doing anything if it is.
>> >
>> > URL: https://tracker.ceph.com/issues/43272
>> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>> > ---
>> >  fs/ceph/caps.c | 21 ++++++++++++++++-----
>> >  1 file changed, 16 insertions(+), 5 deletions(-)
>> >
>> > This is the only scenario that made sense to me in light of Ilya's
>> > analysis on the tracker above. I could be off here though -- the locking
>> > around this code is horrifically complex, and I could be missing what
>> > should guard against this scenario.
>> >
>>
>> I think the simpler fix is, in trim_caps_cb, to check whether cap->ci is
>> non-NULL before calling __ceph_remove_cap(). This should work because
>> __ceph_remove_cap() is always called with i_ceph_lock held.
>>
>
> Is that sufficient though? The stack trace in the bug shows it being
> called by ceph_trim_caps, but I think we could hit the same problem with
> other __ceph_remove_cap callers, if they happen to race in at the right
> time.

Sorry for resurrecting this old thread, but we just got a report with this
issue on a kernel that includes commit d6e47819721a ("ceph: hold
i_ceph_lock when removing caps for freeing inode").

Looking at the code, I believe Zheng's suggestion should work as I don't
see any __ceph_remove_cap callers that don't hold the i_ceph_lock. So,
would something like the diff below be acceptable?

Cheers,
On Wed, 2020-11-11 at 11:08 +0000, Luis Henriques wrote:
> Jeff Layton <jlayton@kernel.org> writes:
>
> > On Sat, 2019-12-14 at 10:46 +0800, Yan, Zheng wrote:
> > > On Fri, Dec 13, 2019 at 1:32 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > I believe it's possible that we could end up with racing calls to
> > > > __ceph_remove_cap for the same cap. If that happens, the cap->ci
> > > > pointer will be zeroed out and we can hit a NULL pointer dereference.
> > > >
> > > > Once we acquire the s_cap_lock, check for the ci pointer being NULL,
> > > > and just return without doing anything if it is.
> > > >
> > > > URL: https://tracker.ceph.com/issues/43272
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/ceph/caps.c | 21 ++++++++++++++++-----
> > > >  1 file changed, 16 insertions(+), 5 deletions(-)
> > > >
> > > > This is the only scenario that made sense to me in light of Ilya's
> > > > analysis on the tracker above. I could be off here though -- the locking
> > > > around this code is horrifically complex, and I could be missing what
> > > > should guard against this scenario.
> > > >
> > >
> > > I think the simpler fix is, in trim_caps_cb, to check whether cap->ci is
> > > non-NULL before calling __ceph_remove_cap(). This should work because
> > > __ceph_remove_cap() is always called with i_ceph_lock held.
> > >
> >
> > Is that sufficient though? The stack trace in the bug shows it being
> > called by ceph_trim_caps, but I think we could hit the same problem with
> > other __ceph_remove_cap callers, if they happen to race in at the right
> > time.
>
> Sorry for resurrecting this old thread, but we just got a report with this
> issue on a kernel that includes commit d6e47819721a ("ceph: hold
> i_ceph_lock when removing caps for freeing inode").
>
> Looking at the code, I believe Zheng's suggestion should work as I don't
> see any __ceph_remove_cap callers that don't hold the i_ceph_lock. So,
> would something like the diff below be acceptable?
>
> Cheers,

I'm still not convinced that's the correct fix.

Why would trim_caps_cb be subject to this race when other
__ceph_remove_cap callers are not? Maybe the right fix is to test for a
NULL cap->ci in __ceph_remove_cap and just return early if it is?
Jeff Layton <jlayton@kernel.org> writes:

> On Wed, 2020-11-11 at 11:08 +0000, Luis Henriques wrote:
>> Jeff Layton <jlayton@kernel.org> writes:
>>
>> > On Sat, 2019-12-14 at 10:46 +0800, Yan, Zheng wrote:
>> > > On Fri, Dec 13, 2019 at 1:32 AM Jeff Layton <jlayton@kernel.org> wrote:
>> > > > I believe it's possible that we could end up with racing calls to
>> > > > __ceph_remove_cap for the same cap. If that happens, the cap->ci
>> > > > pointer will be zeroed out and we can hit a NULL pointer dereference.
>> > > >
>> > > > Once we acquire the s_cap_lock, check for the ci pointer being NULL,
>> > > > and just return without doing anything if it is.
>> > > >
>> > > > URL: https://tracker.ceph.com/issues/43272
>> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>> > > > ---
>> > > >  fs/ceph/caps.c | 21 ++++++++++++++++-----
>> > > >  1 file changed, 16 insertions(+), 5 deletions(-)
>> > > >
>> > > > This is the only scenario that made sense to me in light of Ilya's
>> > > > analysis on the tracker above. I could be off here though -- the locking
>> > > > around this code is horrifically complex, and I could be missing what
>> > > > should guard against this scenario.
>> > > >
>> > >
>> > > I think the simpler fix is, in trim_caps_cb, to check whether cap->ci is
>> > > non-NULL before calling __ceph_remove_cap(). This should work because
>> > > __ceph_remove_cap() is always called with i_ceph_lock held.
>> > >
>> >
>> > Is that sufficient though? The stack trace in the bug shows it being
>> > called by ceph_trim_caps, but I think we could hit the same problem with
>> > other __ceph_remove_cap callers, if they happen to race in at the right
>> > time.
>>
>> Sorry for resurrecting this old thread, but we just got a report with this
>> issue on a kernel that includes commit d6e47819721a ("ceph: hold
>> i_ceph_lock when removing caps for freeing inode").
>>
>> Looking at the code, I believe Zheng's suggestion should work as I don't
>> see any __ceph_remove_cap callers that don't hold the i_ceph_lock. So,
>> would something like the diff below be acceptable?
>>
>> Cheers,
>
> I'm still not convinced that's the correct fix.
>
> Why would trim_caps_cb be subject to this race when other
> __ceph_remove_cap callers are not? Maybe the right fix is to test for a
> NULL cap->ci in __ceph_remove_cap and just return early if it is?

I see, you're probably right. Looking again at the code I see that there
are two possible places where this race could occur, and they're both
used as callbacks in ceph_iterate_session_caps. They are trim_caps_cb
and remove_session_caps_cb. These callbacks get the struct ceph_cap as
an argument and only then do they acquire i_ceph_lock. Since this isn't
protected by session->s_cap_lock, I guess this could be where the race
window is, where cap->ci can be set to NULL.

Below is the patch you suggested. If you think that's acceptable I can
resend with a proper commit message.

Cheers,
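Schematically, the window Luis describes looks like this -- a simplified
sketch of the iterate-then-callback pattern, with a made-up function name,
not the exact upstream ceph_iterate_session_caps():

/*
 * Simplified sketch (not the exact upstream loop): the cap is picked
 * off the session list under s_cap_lock, but the callback only takes
 * i_ceph_lock afterwards, so a concurrent __ceph_remove_cap() can
 * clear cap->ci in between.
 */
static int iterate_session_caps_sketch(struct ceph_mds_session *session,
				       int (*cb)(struct inode *,
						 struct ceph_cap *, void *),
				       void *arg)
{
	struct ceph_cap *cap;
	struct inode *inode;

	spin_lock(&session->s_cap_lock);
	cap = list_first_entry(&session->s_caps, struct ceph_cap,
			       session_caps);
	inode = &cap->ci->vfs_inode;	/* relies on cap->ci != NULL */
	spin_unlock(&session->s_cap_lock);

	/* <-- race window: nothing pins cap->ci from here on */
	return cb(inode, cap, arg);	/* cb takes i_ceph_lock itself */
}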
On Wed, 2020-11-11 at 14:11 +0000, Luis Henriques wrote:

I think this looks reasonable. Minor nits below:

> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index ded4229c314a..917dfaf0bd01 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1140,12 +1140,17 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
>  {
>  	struct ceph_mds_session *session = cap->session;
>  	struct ceph_inode_info *ci = cap->ci;
> -	struct ceph_mds_client *mdsc =
> -		ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
> +	struct ceph_mds_client *mdsc;
> +

nit: remove the above newline

>  	int removed = 0;
>

Maybe add a comment here to the effect that a NULL cap->ci indicates
that the remove has already been done?

> +	if (!ci)
> +		return;
> +
>  	dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
>
> +	mdsc = ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
> +

There's a ceph_inode_to_client helper now that may make this a bit more
readable.

>  	/* remove from inode's cap rbtree, and clear auth cap */
>  	rb_erase(&cap->ci_node, &ci->i_caps);
>  	if (ci->i_auth_cap == cap) {
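Folding those nits in, the prologue might end up looking something like
this -- a sketch of the not-yet-reposted revision, so the details are
assumptions rather than the final patch:

void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
{
	struct ceph_mds_session *session = cap->session;
	struct ceph_inode_info *ci = cap->ci;
	struct ceph_mds_client *mdsc;
	int removed = 0;

	/*
	 * If ci is NULL, a competing thread has already done the
	 * remove; there is nothing left to do here.
	 */
	if (!ci)
		return;

	dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);

	mdsc = ceph_inode_to_client(&ci->vfs_inode)->mdsc;

	/* ... rbtree and session-list removal continue unchanged,
	 * using session and removed as before ... */
}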
Jeff Layton <jlayton@kernel.org> writes:

> On Wed, 2020-11-11 at 14:11 +0000, Luis Henriques wrote:
>
> I think this looks reasonable. Minor nits below:
>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index ded4229c314a..917dfaf0bd01 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -1140,12 +1140,17 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
>>  {
>>  	struct ceph_mds_session *session = cap->session;
>>  	struct ceph_inode_info *ci = cap->ci;
>> -	struct ceph_mds_client *mdsc =
>> -		ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
>> +	struct ceph_mds_client *mdsc;
>> +
>
> nit: remove the above newline
>
>>  	int removed = 0;
>>
>
> Maybe add a comment here to the effect that a NULL cap->ci indicates
> that the remove has already been done?
>
>> +	if (!ci)
>> +		return;
>> +
>>  	dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
>>
>> +	mdsc = ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
>> +
>
> There's a ceph_inode_to_client helper now that may make this a bit more
> readable.
>
>>  	/* remove from inode's cap rbtree, and clear auth cap */
>>  	rb_erase(&cap->ci_node, &ci->i_caps);
>>  	if (ci->i_auth_cap == cap) {

Thanks Jeff. I'll re-post this soon with your suggestions. I just want
to run some more local tests to make sure things aren't breaking with
this change.

Cheers,
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 9d09bb53c1ab..7e39ee8eff60 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1046,11 +1046,22 @@ static void drop_inode_snap_realm(struct ceph_inode_info *ci)
 void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 {
 	struct ceph_mds_session *session = cap->session;
-	struct ceph_inode_info *ci = cap->ci;
-	struct ceph_mds_client *mdsc =
-		ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
+	struct ceph_inode_info *ci;
+	struct ceph_mds_client *mdsc;
 	int removed = 0;
 
+	spin_lock(&session->s_cap_lock);
+	ci = cap->ci;
+	if (!ci) {
+		/*
+		 * Did we race with a competing __ceph_remove_cap call? If
+		 * ci is zeroed out, then just unlock and don't do anything.
+		 * Assume that it's on its way out anyway.
+		 */
+		spin_unlock(&session->s_cap_lock);
+		return;
+	}
+
 	dout("__ceph_remove_cap %p from %p\n", cap, &ci->vfs_inode);
 
 	/* remove from inode's cap rbtree, and clear auth cap */
@@ -1058,13 +1069,12 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 	if (ci->i_auth_cap == cap)
 		ci->i_auth_cap = NULL;
 
-	/* remove from session list */
-	spin_lock(&session->s_cap_lock);
 	if (session->s_cap_iterator == cap) {
 		/* not yet, we are iterating over this very cap */
 		dout("__ceph_remove_cap delaying %p removal from session %p\n",
 		     cap, cap->session);
 	} else {
+		/* remove from session list */
 		list_del_init(&cap->session_caps);
 		session->s_nr_caps--;
 		cap->session = NULL;
@@ -1072,6 +1082,7 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 	}
 	/* protect backpointer with s_cap_lock: see iterate_session_caps */
 	cap->ci = NULL;
+	mdsc = ceph_sb_to_client(ci->vfs_inode.i_sb)->mdsc;
 
 	/*
 	 * s_cap_reconnect is protected by s_cap_lock. no one changes
I believe it's possible that we could end up with racing calls to
__ceph_remove_cap for the same cap. If that happens, the cap->ci
pointer will be zeroed out and we can hit a NULL pointer dereference.

Once we acquire the s_cap_lock, check for the ci pointer being NULL,
and just return without doing anything if it is.

URL: https://tracker.ceph.com/issues/43272
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

This is the only scenario that made sense to me in light of Ilya's
analysis on the tracker above. I could be off here though -- the locking
around this code is horrifically complex, and I could be missing what
should guard against this scenario.

Thoughts?