ceph: request expedited service when flushing caps

Message ID 20200331105223.9610-1-jlayton@kernel.org
State New

Commit Message

Jeff Layton March 31, 2020, 10:52 a.m. UTC
Jan noticed some long stalls when flushing caps using sync() after
doing small file creates. For instance, running this:

    $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done

Could take more than 90s in some cases. The sync() will flush out caps,
but doesn't tell the MDS that it's waiting synchronously on the
replies.

When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
into that fact and it can then expedite the reply.

URL: https://tracker.ceph.com/issues/44744
Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Luis Henriques March 31, 2020, 1:37 p.m. UTC | #1
On Tue, Mar 31, 2020 at 06:52:23AM -0400, Jeff Layton wrote:
> Jan noticed some long stalls when flushing caps using sync() after
> doing small file creates. For instance, running this:
> 
>     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> 
> Could take more than 90s in some cases. The sync() will flush out caps,
> but doesn't tell the MDS that it's waiting synchronously on the
> replies.
> 
> When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> into that fact and it can then expedite the reply.
> 
> URL: https://tracker.ceph.com/issues/44744
> Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Feel free to add my Reviewed-by (and also tested).  Also, it may be worth
adding a stable tag for this patch, although it would require some
massaging to use __send_cap instead of __prep_cap.

Cheers,
--
Luis

> ---
>  fs/ceph/caps.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 61808793e0c0..6403178f2376 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
>  
>  		mds = cap->mds;  /* remember mds, so we don't repeat */
>  
> -		__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> -			   retain, flushing, flush_tid, oldest_flush_tid);
> +		__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> +			   (flags & CHECK_CAPS_FLUSH) ?
> +			    CEPH_CLIENT_CAPS_SYNC : 0,
> +			   cap_used, want, retain, flushing, flush_tid,
> +			   oldest_flush_tid);
>  		spin_unlock(&ci->i_ceph_lock);
>  
>  		__send_cap(mdsc, &arg, ci);
> -- 
> 2.25.1
>
Yan, Zheng March 31, 2020, 1:48 p.m. UTC | #2
On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> Jan noticed some long stalls when flushing caps using sync() after
> doing small file creates. For instance, running this:
>
>     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
>
> Could take more than 90s in some cases. The sync() will flush out caps,
> but doesn't tell the MDS that it's waiting synchronously on the
> replies.
>
> When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> into that fact and it can then expedite the reply.
>
> URL: https://tracker.ceph.com/issues/44744
> Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/caps.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 61808793e0c0..6403178f2376 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
>
>                 mds = cap->mds;  /* remember mds, so we don't repeat */
>
> -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> -                          retain, flushing, flush_tid, oldest_flush_tid);
> +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> +                          (flags & CHECK_CAPS_FLUSH) ?
> +                           CEPH_CLIENT_CAPS_SYNC : 0,
> +                          cap_used, want, retain, flushing, flush_tid,
> +                          oldest_flush_tid);
>                 spin_unlock(&ci->i_ceph_lock);
>

This is too expensive for the syncfs case: the MDS needs to flush its
journal for each dirty inode. We'd better track dirty inodes by session,
and only set the flag when flushing the last inode in the session's
dirty list.


>                 __send_cap(mdsc, &arg, ci);
> --
> 2.25.1
>
Gregory Farnum March 31, 2020, 2 p.m. UTC | #3
On Tue, Mar 31, 2020 at 6:49 AM Yan, Zheng <ukernel@gmail.com> wrote:
>
> On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > Jan noticed some long stalls when flushing caps using sync() after
> > doing small file creates. For instance, running this:
> >
> >     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> >
> > Could take more than 90s in some cases. The sync() will flush out caps,
> > but doesn't tell the MDS that it's waiting synchronously on the
> > replies.
> >
> > When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> > CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> > into that fact and it can then expedite the reply.
> >
> > URL: https://tracker.ceph.com/issues/44744
> > Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/caps.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index 61808793e0c0..6403178f2376 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> >
> >                 mds = cap->mds;  /* remember mds, so we don't repeat */
> >
> > -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> > -                          retain, flushing, flush_tid, oldest_flush_tid);
> > +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> > +                          (flags & CHECK_CAPS_FLUSH) ?
> > +                           CEPH_CLIENT_CAPS_SYNC : 0,
> > +                          cap_used, want, retain, flushing, flush_tid,
> > +                          oldest_flush_tid);
> >                 spin_unlock(&ci->i_ceph_lock);
> >
>
> this is too expensive for syncfs case. mds needs to flush journal for
> each dirty inode.  we'd better to track dirty inodes by session, and
> only set the flag when flushing the last inode in session dirty list.

Yeah, see the userspace Client::_sync_fs() where we have an internal
flags argument which is set on the last cap in the dirty set and tells
the actual cap message flushing code to set FLAG_SYNC on the
MClientCaps message. I presume the kernel is operating on a similar
principle here?
-Greg


>
>
> >                 __send_cap(mdsc, &arg, ci);
> > --
> > 2.25.1
> >
>
Jeff Layton March 31, 2020, 2:56 p.m. UTC | #4
On Tue, 2020-03-31 at 07:00 -0700, Gregory Farnum wrote:
> On Tue, Mar 31, 2020 at 6:49 AM Yan, Zheng <ukernel@gmail.com> wrote:
> > On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > Jan noticed some long stalls when flushing caps using sync() after
> > > doing small file creates. For instance, running this:
> > > 
> > >     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> > > 
> > > Could take more than 90s in some cases. The sync() will flush out caps,
> > > but doesn't tell the MDS that it's waiting synchronously on the
> > > replies.
> > > 
> > > When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> > > CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> > > into that fact and it can then expedite the reply.
> > > 
> > > URL: https://tracker.ceph.com/issues/44744
> > > Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/ceph/caps.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > index 61808793e0c0..6403178f2376 100644
> > > --- a/fs/ceph/caps.c
> > > +++ b/fs/ceph/caps.c
> > > @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> > > 
> > >                 mds = cap->mds;  /* remember mds, so we don't repeat */
> > > 
> > > -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> > > -                          retain, flushing, flush_tid, oldest_flush_tid);
> > > +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> > > +                          (flags & CHECK_CAPS_FLUSH) ?
> > > +                           CEPH_CLIENT_CAPS_SYNC : 0,
> > > +                          cap_used, want, retain, flushing, flush_tid,
> > > +                          oldest_flush_tid);
> > >                 spin_unlock(&ci->i_ceph_lock);
> > > 
> > 
> > this is too expensive for syncfs case. mds needs to flush journal for
> > each dirty inode.  we'd better to track dirty inodes by session, and
> > only set the flag when flushing the last inode in session dirty list.

I think this will be more difficult than that...

> 
> Yeah, see the userspace Client::_sync_fs() where we have an internal
> flags argument which is set on the last cap in the dirty set and tells
> the actual cap message flushing code to set FLAG_SYNC on the
> MClientCaps message. I presume the kernel is operating on a similar
> principle here?

Not today, but we need it to.

The caps are not tracked on a per-session basis (as Zheng points out),
and the locking and ordering of these requests is not as straightforward
as it is in the (trivial) libcephfs case. Fixing this will be a lot more
invasive than I had originally hoped.

It's also not 100% clear to me how we'll gauge which cap will be
"last".  As we move the last cap on the session's list from dirty to
flushing state, we can mark it so that it sets the SYNC flag when it
goes out, but what happens if we have a process that is actively
dirtying other inodes during this? You might never see the per-session
list go empty.

It may also go empty for reasons that have nothing to do with issuing a
sync(), (or fsync() or...) so we don't want to universally set this flag
in that case.

I'm not sure what the right solution is yet.
Yan, Zheng April 1, 2020, 11:04 a.m. UTC | #5
On Tue, Mar 31, 2020 at 10:56 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2020-03-31 at 07:00 -0700, Gregory Farnum wrote:
> > On Tue, Mar 31, 2020 at 6:49 AM Yan, Zheng <ukernel@gmail.com> wrote:
> > > On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > Jan noticed some long stalls when flushing caps using sync() after
> > > > doing small file creates. For instance, running this:
> > > >
> > > >     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> > > >
> > > > Could take more than 90s in some cases. The sync() will flush out caps,
> > > > but doesn't tell the MDS that it's waiting synchronously on the
> > > > replies.
> > > >
> > > > When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> > > > CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> > > > into that fact and it can then expedite the reply.
> > > >
> > > > URL: https://tracker.ceph.com/issues/44744
> > > > Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/ceph/caps.c | 7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 61808793e0c0..6403178f2376 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> > > >
> > > >                 mds = cap->mds;  /* remember mds, so we don't repeat */
> > > >
> > > > -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> > > > -                          retain, flushing, flush_tid, oldest_flush_tid);
> > > > +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> > > > +                          (flags & CHECK_CAPS_FLUSH) ?
> > > > +                           CEPH_CLIENT_CAPS_SYNC : 0,
> > > > +                          cap_used, want, retain, flushing, flush_tid,
> > > > +                          oldest_flush_tid);
> > > >                 spin_unlock(&ci->i_ceph_lock);
> > > >
> > >
> > > this is too expensive for syncfs case. mds needs to flush journal for
> > > each dirty inode.  we'd better to track dirty inodes by session, and
> > > only set the flag when flushing the last inode in session dirty list.
>
> I think this will be more difficult than that...
>
> >
> > Yeah, see the userspace Client::_sync_fs() where we have an internal
> > flags argument which is set on the last cap in the dirty set and tells
> > the actual cap message flushing code to set FLAG_SYNC on the
> > MClientCaps message. I presume the kernel is operating on a similar
> > principle here?
>
> Not today, but we need it to.
>
> The caps are not tracked on a per-session basis (as Zheng points out),
> and the locking and ordering of these requests is not as straightforward
> as it is in the (trivial) libcephfs case. Fixing this will be a lot more
> invasive than I had originally hoped.
>
> It's also not 100% clear to me how we'll gauge which cap will be
> "last".  As we move the last cap on the session's list from dirty to
> flushing state, we can mark it so that it sets the SYNC flag when it
> goes out, but what happens if we have a process that is actively
> dirtying other inodes during this? You might never see the per-session
> list go empty.
>

It's not necessary to be strictly the 'last' one, just the one flushed before exiting the loop.

> It may also go empty for reasons that have nothing to do with issuing a
> sync(), (or fsync() or...) so we don't want to universally set this flag
> in that case.
>
> I'm not sure what the right solution is yet.
> --
> Jeff Layton <jlayton@kernel.org>
>
Jeff Layton April 1, 2020, 12:08 p.m. UTC | #6
On Wed, 2020-04-01 at 19:04 +0800, Yan, Zheng wrote:
> On Tue, Mar 31, 2020 at 10:56 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Tue, 2020-03-31 at 07:00 -0700, Gregory Farnum wrote:
> > > On Tue, Mar 31, 2020 at 6:49 AM Yan, Zheng <ukernel@gmail.com> wrote:
> > > > On Tue, Mar 31, 2020 at 6:52 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > Jan noticed some long stalls when flushing caps using sync() after
> > > > > doing small file creates. For instance, running this:
> > > > > 
> > > > >     $ time for i in $(seq -w 11 30); do echo "Hello World" > hello-$i.txt; sync -f ./hello-$i.txt; done
> > > > > 
> > > > > Could take more than 90s in some cases. The sync() will flush out caps,
> > > > > but doesn't tell the MDS that it's waiting synchronously on the
> > > > > replies.
> > > > > 
> > > > > When ceph_check_caps finds that CHECK_CAPS_FLUSH is set, then set the
> > > > > CEPH_CLIENT_CAPS_SYNC bit in the cap update request. This clues the MDS
> > > > > into that fact and it can then expedite the reply.
> > > > > 
> > > > > URL: https://tracker.ceph.com/issues/44744
> > > > > Reported-and-Tested-by: Jan Fajerski <jfajerski@suse.com>
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > >  fs/ceph/caps.c | 7 +++++--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > index 61808793e0c0..6403178f2376 100644
> > > > > --- a/fs/ceph/caps.c
> > > > > +++ b/fs/ceph/caps.c
> > > > > @@ -2111,8 +2111,11 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> > > > > 
> > > > >                 mds = cap->mds;  /* remember mds, so we don't repeat */
> > > > > 
> > > > > -               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
> > > > > -                          retain, flushing, flush_tid, oldest_flush_tid);
> > > > > +               __prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
> > > > > +                          (flags & CHECK_CAPS_FLUSH) ?
> > > > > +                           CEPH_CLIENT_CAPS_SYNC : 0,
> > > > > +                          cap_used, want, retain, flushing, flush_tid,
> > > > > +                          oldest_flush_tid);
> > > > >                 spin_unlock(&ci->i_ceph_lock);
> > > > > 
> > > > 
> > > > this is too expensive for syncfs case. mds needs to flush journal for
> > > > each dirty inode.  we'd better to track dirty inodes by session, and
> > > > only set the flag when flushing the last inode in session dirty list.
> > 
> > I think this will be more difficult than that...
> > 
> > > Yeah, see the userspace Client::_sync_fs() where we have an internal
> > > flags argument which is set on the last cap in the dirty set and tells
> > > the actual cap message flushing code to set FLAG_SYNC on the
> > > MClientCaps message. I presume the kernel is operating on a similar
> > > principle here?
> > 
> > Not today, but we need it to.
> > 
> > The caps are not tracked on a per-session basis (as Zheng points out),
> > and the locking and ordering of these requests is not as straightforward
> > as it is in the (trivial) libcephfs case. Fixing this will be a lot more
> > invasive than I had originally hoped.
> > 
> > It's also not 100% clear to me how we'll gauge which cap will be
> > "last".  As we move the last cap on the session's list from dirty to
> > flushing state, we can mark it so that it sets the SYNC flag when it
> > goes out, but what happens if we have a process that is actively
> > dirtying other inodes during this? You might never see the per-session
> > list go empty.
> > 
> 
> It's not necessary to be strict 'last'.  just the one before exiting the loop
> 

What loop?

I added a little debugging code and ran Jan's reproducer, and it turns
out that we generally move the inode to flushing state in
ceph_put_wrbuffer_cap_refs while doing writeback under the sync syscall.
By the time we get to any sort of looping in the ceph code, the cap
flushes have already been issued.

My tentative idea was to just check whether this was the last dirty cap
associated with the session and set the flag on it if so, but we don't
really have visibility into that info, because we don't determine the
session until we move the inode from dirty->flushing.

So at this point, I'm still looking at options for fixing this. I really
don't want to just hack this in, as the technical debt in this code is
already substantial and that'll just make it worse.

Patch

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 61808793e0c0..6403178f2376 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2111,8 +2111,11 @@  void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 
 		mds = cap->mds;  /* remember mds, so we don't repeat */
 
-		__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want,
-			   retain, flushing, flush_tid, oldest_flush_tid);
+		__prep_cap(&arg, cap, CEPH_CAP_OP_UPDATE,
+			   (flags & CHECK_CAPS_FLUSH) ?
+			    CEPH_CLIENT_CAPS_SYNC : 0,
+			   cap_used, want, retain, flushing, flush_tid,
+			   oldest_flush_tid);
 		spin_unlock(&ci->i_ceph_lock);
 
 		__send_cap(mdsc, &arg, ci);