diff mbox

enable mds rejoin with active inodes' old parent xattrs

Message ID orvc2ytei1.fsf@livre.home (mailing list archive)
State New, archived
Headers show

Commit Message

Alexandre Oliva Aug. 22, 2013, 6:40 a.m. UTC
When the parent xattrs of active inodes that the mds attempts to open
during rejoin lack pool info (struct_v < 5), this field will be filled
in with -1, causing the mds to retry fetching a backtrace with a pool
number that matches the expected value, which fails and causes the
err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
with pool -1, and so keeps on bouncing between the two retry cases
forever.

This patch arranges for the mds to go along with pool -1 instead of
insisting that it be refetched, enabling it to complete recovery
instead of eating cpu, network bandwidth and metadata osd's resources
like there's no tomorrow, in what AFAICT is an infinite and very busy
loop.

This is not a new problem: I've had it even before upgrading from
Cuttlefish to Dumpling, I'd just never managed to track it down, and
force-unmounting the filesystem and then restarting the mds was an
easier (if inconvenient) work-around, particularly because it always
hit when the filesystem was under active, heavy-ish use (or there
wouldn't be much reason for caps recovery ;-)


There are two issues not addressed in this patch, however.  One is
that nothing seems to proactively update the parent xattr when it is
found to be outdated, so it remains out of date forever.  Not even
renaming top-level directories causes the xattrs to be recursively
rewritten.  AFAICT that's a bug.

The other is that inodes that don't have a parent xattr (created by
even older versions of ceph) are reported as non-existing in the mds
rejoin message, because the absence of the parent xattr is signaled as
a missing inode (“failed to reconnect caps for missing inodes”).  I
suppose this may cause more serious recovery problems.

I suppose a global pass over the filesystem tree updating parent
xattrs that are out-of-date would be desirable, if we find any parent
xattrs still lacking current information; it might make sense to
activate it as a background thread from the backtrace decoding
function, when it finds a parent xattr that's too out-of-date, or as a
separate client (ceph-fsck?).

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mds/MDCache.cc |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Yan, Zheng Aug. 22, 2013, 8:27 a.m. UTC | #1
On Thu, Aug 22, 2013 at 2:40 PM, Alexandre Oliva <oliva@gnu.org> wrote:
> When the parent xattrs of active inodes that the mds attempts to open
> during rejoin lack pool info (struct_v < 5), this field will be filled
> in with -1, causing the mds to retry fetching a backtrace with a pool
> number that matches the expected value, which fails and causes the
> err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
> with pool -1, and so keeps on bouncing between the two retry cases
> forever.
>
> This patch arranges for the mds to go along with pool -1 instead of
> insisting that it be refetched, enabling it to complete recovery
> instead of eating cpu, network bandwidth and metadata osd's resources
> like there's no tomorrow, in what AFAICT is an infinite and very busy
> loop.
>
> This is not a new problem: I've had it even before upgrading from
> Cuttlefish to Dumpling, I'd just never managed to track it down, and
> force-unmounting the filesystem and then restarting the mds was an
> easier (if inconvenient) work-around, particularly because it always
> hit when the filesystem was under active, heavy-ish use (or there
> wouldn't be much reason for caps recovery ;-)
>
My fault, I didn't do serious upgrade test for my code.  Thank you for nail
the issue down.

>
> There are two issues not addressed in this patch, however.  One is
> that nothing seems to proactively update the parent xattr when it is
> found to be outdated, so it remains out of date forever.  Not even
> renaming top-level directories causes the xattrs to be recursively
> rewritten.  AFAICT that's a bug.

This is not bug. Only the tail entry of the path encoded in the parent xattrs
need to be updated. (the entry for inode's parent directory)

>
> The other is that inodes that don't have a parent xattr (created by
> even older versions of ceph) are reported as non-existing in the mds
> rejoin message, because the absence of the parent xattr is signaled as
> a missing inode (“failed to reconnect caps for missing inodes”).  I
> suppose this may cause more serious recovery problems.

Cuttlefish also has this issue, it just does not print the error message
to console.

>
> I suppose a global pass over the filesystem tree updating parent
> xattrs that are out-of-date would be desirable, if we find any parent
> xattrs still lacking current information; it might make sense to
> activate it as a background thread from the backtrace decoding
> function, when it finds a parent xattr that's too out-of-date, or as a
> separate client (ceph-fsck?).
>
> Signed-off-by: Alexandre Oliva <oliva@gnu.org>
> ---
>  src/mds/MDCache.cc |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
> index e592dde..b6c37aec 100644
> --- a/src/mds/MDCache.cc
> +++ b/src/mds/MDCache.cc
> @@ -7940,7 +7940,7 @@ void MDCache::_open_ino_backtrace_fetched(inodeno_t ino, bufferlist& bl, int err
>    inode_backtrace_t backtrace;
>    if (err == 0) {
>      ::decode(backtrace, bl);
> -    if (backtrace.pool != info.pool) {
> +    if (backtrace.pool != info.pool && backtrace.pool != -1) {
>        dout(10) << " old object in pool " << info.pool
>                << ", retrying pool " << backtrace.pool << dendl;
>        info.pool = backtrace.pool;
>

Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>

Regards
Yan, Zheng

> --
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 22, 2013, 3:16 p.m. UTC | #2
Applied this to next and opened http://tracker.ceph.com/issues/6087 so we 
don't forget to backport it.

Is there any other change we want to make?  We could make the fetch code 
write the directory backpointers if they are not already present, which 
would at least clean up the dir objects (as they are touched).  :/

sage



On Thu, 22 Aug 2013, Yan, Zheng wrote:
> On Thu, Aug 22, 2013 at 2:40 PM, Alexandre Oliva <oliva@gnu.org> wrote:
> > When the parent xattrs of active inodes that the mds attempts to open
> > during rejoin lack pool info (struct_v < 5), this field will be filled
> > in with -1, causing the mds to retry fetching a backtrace with a pool
> > number that matches the expected value, which fails and causes the
> > err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
> > with pool -1, and so keeps on bouncing between the two retry cases
> > forever.
> >
> > This patch arranges for the mds to go along with pool -1 instead of
> > insisting that it be refetched, enabling it to complete recovery
> > instead of eating cpu, network bandwidth and metadata osd's resources
> > like there's no tomorrow, in what AFAICT is an infinite and very busy
> > loop.
> >
> > This is not a new problem: I've had it even before upgrading from
> > Cuttlefish to Dumpling, I'd just never managed to track it down, and
> > force-unmounting the filesystem and then restarting the mds was an
> > easier (if inconvenient) work-around, particularly because it always
> > hit when the filesystem was under active, heavy-ish use (or there
> > wouldn't be much reason for caps recovery ;-)
> >
> My fault, I didn't do serious upgrade test for my code.  Thank you for nail
> the issue down.
> 
> >
> > There are two issues not addressed in this patch, however.  One is
> > that nothing seems to proactively update the parent xattr when it is
> > found to be outdated, so it remains out of date forever.  Not even
> > renaming top-level directories causes the xattrs to be recursively
> > rewritten.  AFAICT that's a bug.
> 
> This is not bug. Only the tail entry of the path encoded in the parent xattrs
> need to be updated. (the entry for inode's parent directory)
> 
> >
> > The other is that inodes that don't have a parent xattr (created by
> > even older versions of ceph) are reported as non-existing in the mds
> > rejoin message, because the absence of the parent xattr is signaled as
> > a missing inode (?failed to reconnect caps for missing inodes?).  I
> > suppose this may cause more serious recovery problems.
> 
> Cuttlefish also has this issue, it just does not print the error message
> to console.
> 
> >
> > I suppose a global pass over the filesystem tree updating parent
> > xattrs that are out-of-date would be desirable, if we find any parent
> > xattrs still lacking current information; it might make sense to
> > activate it as a background thread from the backtrace decoding
> > function, when it finds a parent xattr that's too out-of-date, or as a
> > separate client (ceph-fsck?).
> >
> > Signed-off-by: Alexandre Oliva <oliva@gnu.org>
> > ---
> >  src/mds/MDCache.cc |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
> > index e592dde..b6c37aec 100644
> > --- a/src/mds/MDCache.cc
> > +++ b/src/mds/MDCache.cc
> > @@ -7940,7 +7940,7 @@ void MDCache::_open_ino_backtrace_fetched(inodeno_t ino, bufferlist& bl, int err
> >    inode_backtrace_t backtrace;
> >    if (err == 0) {
> >      ::decode(backtrace, bl);
> > -    if (backtrace.pool != info.pool) {
> > +    if (backtrace.pool != info.pool && backtrace.pool != -1) {
> >        dout(10) << " old object in pool " << info.pool
> >                << ", retrying pool " << backtrace.pool << dendl;
> >        info.pool = backtrace.pool;
> >
> 
> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
> 
> Regards
> Yan, Zheng
> 
> > --
> > Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> > You must be the change you wish to see in the world. -- Gandhi
> > Be Free! -- http://FSFLA.org/   FSF Latin America board member
> > Free Software Evangelist      Red Hat Brazil Compiler Engineer
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexandre Oliva Aug. 23, 2013, 11 a.m. UTC | #3
On Aug 22, 2013, "Yan, Zheng" <ukernel@gmail.com> wrote:

> This is not bug. Only the tail entry of the path encoded in the parent xattrs
> need to be updated. (the entry for inode's parent directory)

Why store the others, if they're not usable, then?  IMHO it just
introduces a risk of their being accidentally misused.

> Cuttlefish also has this issue, it just does not print the error message
> to console.

Speaking of Cuttlefish, I don't think my statement that I'd run into
this issue in Cuttlefish was correct: I don't see similar code to the
one I patched there, and indeed, (accidentally) starting a Cuttlefish
mds on an otherwise Dumpling cluster enabled recovery to complete where
Dumpling's mds wouldn't.  So, no backporting for this patch of mine, I
think.
Gregory Farnum Aug. 23, 2013, 8:11 p.m. UTC | #4
On Fri, Aug 23, 2013 at 4:00 AM, Alexandre Oliva <oliva@gnu.org> wrote:
> On Aug 22, 2013, "Yan, Zheng" <ukernel@gmail.com> wrote:
>
>> This is not bug. Only the tail entry of the path encoded in the parent xattrs
>> need to be updated. (the entry for inode's parent directory)
>
> Why store the others, if they're not usable, then?  IMHO it just
> introduces a risk of their being accidentally misused.

We want to write a correct path whenever we touch the file, but we
don't want to have to go out to each file's object in order to
complete a directory rename. So for a given file object we consider
only the immediate parent to be authoritative, but keep around the
full path for disaster recovery. This is why each path is versioned.
If we are touching the file's ancestor xattr and not updating the
whole thing, that would be a bug, but I don't think that's what you're
describing?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexandre Oliva Aug. 24, 2013, 10:38 a.m. UTC | #5
On Aug 23, 2013, Gregory Farnum <greg@inktank.com> wrote:

> On Fri, Aug 23, 2013 at 4:00 AM, Alexandre Oliva <oliva@gnu.org> wrote:
>> On Aug 22, 2013, "Yan, Zheng" <ukernel@gmail.com> wrote:
>> 
>>> This is not bug. Only the tail entry of the path encoded in the parent xattrs
>>> need to be updated. (the entry for inode's parent directory)
>> 
>> Why store the others, if they're not usable, then?  IMHO it just
>> introduces a risk of their being accidentally misused.

> We want to write a correct path whenever we touch the file, but we
> don't want to have to go out to each file's object in order to
> complete a directory rename. So for a given file object we consider
> only the immediate parent to be authoritative, but keep around the
> full path for disaster recovery.

Ok, so all this extra info is for disaster recovery only.  I guess that
makes some sense to me, as long as that's clear to everone who might
want to tap on that info.

> If we are touching the file's ancestor xattr and not updating the
> whole thing, that would be a bug, but I don't think that's what you're
> describing?

It was not, indeed.  Thanks for the clarification.
diff mbox

Patch

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index e592dde..b6c37aec 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -7940,7 +7940,7 @@  void MDCache::_open_ino_backtrace_fetched(inodeno_t ino, bufferlist& bl, int err
   inode_backtrace_t backtrace;
   if (err == 0) {
     ::decode(backtrace, bl);
-    if (backtrace.pool != info.pool) {
+    if (backtrace.pool != info.pool && backtrace.pool != -1) {
       dout(10) << " old object in pool " << info.pool
 	       << ", retrying pool " << backtrace.pool << dendl;
       info.pool = backtrace.pool;