Add old_inodes to emetablob

Message ID	alpine.DEB.2.00.1208191422110.26858@cobra.newdream.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <ceph-devel-owner@vger.kernel.org> Date: Sun, 19 Aug 2012 14:22:24 -0700 (PDT) From: Sage Weil <sage@inktank.com> To: Alexandre Oliva <oliva@lsd.ic.unicamp.br> cc: Sage Weil <sage@newdream.net>, ceph-devel@vger.kernel.org Subject: Re: [PATCH] Add old_inodes to emetablob In-Reply-To: <alpine.DEB.2.00.1208191415240.26858@cobra.newdream.net> Message-ID: <alpine.DEB.2.00.1208191422110.26858@cobra.newdream.net> References: <orboos1qyu.fsf@livre.localdomain> <Pine.LNX.4.64.1202222159060.27988@cobra.newdream.net> <ork43bbbbb.fsf@livre.localdomain> <Pine.LNX.4.64.1203031551560.11992@cobra.newdream.net> <ora9xsbm9w.fsf@livre.localdomain> <orzk5ra8m9.fsf@livre.localdomain> <alpine.DEB.2.00.1208191415240.26858@cobra.newdream.net> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk

Sage Weil Aug. 19, 2012, 9:22 p.m. UTC

On Sun, 19 Aug 2012, Sage Weil wrote:
> On Sun, 19 Aug 2012, Alexandre Oliva wrote:
> > On Aug 18, 2012, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:
> > 
> > > I've looked further into this tonight, and I found out what modifies
> > > the timestamps in the inode is (re)issuing caps to a client.
> > 
> > > So, how can we stop pre_cow_old_inode from messing with the old_inode?
> > > Any suggestions?
> > 
> > This patch seems to avoid the problem, but is it correct, or is it just
> > papering over a problem elsewhere?
> 
> I think the real bug is that CInode::first is less than one of the 
> old_inodes keys.  Looking at cow_old_inode(), it looks like first should 
> always be strictly > than all old_inodes keys.  We should probably assert 
> as much wherever it is convenient to do so...
> 
> In any case, can you see where/how that is happening?  If this is a dir 
> inode that has been commited and then refetched from the parent dir, that 
> probably explains it.  For non-multiversion inodes, dn->first == 
> in->first, but not so for these guys, and at first glance I'm not seeing 
> it in the CDir::_fetched() code.

This might do the trick?



> 
> sage
> 
> > 
> > Subject: mds: Don't modify already-created old_inode
> > 
> > In cow_old_inode, do not modify an old_inode that was created before.
> > 
> > Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
> > ---
> >  src/mds/CInode.cc |   14 +++++++++-----
> >  1 files changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
> > index 53f9e69..fdff86c 100644
> > --- a/src/mds/CInode.cc
> > +++ b/src/mds/CInode.cc
> > @@ -2030,12 +2030,16 @@ old_inode_t& CInode::cow_old_inode(snapid_t follows, bool cow_head)
> >    inode_t *pi = cow_head ? get_projected_inode() : get_previous_projected_inode();
> >    map<string,bufferptr> *px = cow_head ? get_projected_xattrs() : get_previous_projected_xattrs();
> >  
> > +  bool found = old_inodes.find(follows) != old_inodes.end();
> >    old_inode_t &old = old_inodes[follows];
> > -  old.first = first;
> > -  old.inode = *pi;
> > -  old.xattrs = *px;
> > -  
> > -  dout(10) << " " << px->size() << " xattrs cowed, " << *px << dendl;
> > +
> > +  if (!found) {
> > +    old.first = first;
> > +    old.inode = *pi;
> > +    old.xattrs = *px;
> > +
> > +    dout(10) << " " << px->size() << " xattrs cowed, " << *px << dendl;
> > +  }
> >  
> >    old.inode.trim_client_ranges(follows);
> >  
> > -- 
> > 1.7.7.6
> > 
> > 
> > 
> > -- 
> > Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> > You must be the change you wish to see in the world. -- Gandhi
> > Be Free! -- http://FSFLA.org/   FSF Latin America board member
> > Free Software Evangelist      Red Hat Brazil Compiler Engineer
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ryan Nicholson Aug. 23, 2012, 6:51 p.m. UTC | #1

All:

I have a 16-OSD cluster running 0.48 (Argonaut), built from source.

I rebuilt the entire cluster on Sunday Evening 8-19-2012, and started some rados testing.

I have a custom CRUSH map, that calls for the "rbd", "metadata" pools and a custom pool called "SCSI" to be pulled from osd.0-11, while the "data" pool is pulled from osd.12-15. While testing, I find that the cluster is putting data where I want it to, with one exception: the SCSI pool is not storing data evenly thoughout the osd.0-11. Through "df", I find that about every other osd is seeing space utilization.

So, whether good or bad, I did a "ceph osd reweight-by-utilization", which did improve the situation.

Now, after doing some more research in the mailing lists, I find that I should have just let the cluster figure it out on its own.

All that to lead to the problem I'm having now, and, I wish to use this mistake as a learning tool. My ceph status is this:

ceph -s
##
health HEALTH_WARN 377 pgs stale; 4 pgs stuck inactive; 377 pgs stuck stale; 948 pgs stuck unclean
monmap e1: 3 mons at {a=10.9.181.10:6789/0,b=10.9.181.11:6789/0,c=10.9.181.12:6789/0}, election epoch 2, quorum 0,1,2 a,b,c
osdmap e90: 16 osds: 16 up, 16 in
pgmap v5085: 3080 pgs: 4 creating, 1755 active+clean, 377 stale+active+clean, 944 active+remapped; 10175 MB data, 52057 MB used, 12244 GB / 12815 GB avail
mdsmap e16: 1/1/1 up {0=b=up:replay}, 2 up:standby
##

Side-affects: I can create and map any Rados pools. I cannot for the life of me, write to them, format them, anything them. Making my entire cluster offline to clients.

While I've parse and poured over the documentation, I really need experienced help, just to know how to get Ceph to recover, and then allow for operation again.

I've restarted each daemon individually several times, after which I've also tried a complete stop and start of the cluster. After things settle, this reveals the same ceph -s status as I've posted above.

Thanks for your time!

Ryan Nicholson

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Gregory Farnum Aug. 23, 2012, 7:41 p.m. UTC | #2

On Thu, Aug 23, 2012 at 2:51 PM, Ryan Nicholson <Ryan.Nicholson@kcrg.com> wrote:
> All:
>
> I have a 16-OSD cluster running 0.48 (Argonaut), built from source.
>
> I rebuilt the entire cluster on Sunday Evening 8-19-2012, and started some rados testing.
>
> I have a custom CRUSH map, that calls for the "rbd", "metadata" pools and a custom pool called "SCSI" to be pulled from osd.0-11, while the "data" pool is pulled from osd.12-15. While testing, I find that the cluster is putting data where I want it to, with one exception: the SCSI pool is not storing data evenly thoughout the osd.0-11. Through "df", I find that about every other osd is seeing space utilization.
>
> So, whether good or bad, I did a "ceph osd reweight-by-utilization", which did improve the situation.
>
> Now, after doing some more research in the mailing lists, I find that I should have just let the cluster figure it out on its own.
>
> All that to lead to the problem I'm having now, and, I wish to use this mistake as a learning tool. My ceph status is this:
>
> ceph -s
> ##
>    health HEALTH_WARN 377 pgs stale; 4 pgs stuck inactive; 377 pgs stuck stale; 948 pgs stuck unclean
>    monmap e1: 3 mons at {a=10.9.181.10:6789/0,b=10.9.181.11:6789/0,c=10.9.181.12:6789/0}, election epoch 2, quorum 0,1,2 a,b,c
>    osdmap e90: 16 osds: 16 up, 16 in
>     pgmap v5085: 3080 pgs: 4 creating, 1755 active+clean, 377 stale+active+clean, 944 active+remapped; 10175 MB data, 52057 MB used, 12244 GB / 12815 GB avail
>    mdsmap e16: 1/1/1 up {0=b=up:replay}, 2 up:standby
> ##
>
> Side-affects: I can create and map any Rados pools. I cannot for the life of me, write to them, format them, anything them. Making my entire cluster offline to clients.
>
> While I've parse and poured over the documentation, I really need experienced help, just to know how to get Ceph to recover, and then allow for operation again.
>
> I've restarted each daemon individually several times, after which I've also tried a complete stop and start of the cluster. After things settle, this reveals the same ceph -s status as I've posted above.
>
> Thanks for your time!

You'll want to start by running "ceph pg dump" and trying to find
patterns in the PGs that are stale. If you put it up on pastebin or
something I'm sure somebody will be happy to check it out too. PGs
that are remapped are a problem with your CRUSH map — can you also
post it?
And just for good measure we might as well see the output of "ceph osd
dump" as well.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ryan Nicholson Aug. 23, 2012, 9:38 p.m. UTC | #3

Thanks, Greg!  Here's the link to all of the dumps you asked for: http://pastebin.com/y4bPwSz8

Let me know what you think!

Ryan Nicholson

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Thursday, August 23, 2012 2:41 PM
To: Ryan Nicholson
Cc: ceph-devel@vger.kernel.org
Subject: Re: PG's

On Thu, Aug 23, 2012 at 2:51 PM, Ryan Nicholson <Ryan.Nicholson@kcrg.com> wrote:
> All:
>
> I have a 16-OSD cluster running 0.48 (Argonaut), built from source.
>
> I rebuilt the entire cluster on Sunday Evening 8-19-2012, and started some rados testing.
>
> I have a custom CRUSH map, that calls for the "rbd", "metadata" pools and a custom pool called "SCSI" to be pulled from osd.0-11, while the "data" pool is pulled from osd.12-15. While testing, I find that the cluster is putting data where I want it to, with one exception: the SCSI pool is not storing data evenly thoughout the osd.0-11. Through "df", I find that about every other osd is seeing space utilization.
>
> So, whether good or bad, I did a "ceph osd reweight-by-utilization", which did improve the situation.
>
> Now, after doing some more research in the mailing lists, I find that I should have just let the cluster figure it out on its own.
>
> All that to lead to the problem I'm having now, and, I wish to use this mistake as a learning tool. My ceph status is this:
>
> ceph -s
> ##
>    health HEALTH_WARN 377 pgs stale; 4 pgs stuck inactive; 377 pgs stuck stale; 948 pgs stuck unclean
>    monmap e1: 3 mons at {a=10.9.181.10:6789/0,b=10.9.181.11:6789/0,c=10.9.181.12:6789/0}, election epoch 2, quorum 0,1,2 a,b,c
>    osdmap e90: 16 osds: 16 up, 16 in
>     pgmap v5085: 3080 pgs: 4 creating, 1755 active+clean, 377 stale+active+clean, 944 active+remapped; 10175 MB data, 52057 MB used, 12244 GB / 12815 GB avail
>    mdsmap e16: 1/1/1 up {0=b=up:replay}, 2 up:standby ##
>
> Side-affects: I can create and map any Rados pools. I cannot for the life of me, write to them, format them, anything them. Making my entire cluster offline to clients.
>
> While I've parse and poured over the documentation, I really need experienced help, just to know how to get Ceph to recover, and then allow for operation again.
>
> I've restarted each daemon individually several times, after which I've also tried a complete stop and start of the cluster. After things settle, this reveals the same ceph -s status as I've posted above.
>
> Thanks for your time!

You'll want to start by running "ceph pg dump" and trying to find patterns in the PGs that are stale. If you put it up on pastebin or something I'm sure somebody will be happy to check it out too. PGs that are remapped are a problem with your CRUSH map - can you also post it?
And just for good measure we might as well see the output of "ceph osd dump" as well.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ryan Nicholson Aug. 24, 2012, 2:07 a.m. UTC | #4

http://pastebin.com/5bRiUTxf

Greg I've also attached a "ceph osd tree" dump (above). From what I can tell, the tree is correct, and lines up with how I desire to weight the cluster(s), however, I do see that the reweight for the smaller osds (SCSI-Nodes) are less than 1. Perhaps I need to look at this?

Thanks,

Ryan Nicholson
(ryann)

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Thursday, August 23, 2012 2:41 PM
To: Ryan Nicholson
Cc: ceph-devel@vger.kernel.org
Subject: Re: PG's

On Thu, Aug 23, 2012 at 2:51 PM, Ryan Nicholson <Ryan.Nicholson@kcrg.com> wrote:
> All:
>
> I have a 16-OSD cluster running 0.48 (Argonaut), built from source.
>
> I rebuilt the entire cluster on Sunday Evening 8-19-2012, and started some rados testing.
>
> I have a custom CRUSH map, that calls for the "rbd", "metadata" pools and a custom pool called "SCSI" to be pulled from osd.0-11, while the "data" pool is pulled from osd.12-15. While testing, I find that the cluster is putting data where I want it to, with one exception: the SCSI pool is not storing data evenly thoughout the osd.0-11. Through "df", I find that about every other osd is seeing space utilization.
>
> So, whether good or bad, I did a "ceph osd reweight-by-utilization", which did improve the situation.
>
> Now, after doing some more research in the mailing lists, I find that I should have just let the cluster figure it out on its own.
>
> All that to lead to the problem I'm having now, and, I wish to use this mistake as a learning tool. My ceph status is this:
>
> ceph -s
> ##
>    health HEALTH_WARN 377 pgs stale; 4 pgs stuck inactive; 377 pgs stuck stale; 948 pgs stuck unclean
>    monmap e1: 3 mons at {a=10.9.181.10:6789/0,b=10.9.181.11:6789/0,c=10.9.181.12:6789/0}, election epoch 2, quorum 0,1,2 a,b,c
>    osdmap e90: 16 osds: 16 up, 16 in
>     pgmap v5085: 3080 pgs: 4 creating, 1755 active+clean, 377 stale+active+clean, 944 active+remapped; 10175 MB data, 52057 MB used, 12244 GB / 12815 GB avail
>    mdsmap e16: 1/1/1 up {0=b=up:replay}, 2 up:standby ##
>
> Side-affects: I can create and map any Rados pools. I cannot for the life of me, write to them, format them, anything them. Making my entire cluster offline to clients.
>
> While I've parse and poured over the documentation, I really need experienced help, just to know how to get Ceph to recover, and then allow for operation again.
>
> I've restarted each daemon individually several times, after which I've also tried a complete stop and start of the cluster. After things settle, this reveals the same ceph -s status as I've posted above.
>
> Thanks for your time!

You'll want to start by running "ceph pg dump" and trying to find patterns in the PGs that are stale. If you put it up on pastebin or something I'm sure somebody will be happy to check it out too. PGs that are remapped are a problem with your CRUSH map - can you also post it?
And just for good measure we might as well see the output of "ceph osd dump" as well.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Gregory Farnum Aug. 24, 2012, 2:25 a.m. UTC | #5

Wow, those are quite a dynamic range of numbers — I'm not sure you can
count on the cluster behaving well with that much happening in the
overrides. If you actually have OSDs with varying capacity, you should
give them different CRUSH weights (using "ceph osd crush set ...")
rather than using the monitor's reweight functionality.

I confess I don't see exactly why so many PGs are stale, but fixing
the weights certainly won't hurt.
-Greg

On Thu, Aug 23, 2012 at 7:07 PM, Ryan Nicholson <Ryan.Nicholson@kcrg.com> wrote:
> http://pastebin.com/5bRiUTxf
>
> Greg I've also attached a "ceph osd tree" dump (above). From what I can tell, the tree is correct, and lines up with how I desire to weight the cluster(s), however, I do see that the reweight for the smaller osds (SCSI-Nodes) are less than 1. Perhaps I need to look at this?
>
> Thanks,
>
> Ryan Nicholson
> (ryann)
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Gregory Farnum
> Sent: Thursday, August 23, 2012 2:41 PM
> To: Ryan Nicholson
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: PG's
>
> On Thu, Aug 23, 2012 at 2:51 PM, Ryan Nicholson <Ryan.Nicholson@kcrg.com> wrote:
>> All:
>>
>> I have a 16-OSD cluster running 0.48 (Argonaut), built from source.
>>
>> I rebuilt the entire cluster on Sunday Evening 8-19-2012, and started some rados testing.
>>
>> I have a custom CRUSH map, that calls for the "rbd", "metadata" pools and a custom pool called "SCSI" to be pulled from osd.0-11, while the "data" pool is pulled from osd.12-15. While testing, I find that the cluster is putting data where I want it to, with one exception: the SCSI pool is not storing data evenly thoughout the osd.0-11. Through "df", I find that about every other osd is seeing space utilization.
>>
>> So, whether good or bad, I did a "ceph osd reweight-by-utilization", which did improve the situation.
>>
>> Now, after doing some more research in the mailing lists, I find that I should have just let the cluster figure it out on its own.
>>
>> All that to lead to the problem I'm having now, and, I wish to use this mistake as a learning tool. My ceph status is this:
>>
>> ceph -s
>> ##
>>    health HEALTH_WARN 377 pgs stale; 4 pgs stuck inactive; 377 pgs stuck stale; 948 pgs stuck unclean
>>    monmap e1: 3 mons at {a=10.9.181.10:6789/0,b=10.9.181.11:6789/0,c=10.9.181.12:6789/0}, election epoch 2, quorum 0,1,2 a,b,c
>>    osdmap e90: 16 osds: 16 up, 16 in
>>     pgmap v5085: 3080 pgs: 4 creating, 1755 active+clean, 377 stale+active+clean, 944 active+remapped; 10175 MB data, 52057 MB used, 12244 GB / 12815 GB avail
>>    mdsmap e16: 1/1/1 up {0=b=up:replay}, 2 up:standby ##
>>
>> Side-affects: I can create and map any Rados pools. I cannot for the life of me, write to them, format them, anything them. Making my entire cluster offline to clients.
>>
>> While I've parse and poured over the documentation, I really need experienced help, just to know how to get Ceph to recover, and then allow for operation again.
>>
>> I've restarted each daemon individually several times, after which I've also tried a complete stop and start of the cluster. After things settle, this reveals the same ceph -s status as I've posted above.
>>
>> Thanks for your time!
>
> You'll want to start by running "ceph pg dump" and trying to find patterns in the PGs that are stale. If you put it up on pastebin or something I'm sure somebody will be happy to check it out too. PGs that are remapped are a problem with your CRUSH map - can you also post it?
> And just for good measure we might as well see the output of "ceph osd dump" as well.
> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Add old_inodes to emetablob

Commit Message

Comments

Patch