
Luminous RC feedback - device classes and osd df weirdness

Message ID 71ce32a8-6232-0f5b-35a5-d86f30a045db@catalyst.net.nz (mailing list archive)
State New, archived

Commit Message

Mark Kirkwood June 29, 2017, 5:04 a.m. UTC
Hi,

I'm running a 4 node test 'cluster' (VMs on my workstation) that I've 
upgraded to Luminous RC. Specifically I wanted to test having each node 
with 1 spinning device and one solid state device so I could try out 
device classes to create fast and slow(er) pools.

I started with 4 filestore osds (coming from the Jewel pre-upgrade), and 
added 4 more, all of which were Bluestore on the ssds.

I used crushtool to set the device classes (see the crush.txt diff below).
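
Roughly, that was the usual crush map round trip (file names here are just 
examples):

$ sudo ceph osd getcrushmap -o crush.bin     # dump the current map
$ crushtool -d crush.bin -o crush.txt        # decompile to text
  (edit crush.txt - the class annotations are in the diff below)
$ crushtool -c crush.txt -o crush.new        # recompile
$ sudo ceph osd setcrushmap -i crush.new     # inject the edited map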

That all went very smoothly, with only a couple of things that seemed 
weird. Firstly the crush/osd tree output is a bit strange (but I could 
get to the point where it makes sense):

$ sudo ceph osd tree
ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-15 0.23196 root default~ssd
-11 0.05699     host ceph1~ssd
   4 0.05699         osd.4           up  1.00000 1.00000
-12 0.05899     host ceph2~ssd
   5 0.05899         osd.5           up  1.00000 1.00000
-13 0.05699     host ceph3~ssd
   6 0.05699         osd.6           up  1.00000 1.00000
-14 0.05899     host ceph4~ssd
   7 0.05899         osd.7           up  1.00000 1.00000
-10 0.07996 root default~hdd
  -6 0.01999     host ceph1~hdd
   0 0.01999         osd.0           up  1.00000 1.00000
  -7 0.01999     host ceph2~hdd
   1 0.01999         osd.1           up  1.00000 1.00000
  -8 0.01999     host ceph3~hdd
   2 0.01999         osd.2           up  1.00000 1.00000
  -9 0.01999     host ceph4~hdd
   3 0.01999         osd.3           up  1.00000 1.00000
  -1 0.31198 root default
  -2 0.07700     host ceph1
   0 0.01999         osd.0           up  1.00000 1.00000
   4 0.05699         osd.4           up  1.00000 1.00000
  -3 0.07899     host ceph2
   1 0.01999         osd.1           up  1.00000 1.00000
   5 0.05899         osd.5           up  1.00000 1.00000
  -4 0.07700     host ceph3
   2 0.01999         osd.2           up  1.00000 1.00000
   6 0.05699         osd.6           up  1.00000 1.00000
  -5 0.07899     host ceph4
   3 0.01999         osd.3           up  1.00000 1.00000
   7 0.05899         osd.7           up  1.00000 1.00000


But the osd df output is baffling: I've got two identical lines for each 
osd (hard to see immediately - sorting by osd id would make it easier). 
This is not ideal, particularly as for the bluestore guys there is no 
other way to work out utilization. Any ideas - have I done something 
obviously wrong here that is triggering the 2 lines?

$ sudo ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
               TOTAL   338G  4955M   333G 1.43
MIN/MAX VAR: 0.11/1.45  STDDEV: 0.97


The modifications to the crush map are shown in the patch below.


Comments

Sage Weil June 29, 2017, 1:27 p.m. UTC | #1
On Thu, 29 Jun 2017, Mark Kirkwood wrote:
> Hi,
> 
> I'm running a 4 node test 'cluster' (VMs on my workstation) that I've upgraded
> to Luminous RC. Specifically I wanted to test having each node with 1 spinning
> device and one solid state device so I could try out device classes to create
> fast and slow(er) pools.
> 
> I started with 4 filestore osds (coming from the Jewel pre-upgrade), and added
> 4 more, all of which were Bluestore on the ssds.
> 
> I used crushtool to set the device classes (see crush test diff below).
> 
> That all went very smoothly, with only a couple of things that seemed weird.
> Firstly the crush/osd tree output is a bit strange (but I could get to the
> point where it makes sense):
> 
> $ sudo ceph osd tree
> ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -15 0.23196 root default~ssd
> -11 0.05699     host ceph1~ssd
>   4 0.05699         osd.4           up  1.00000 1.00000
> -12 0.05899     host ceph2~ssd
>   5 0.05899         osd.5           up  1.00000 1.00000
> -13 0.05699     host ceph3~ssd
>   6 0.05699         osd.6           up  1.00000 1.00000
> -14 0.05899     host ceph4~ssd
>   7 0.05899         osd.7           up  1.00000 1.00000
> -10 0.07996 root default~hdd
>  -6 0.01999     host ceph1~hdd
>   0 0.01999         osd.0           up  1.00000 1.00000
>  -7 0.01999     host ceph2~hdd
>   1 0.01999         osd.1           up  1.00000 1.00000
>  -8 0.01999     host ceph3~hdd
>   2 0.01999         osd.2           up  1.00000 1.00000
>  -9 0.01999     host ceph4~hdd
>   3 0.01999         osd.3           up  1.00000 1.00000
>  -1 0.31198 root default
>  -2 0.07700     host ceph1
>   0 0.01999         osd.0           up  1.00000 1.00000
>   4 0.05699         osd.4           up  1.00000 1.00000
>  -3 0.07899     host ceph2
>   1 0.01999         osd.1           up  1.00000 1.00000
>   5 0.05899         osd.5           up  1.00000 1.00000
>  -4 0.07700     host ceph3
>   2 0.01999         osd.2           up  1.00000 1.00000
>   6 0.05699         osd.6           up  1.00000 1.00000
>  -5 0.07899     host ceph4
>   3 0.01999         osd.3           up  1.00000 1.00000
>   7 0.05899         osd.7           up  1.00000 1.00000

I was a bit divided when we were doing this about whether the unfiltered 
output (above) or a view that hides the per-class trees is better.  Maybe

 ceph osd tree

would show the traditional view (with a device class column) and

 ceph osd class-tree <class>

would show a single class?

> But the osd df output is baffling, I've got two identical lines for each osd
> (hard to see immediately - sorting by osd id would make it easier). This is
> not ideal, particularly as for the bluestore guys there is no other way to
> work out utilization. Any ideas - have I done something obviously wrong here
> that is triggering the 2 lines?
> 
> $ sudo ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
>  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
>  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
>  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
>  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
>  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
>  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
>  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
>  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
>  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
>  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
>  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
>  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
>  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
>  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
>  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
>  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
>               TOTAL   338G  4955M   333G 1.43
> MIN/MAX VAR: 0.11/1.45  STDDEV: 0.97

This is just a bug, fixing.

Thanks!
sage



> 
> 
> The modifications to crush map
> --- crush.txt.orig    2017-06-28 14:38:38.067669000 +1200
> +++ crush.txt    2017-06-28 14:41:22.071669000 +1200
> @@ -8,14 +8,14 @@
>  tunable allowed_bucket_algs 54
> 
>  # devices
> -device 0 osd.0
> -device 1 osd.1
> -device 2 osd.2
> -device 3 osd.3
> -device 4 osd.4
> -device 5 osd.5
> -device 6 osd.6
> -device 7 osd.7
> +device 0 osd.0 class hdd
> +device 1 osd.1 class hdd
> +device 2 osd.2 class hdd
> +device 3 osd.3 class hdd
> +device 4 osd.4 class ssd
> +device 5 osd.5 class ssd
> +device 6 osd.6 class ssd
> +device 7 osd.7 class ssd
> 
>  # types
>  type 0 osd
> @@ -80,7 +80,7 @@
>      type replicated
>      min_size 1
>      max_size 10
> -    step take default
> +    step take default class hdd
>      step chooseleaf firstn 0 type host
>      step emit
>  }
> 
Sage Weil June 29, 2017, 2:04 p.m. UTC | #2
On Thu, 29 Jun 2017, Sage Weil wrote:
> On Thu, 29 Jun 2017, Mark Kirkwood wrote:
> > Hi,
> > 
> > I'm running a 4 node test 'cluster' (VMs on my workstation) that I've upgraded
> > to Luminous RC. Specifically I wanted to test having each node with 1 spinning
> > device and one solid state device so I could try out device classes to create
> > fast and slow(er) pools.
> > 
> > I started with 4 filestore osds (coming from the Jewel pre-upgrade), and added
> > 4 more, all of which were Bluestore on the ssds.
> > 
> > I used crushtool to set the device classes (see crush test diff below).
> > 
> > That all went very smoothly, with only a couple of things that seemed weird.
> > Firstly the crush/osd tree output is a bit strange (but I could get to the
> > point where it makes sense):
> > 
> > $ sudo ceph osd tree
> > ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -15 0.23196 root default~ssd
> > -11 0.05699     host ceph1~ssd
> >   4 0.05699         osd.4           up  1.00000 1.00000
> > -12 0.05899     host ceph2~ssd
> >   5 0.05899         osd.5           up  1.00000 1.00000
> > -13 0.05699     host ceph3~ssd
> >   6 0.05699         osd.6           up  1.00000 1.00000
> > -14 0.05899     host ceph4~ssd
> >   7 0.05899         osd.7           up  1.00000 1.00000
> > -10 0.07996 root default~hdd
> >  -6 0.01999     host ceph1~hdd
> >   0 0.01999         osd.0           up  1.00000 1.00000
> >  -7 0.01999     host ceph2~hdd
> >   1 0.01999         osd.1           up  1.00000 1.00000
> >  -8 0.01999     host ceph3~hdd
> >   2 0.01999         osd.2           up  1.00000 1.00000
> >  -9 0.01999     host ceph4~hdd
> >   3 0.01999         osd.3           up  1.00000 1.00000
> >  -1 0.31198 root default
> >  -2 0.07700     host ceph1
> >   0 0.01999         osd.0           up  1.00000 1.00000
> >   4 0.05699         osd.4           up  1.00000 1.00000
> >  -3 0.07899     host ceph2
> >   1 0.01999         osd.1           up  1.00000 1.00000
> >   5 0.05899         osd.5           up  1.00000 1.00000
> >  -4 0.07700     host ceph3
> >   2 0.01999         osd.2           up  1.00000 1.00000
> >   6 0.05699         osd.6           up  1.00000 1.00000
> >  -5 0.07899     host ceph4
> >   3 0.01999         osd.3           up  1.00000 1.00000
> >   7 0.05899         osd.7           up  1.00000 1.00000
> 
> I was a bit divided when we were doing this about whether the unfiltered 
> output (above) or a view that hides the per-class trees is better.  Maybe
> 
>  ceph osd tree
> 
> would show the traditional view (with a device class column) and
> 
>  ceph osd class-tree <class>
> 
> would show a single class?

For now, see 

	https://github.com/ceph/ceph/pull/16016

- Do not show each per-class variation of the hierarchy
- Do include a CLASS column in the tree view

This is still somewhat incomplete in that the MIN/MAX/VAR values are 
overall and not per-class, which makes them less useful if you are 
actually using the classes.

sage


> 
> > But the osd df output is baffling, I've got two identical lines for each osd
> > (hard to see immediately - sorting by osd id would make it easier). This is
> > not ideal, particularly as for the bluestore guys there is no other way to
> > work out utilization. Any ideas - have I done something obviously wrong here
> > that is triggering the 2 lines?
> > 
> > $ sudo ceph osd df
> > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
> >  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
> >  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
> >  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
> >  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
> >  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
> >  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
> >  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
> >  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
> >  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
> >  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
> >  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
> >  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
> >  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
> >  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
> >  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
> >  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
> >               TOTAL   338G  4955M   333G 1.43
> > MIN/MAX VAR: 0.11/1.45  STDDEV: 0.97
> 
> This is just a bug, fixing.
> 
> Thanks!
> sage
> 
> 
> 
> > 
> > 
> > The modifications to crush map
> > --- crush.txt.orig    2017-06-28 14:38:38.067669000 +1200
> > +++ crush.txt    2017-06-28 14:41:22.071669000 +1200
> > @@ -8,14 +8,14 @@
> >  tunable allowed_bucket_algs 54
> > 
> >  # devices
> > -device 0 osd.0
> > -device 1 osd.1
> > -device 2 osd.2
> > -device 3 osd.3
> > -device 4 osd.4
> > -device 5 osd.5
> > -device 6 osd.6
> > -device 7 osd.7
> > +device 0 osd.0 class hdd
> > +device 1 osd.1 class hdd
> > +device 2 osd.2 class hdd
> > +device 3 osd.3 class hdd
> > +device 4 osd.4 class ssd
> > +device 5 osd.5 class ssd
> > +device 6 osd.6 class ssd
> > +device 7 osd.7 class ssd
> > 
> >  # types
> >  type 0 osd
> > @@ -80,7 +80,7 @@
> >      type replicated
> >      min_size 1
> >      max_size 10
> > -    step take default
> > +    step take default class hdd
> >      step chooseleaf firstn 0 type host
> >      step emit
> >  }
> > 
Mark Kirkwood July 19, 2017, 6:57 a.m. UTC | #3
On 29/06/17 17:04, Mark Kirkwood wrote:

>
>
> That all went very smoothly, with only a couple of things that seemed 
> weird. Firstly the crush/osd tree output is a bit strange (but I could 
> get to the point where it makes sense):
>
> $ sudo ceph osd tree
> ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -15 0.23196 root default~ssd
> -11 0.05699     host ceph1~ssd
>   4 0.05699         osd.4           up  1.00000 1.00000
> -12 0.05899     host ceph2~ssd
>   5 0.05899         osd.5           up  1.00000 1.00000
> -13 0.05699     host ceph3~ssd
>   6 0.05699         osd.6           up  1.00000 1.00000
> -14 0.05899     host ceph4~ssd
>   7 0.05899         osd.7           up  1.00000 1.00000
> -10 0.07996 root default~hdd
>  -6 0.01999     host ceph1~hdd
>   0 0.01999         osd.0           up  1.00000 1.00000
>  -7 0.01999     host ceph2~hdd
>   1 0.01999         osd.1           up  1.00000 1.00000
>  -8 0.01999     host ceph3~hdd
>   2 0.01999         osd.2           up  1.00000 1.00000
>  -9 0.01999     host ceph4~hdd
>   3 0.01999         osd.3           up  1.00000 1.00000
>  -1 0.31198 root default
>  -2 0.07700     host ceph1
>   0 0.01999         osd.0           up  1.00000 1.00000
>   4 0.05699         osd.4           up  1.00000 1.00000
>  -3 0.07899     host ceph2
>   1 0.01999         osd.1           up  1.00000 1.00000
>   5 0.05899         osd.5           up  1.00000 1.00000
>  -4 0.07700     host ceph3
>   2 0.01999         osd.2           up  1.00000 1.00000
>   6 0.05699         osd.6           up  1.00000 1.00000
>  -5 0.07899     host ceph4
>   3 0.01999         osd.3           up  1.00000 1.00000
>   7 0.05899         osd.7           up  1.00000 1.00000
>
>
> But the osd df output is baffling, I've got two identical lines for 
> each osd (hard to see immediately - sorting by osd id would make it 
> easier). This is not ideal, particularly as for the bluestore guys 
> there is no other way to work out utilization. Any ideas - have I done 
> something obviously wrong here that is triggering the 2 lines?
>
> $ sudo ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
>  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
>  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
>  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
>  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
>  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
>  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
>  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
>  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
>  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
>  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
>  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
>  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
>  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
>  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
>  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
>  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
>               TOTAL   338G  4955M   333G 1.43
> MIN/MAX VAR: 0.11/1.45  STDDEV: 0.97

Revisiting these points after reverting to Jewel again and freshly 
upgrading to 12.1.1:

$ sudo ceph osd tree
ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
-1       0.32996 root default
-2       0.08199     host ceph1
  0   hdd 0.02399         osd.0       up  1.00000 1.00000
  4   ssd 0.05699         osd.4       up  1.00000 1.00000
-3       0.08299     host ceph2
  1   hdd 0.02399         osd.1       up  1.00000 1.00000
  5   ssd 0.05899         osd.5       up  1.00000 1.00000
-4       0.08199     host ceph3
  2   hdd 0.02399         osd.2       up  1.00000 1.00000
  6   ssd 0.05699         osd.6       up  1.00000 1.00000
-5       0.08299     host ceph4
  3   hdd 0.02399         osd.3       up  1.00000 1.00000
  7   ssd 0.05899         osd.7       up  1.00000 1.00000

This looks much more friendly!

$ sudo ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR PGS
  0   hdd 0.02399  1.00000 25586M 89848k 25498M  0.34 0.03 109
  4   ssd 0.05699  1.00000 60314M 10096M 50218M 16.74 1.34 60
  1   hdd 0.02399  1.00000 25586M 93532k 25495M  0.36 0.03 103
  5   ssd 0.05899  1.00000 61586M  9987M 51598M 16.22 1.30 59
  2   hdd 0.02399  1.00000 25586M 88120k 25500M  0.34 0.03 111
  6   ssd 0.05699  1.00000 60314M 12403M 47911M 20.56 1.64 75
  3   hdd 0.02399  1.00000 25586M 94688k 25494M  0.36 0.03 125
  7   ssd 0.05899  1.00000 61586M 10435M 51151M 16.94 1.36 62
                     TOTAL   338G 43280M   295G 12.50
MIN/MAX VAR: 0.03/1.64  STDDEV: 9.40

...and this is vastly better too. Bit of a toss up whether ordering by 
host (which is what seems to be happening here) or ordering by osd id 
is better, but clearly there are bound to be differing POVs on this - I'm 
happy with the current choice.

One (I think) new thing compared to the 12.1.0 is that restarting the 
services blitzes the modified crushmap, and we get back to:

$ sudo ceph osd tree
ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
-1       0.32996 root default
-2       0.08199     host ceph1
  0   hdd 0.02399         osd.0       up  1.00000 1.00000
  4   hdd 0.05699         osd.4       up  1.00000 1.00000
-3       0.08299     host ceph2
  1   hdd 0.02399         osd.1       up  1.00000 1.00000
  5   hdd 0.05899         osd.5       up  1.00000 1.00000
-4       0.08199     host ceph3
  2   hdd 0.02399         osd.2       up  1.00000 1.00000
  6   hdd 0.05699         osd.6       up  1.00000 1.00000
-5       0.08299     host ceph4
  3   hdd 0.02399         osd.3       up  1.00000 1.00000
  7   hdd 0.05899         osd.7       up  1.00000 1.00000

...and all the PGs are remapped again. Now I might have just missed this 
happening with 12.1.0 - but I'm (moderately) confident that I did 
restart stuff and not see this happening. For now I've added:

osd crush update on start = false

to my ceph.conf to avoid being caught by this.
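
i.e. something like this on each OSD host (the [osd] section seems the 
natural place for it, though [global] should also work):

[osd]
    osd crush update on start = false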

regards

Mark


Sage Weil July 19, 2017, 2:53 p.m. UTC | #4
On Wed, 19 Jul 2017, Mark Kirkwood wrote:
> On 29/06/17 17:04, Mark Kirkwood wrote:
> 
> > 
> > 
> > That all went very smoothly, with only a couple of things that seemed weird.
> > Firstly the crush/osd tree output is a bit strange (but I could get to the
> > point where it makes sense):
> > 
> > $ sudo ceph osd tree
> > ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -15 0.23196 root default~ssd
> > -11 0.05699     host ceph1~ssd
> >   4 0.05699         osd.4           up  1.00000 1.00000
> > -12 0.05899     host ceph2~ssd
> >   5 0.05899         osd.5           up  1.00000 1.00000
> > -13 0.05699     host ceph3~ssd
> >   6 0.05699         osd.6           up  1.00000 1.00000
> > -14 0.05899     host ceph4~ssd
> >   7 0.05899         osd.7           up  1.00000 1.00000
> > -10 0.07996 root default~hdd
> >  -6 0.01999     host ceph1~hdd
> >   0 0.01999         osd.0           up  1.00000 1.00000
> >  -7 0.01999     host ceph2~hdd
> >   1 0.01999         osd.1           up  1.00000 1.00000
> >  -8 0.01999     host ceph3~hdd
> >   2 0.01999         osd.2           up  1.00000 1.00000
> >  -9 0.01999     host ceph4~hdd
> >   3 0.01999         osd.3           up  1.00000 1.00000
> >  -1 0.31198 root default
> >  -2 0.07700     host ceph1
> >   0 0.01999         osd.0           up  1.00000 1.00000
> >   4 0.05699         osd.4           up  1.00000 1.00000
> >  -3 0.07899     host ceph2
> >   1 0.01999         osd.1           up  1.00000 1.00000
> >   5 0.05899         osd.5           up  1.00000 1.00000
> >  -4 0.07700     host ceph3
> >   2 0.01999         osd.2           up  1.00000 1.00000
> >   6 0.05699         osd.6           up  1.00000 1.00000
> >  -5 0.07899     host ceph4
> >   3 0.01999         osd.3           up  1.00000 1.00000
> >   7 0.05899         osd.7           up  1.00000 1.00000
> > 
> > 
> > But the osd df output is baffling, I've got two identical lines for each osd
> > (hard to see immediately - sorting by osd id would make it easier). This is
> > not ideal, particularly as for the bluestore guys there is no other way to
> > work out utilization. Any ideas - have I done something obviously wrong here
> > that is triggering the 2 lines?
> > 
> > $ sudo ceph osd df
> > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
> >  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
> >  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
> >  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
> >  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
> >  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
> >  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
> >  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
> >  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
> >  0 0.01999  1.00000 25586M 43812k 25543M 0.17 0.12  45
> >  4 0.05699  1.00000 60314M  1093M 59221M 1.81 1.27   0
> >  1 0.01999  1.00000 25586M 42636k 25544M 0.16 0.11  37
> >  5 0.05899  1.00000 61586M  1234M 60351M 2.00 1.40   0
> >  2 0.01999  1.00000 25586M 44336k 25543M 0.17 0.12  53
> >  6 0.05699  1.00000 60314M  1248M 59066M 2.07 1.45   0
> >  3 0.01999  1.00000 25586M 42716k 25544M 0.16 0.11  57
> >  7 0.05899  1.00000 61586M  1209M 60376M 1.96 1.37   0
> >               TOTAL   338G  4955M   333G 1.43
> > MIN/MAX VAR: 0.11/1.45  STDDEV: 0.97
> 
> Revisiting these points after reverting to Jewel again and freshly upgrading
> to 12.1.1:
> 
> $ sudo ceph osd tree
> ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
> -1       0.32996 root default
> -2       0.08199     host ceph1
>  0   hdd 0.02399         osd.0       up  1.00000 1.00000
>  4   ssd 0.05699         osd.4       up  1.00000 1.00000
> -3       0.08299     host ceph2
>  1   hdd 0.02399         osd.1       up  1.00000 1.00000
>  5   ssd 0.05899         osd.5       up  1.00000 1.00000
> -4       0.08199     host ceph3
>  2   hdd 0.02399         osd.2       up  1.00000 1.00000
>  6   ssd 0.05699         osd.6       up  1.00000 1.00000
> -5       0.08299     host ceph4
>  3   hdd 0.02399         osd.3       up  1.00000 1.00000
>  7   ssd 0.05899         osd.7       up  1.00000 1.00000
> 
> This looks much more friendly!
> 
> $ sudo ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR PGS
>  0   hdd 0.02399  1.00000 25586M 89848k 25498M  0.34 0.03 109
>  4   ssd 0.05699  1.00000 60314M 10096M 50218M 16.74 1.34 60
>  1   hdd 0.02399  1.00000 25586M 93532k 25495M  0.36 0.03 103
>  5   ssd 0.05899  1.00000 61586M  9987M 51598M 16.22 1.30 59
>  2   hdd 0.02399  1.00000 25586M 88120k 25500M  0.34 0.03 111
>  6   ssd 0.05699  1.00000 60314M 12403M 47911M 20.56 1.64 75
>  3   hdd 0.02399  1.00000 25586M 94688k 25494M  0.36 0.03 125
>  7   ssd 0.05899  1.00000 61586M 10435M 51151M 16.94 1.36 62
>                     TOTAL   338G 43280M   295G 12.50
> MIN/MAX VAR: 0.03/1.64  STDDEV: 9.40
> 
> ...and this is vastly better too. Bit of a toss up whether ordering by host
> (which is what seems to be happening here) or ordering by osd id is better,
> but clearly there are bound to be differing POV on this - I'm happy with the
> current choice.

Great!  I think the result is actually ordered by the tree code but just 
doesn't format that way (and doesn't show the tree nodes).  It is a little 
weird, I agree.
 
> One (I think) new thing compared to the 12.1.0 is that restarting the services
> blitzes the modified crushmap, and we get back to:
> 
> $ sudo ceph osd tree
> ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
> -1       0.32996 root default
> -2       0.08199     host ceph1
>  0   hdd 0.02399         osd.0       up  1.00000 1.00000
>  4   hdd 0.05699         osd.4       up  1.00000 1.00000
> -3       0.08299     host ceph2
>  1   hdd 0.02399         osd.1       up  1.00000 1.00000
>  5   hdd 0.05899         osd.5       up  1.00000 1.00000
> -4       0.08199     host ceph3
>  2   hdd 0.02399         osd.2       up  1.00000 1.00000
>  6   hdd 0.05699         osd.6       up  1.00000 1.00000
> -5       0.08299     host ceph4
>  3   hdd 0.02399         osd.3       up  1.00000 1.00000
>  7   hdd 0.05899         osd.7       up  1.00000 1.00000
> 
> ...and all the PG are remapped again. Now I might have just missed this
> happening with 12.1.0 - but I'm (moderately) confident that I did restart
> stuff and not see this happening. For now I've added:
> 
> osd crush update on start = false
> 
> to my ceph.conf to avoid being caught by this.

Can you share the output of 'ceph osd metadata 0' vs 'ceph osd metadata 
4'?  I'm not sure why it's getting the class wrong.  I haven't seen this 
on my cluster (it's bluestore; maybe that's the difference).

Thanks!
sage
Mark Kirkwood July 19, 2017, 10:46 p.m. UTC | #5
On 20/07/17 02:53, Sage Weil wrote:

> On Wed, 19 Jul 2017, Mark Kirkwood wrote:
>
>   
>> One (I think) new thing compared to the 12.1.0 is that restarting the services
>> blitzes the modified crushmap, and we get back to:
>>
>> $ sudo ceph osd tree
>> ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
>> -1       0.32996 root default
>> -2       0.08199     host ceph1
>>   0   hdd 0.02399         osd.0       up  1.00000 1.00000
>>   4   hdd 0.05699         osd.4       up  1.00000 1.00000
>> -3       0.08299     host ceph2
>>   1   hdd 0.02399         osd.1       up  1.00000 1.00000
>>   5   hdd 0.05899         osd.5       up  1.00000 1.00000
>> -4       0.08199     host ceph3
>>   2   hdd 0.02399         osd.2       up  1.00000 1.00000
>>   6   hdd 0.05699         osd.6       up  1.00000 1.00000
>> -5       0.08299     host ceph4
>>   3   hdd 0.02399         osd.3       up  1.00000 1.00000
>>   7   hdd 0.05899         osd.7       up  1.00000 1.00000
>>
>> ...and all the PG are remapped again. Now I might have just missed this
>> happening with 12.1.0 - but I'm (moderately) confident that I did restart
>> stuff and not see this happening. For now I've added:
>>
>> osd crush update on start = false
>>
>> to my ceph.conf to avoid being caught by this.

Actually setting the above does *not* prevent the crushmap getting changed.

> Can you share the output of 'ceph osd metadata 0' vs 'ceph osd metadata
> 4'?  I'm not sure why it's getting the class wrong.  I haven't seen this
> on my cluster (it's bluestore; maybe that's the difference).
>
>

Yes, and it is quite interesting: osd 0 is filestore on hdd, osd 4 is 
bluestore on ssd but (see below) the metadata suggests ceph thinks it is 
hdd (the fact that the hosts are VMs might not be helping here):

$ sudo ceph osd metadata 0
{
     "id": 0,
     "arch": "x86_64",
     "back_addr": "192.168.122.21:6806/1712",
     "backend_filestore_dev_node": "unknown",
     "backend_filestore_partition_path": "unknown",
     "ceph_version": "ceph version 12.1.1 
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
     "cpu": "QEMU Virtual CPU version 1.7.0",
     "distro": "ubuntu",
     "distro_description": "Ubuntu 16.04.2 LTS",
     "distro_version": "16.04",
     "filestore_backend": "xfs",
     "filestore_f_type": "0x58465342",
     "front_addr": "192.168.122.21:6805/1712",
     "hb_back_addr": "192.168.122.21:6807/1712",
     "hb_front_addr": "192.168.122.21:6808/1712",
     "hostname": "ceph1",
     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
     "kernel_version": "4.4.0-83-generic",
     "mem_swap_kb": "1047548",
     "mem_total_kb": "2048188",
     "os": "Linux",
     "osd_data": "/var/lib/ceph/osd/ceph-0",
     "osd_journal": "/var/lib/ceph/osd/ceph-0/journal",
     "osd_objectstore": "filestore",
     "rotational": "1"
}

$ sudo ceph osd metadata 4
{
     "id": 4,
     "arch": "x86_64",
     "back_addr": "192.168.122.21:6802/1488",
     "bluefs": "1",
     "bluefs_db_access_mode": "blk",
     "bluefs_db_block_size": "4096",
     "bluefs_db_dev": "253:32",
     "bluefs_db_dev_node": "vdc",
     "bluefs_db_driver": "KernelDevice",
     "bluefs_db_model": "",
     "bluefs_db_partition_path": "/dev/vdc2",
     "bluefs_db_rotational": "1",
     "bluefs_db_size": "63244840960",
     "bluefs_db_type": "hdd",
     "bluefs_single_shared_device": "1",
     "bluestore_bdev_access_mode": "blk",
     "bluestore_bdev_block_size": "4096",
     "bluestore_bdev_dev": "253:32",
     "bluestore_bdev_dev_node": "vdc",
     "bluestore_bdev_driver": "KernelDevice",
     "bluestore_bdev_model": "",
     "bluestore_bdev_partition_path": "/dev/vdc2",
     "bluestore_bdev_rotational": "1",
     "bluestore_bdev_size": "63244840960",
     "bluestore_bdev_type": "hdd",
     "ceph_version": "ceph version 12.1.1 
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
     "cpu": "QEMU Virtual CPU version 1.7.0",
     "distro": "ubuntu",
     "distro_description": "Ubuntu 16.04.2 LTS",
     "distro_version": "16.04",
     "front_addr": "192.168.122.21:6801/1488",
     "hb_back_addr": "192.168.122.21:6803/1488",
     "hb_front_addr": "192.168.122.21:6804/1488",
     "hostname": "ceph1",
     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
     "kernel_version": "4.4.0-83-generic",
     "mem_swap_kb": "1047548",
     "mem_total_kb": "2048188",
     "os": "Linux",
     "osd_data": "/var/lib/ceph/osd/ceph-4",
     "osd_journal": "/var/lib/ceph/osd/ceph-4/journal",
     "osd_objectstore": "bluestore",
     "rotational": "1"
}


Mark Kirkwood July 20, 2017, 12:36 a.m. UTC | #6
On 20/07/17 10:46, Mark Kirkwood wrote:

> On 20/07/17 02:53, Sage Weil wrote:
>
>> On Wed, 19 Jul 2017, Mark Kirkwood wrote:
>>
>>> One (I think) new thing compared to the 12.1.0 is that restarting 
>>> the services
>>> blitzes the modified crushmap, and we get back to:
>>>
>>> $ sudo ceph osd tree
>>> ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
>>> -1       0.32996 root default
>>> -2       0.08199     host ceph1
>>>   0   hdd 0.02399         osd.0       up  1.00000 1.00000
>>>   4   hdd 0.05699         osd.4       up  1.00000 1.00000
>>> -3       0.08299     host ceph2
>>>   1   hdd 0.02399         osd.1       up  1.00000 1.00000
>>>   5   hdd 0.05899         osd.5       up  1.00000 1.00000
>>> -4       0.08199     host ceph3
>>>   2   hdd 0.02399         osd.2       up  1.00000 1.00000
>>>   6   hdd 0.05699         osd.6       up  1.00000 1.00000
>>> -5       0.08299     host ceph4
>>>   3   hdd 0.02399         osd.3       up  1.00000 1.00000
>>>   7   hdd 0.05899         osd.7       up  1.00000 1.00000
>>>
>>> ...and all the PG are remapped again. Now I might have just missed this
>>> happening with 12.1.0 - but I'm (moderately) confident that I did 
>>> restart
>>> stuff and not see this happening. For now I've added:
>>>
>>> osd crush update on start = false
>>>
>>> to my ceph.conf to avoid being caught by this.
>
> Actually setting the above does *not* prevent the crushmap getting 
> changed.
>
>> Can you share the output of 'ceph osd metadata 0' vs 'ceph osd metadata
>> 4'?  I'm not sure why it's getting the class wrong.  I haven't seen this
>> on my cluster (it's bluestore; maybe that's the difference).
>>
>>
>
> Yes, and it is quite interesting: osd 0 is filestore on hdd, osd 4 is 
> bluestore on ssd but (see below) the metadata suggests ceph thinks it 
> is hdd (the fact that the hosts are VMs might not be helping here):
>
> $ sudo ceph osd metadata 0
> {
>     "id": 0,
>     "arch": "x86_64",
>     "back_addr": "192.168.122.21:6806/1712",
>     "backend_filestore_dev_node": "unknown",
>     "backend_filestore_partition_path": "unknown",
>     "ceph_version": "ceph version 12.1.1 
> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
>     "cpu": "QEMU Virtual CPU version 1.7.0",
>     "distro": "ubuntu",
>     "distro_description": "Ubuntu 16.04.2 LTS",
>     "distro_version": "16.04",
>     "filestore_backend": "xfs",
>     "filestore_f_type": "0x58465342",
>     "front_addr": "192.168.122.21:6805/1712",
>     "hb_back_addr": "192.168.122.21:6807/1712",
>     "hb_front_addr": "192.168.122.21:6808/1712",
>     "hostname": "ceph1",
>     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
>     "kernel_version": "4.4.0-83-generic",
>     "mem_swap_kb": "1047548",
>     "mem_total_kb": "2048188",
>     "os": "Linux",
>     "osd_data": "/var/lib/ceph/osd/ceph-0",
>     "osd_journal": "/var/lib/ceph/osd/ceph-0/journal",
>     "osd_objectstore": "filestore",
>     "rotational": "1"
> }
>
> $ sudo ceph osd metadata 4
> {
>     "id": 4,
>     "arch": "x86_64",
>     "back_addr": "192.168.122.21:6802/1488",
>     "bluefs": "1",
>     "bluefs_db_access_mode": "blk",
>     "bluefs_db_block_size": "4096",
>     "bluefs_db_dev": "253:32",
>     "bluefs_db_dev_node": "vdc",
>     "bluefs_db_driver": "KernelDevice",
>     "bluefs_db_model": "",
>     "bluefs_db_partition_path": "/dev/vdc2",
>     "bluefs_db_rotational": "1",
>     "bluefs_db_size": "63244840960",
>     "bluefs_db_type": "hdd",
>     "bluefs_single_shared_device": "1",
>     "bluestore_bdev_access_mode": "blk",
>     "bluestore_bdev_block_size": "4096",
>     "bluestore_bdev_dev": "253:32",
>     "bluestore_bdev_dev_node": "vdc",
>     "bluestore_bdev_driver": "KernelDevice",
>     "bluestore_bdev_model": "",
>     "bluestore_bdev_partition_path": "/dev/vdc2",
>     "bluestore_bdev_rotational": "1",
>     "bluestore_bdev_size": "63244840960",
>     "bluestore_bdev_type": "hdd",
>     "ceph_version": "ceph version 12.1.1 
> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
>     "cpu": "QEMU Virtual CPU version 1.7.0",
>     "distro": "ubuntu",
>     "distro_description": "Ubuntu 16.04.2 LTS",
>     "distro_version": "16.04",
>     "front_addr": "192.168.122.21:6801/1488",
>     "hb_back_addr": "192.168.122.21:6803/1488",
>     "hb_front_addr": "192.168.122.21:6804/1488",
>     "hostname": "ceph1",
>     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
>     "kernel_version": "4.4.0-83-generic",
>     "mem_swap_kb": "1047548",
>     "mem_total_kb": "2048188",
>     "os": "Linux",
>     "osd_data": "/var/lib/ceph/osd/ceph-4",
>     "osd_journal": "/var/lib/ceph/osd/ceph-4/journal",
>     "osd_objectstore": "bluestore",
>     "rotational": "1"
> }
>
>

I note that /sys/block/vdc/queue/rotational is 1, so this looks like 
libvirt is being dense about the virtual disk... if I cat '0' into the 
file then the osd restarts *do not* blitz the crushmap anymore - so it 
looks like the previous behaviour is brought on by my use of VMs. I'll try 
mashing it with a udev rule to get the 0 in there :-)
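
Something like the following should do it (untested sketch - adjust the 
KERNEL match to whichever virtio disks are actually ssd-backed, vdc in my 
case):

# /etc/udev/rules.d/99-ssd-rotational.rules (sketch)
ACTION=="add|change", KERNEL=="vdc", SUBSYSTEM=="block", ATTR{queue/rotational}="0"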

It is possibly worthy of a doco note about how this detection works at 
the Ceph level, just in case there are some weird SSD firmwares out 
there that result in the flag being set wrong in bare metal environments.

Cheers

Mark
Sage Weil July 20, 2017, 1:14 a.m. UTC | #7
On Thu, 20 Jul 2017, Mark Kirkwood wrote:
> On 20/07/17 10:46, Mark Kirkwood wrote:
> > On 20/07/17 02:53, Sage Weil wrote:
> > > On Wed, 19 Jul 2017, Mark Kirkwood wrote:
> > > 
> > > > One (I think) new thing compared to the 12.1.0 is that restarting the
> > > > services
> > > > blitzes the modified crushmap, and we get back to:
> > > > 
> > > > $ sudo ceph osd tree
> > > > ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
> > > > -1       0.32996 root default
> > > > -2       0.08199     host ceph1
> > > >   0   hdd 0.02399         osd.0       up  1.00000 1.00000
> > > >   4   hdd 0.05699         osd.4       up  1.00000 1.00000
> > > > -3       0.08299     host ceph2
> > > >   1   hdd 0.02399         osd.1       up  1.00000 1.00000
> > > >   5   hdd 0.05899         osd.5       up  1.00000 1.00000
> > > > -4       0.08199     host ceph3
> > > >   2   hdd 0.02399         osd.2       up  1.00000 1.00000
> > > >   6   hdd 0.05699         osd.6       up  1.00000 1.00000
> > > > -5       0.08299     host ceph4
> > > >   3   hdd 0.02399         osd.3       up  1.00000 1.00000
> > > >   7   hdd 0.05899         osd.7       up  1.00000 1.00000
> > > > 
> > > > ...and all the PG are remapped again. Now I might have just missed this
> > > > happening with 12.1.0 - but I'm (moderately) confident that I did
> > > > restart
> > > > stuff and not see this happening. For now I've added:
> > > > 
> > > > osd crush update on start = false
> > > > 
> > > > to my ceph.conf to avoid being caught by this.
> > 
> > Actually setting the above does *not* prevent the crushmap getting changed.
> > 
> > > Can you share the output of 'ceph osd metadata 0' vs 'ceph osd metadata
> > > 4'?  I'm not sure why it's getting the class wrong.  I haven't seen this
> > > on my cluster (it's bluestore; maybe that's the difference).
> > > 
> > > 
> > 
> > Yes, and it is quite interesting: osd 0 is filestore on hdd, osd 4 is
> > bluestore on ssd but (see below) the metadata suggests ceph thinks it is hdd
> > (the fact that the hosts are VMs might not be helping here):
> > 
> > $ sudo ceph osd metadata 0
> > {
> >     "id": 0,
> >     "arch": "x86_64",
> >     "back_addr": "192.168.122.21:6806/1712",
> >     "backend_filestore_dev_node": "unknown",
> >     "backend_filestore_partition_path": "unknown",
> >     "ceph_version": "ceph version 12.1.1
> > (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
> >     "cpu": "QEMU Virtual CPU version 1.7.0",
> >     "distro": "ubuntu",
> >     "distro_description": "Ubuntu 16.04.2 LTS",
> >     "distro_version": "16.04",
> >     "filestore_backend": "xfs",
> >     "filestore_f_type": "0x58465342",
> >     "front_addr": "192.168.122.21:6805/1712",
> >     "hb_back_addr": "192.168.122.21:6807/1712",
> >     "hb_front_addr": "192.168.122.21:6808/1712",
> >     "hostname": "ceph1",
> >     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
> >     "kernel_version": "4.4.0-83-generic",
> >     "mem_swap_kb": "1047548",
> >     "mem_total_kb": "2048188",
> >     "os": "Linux",
> >     "osd_data": "/var/lib/ceph/osd/ceph-0",
> >     "osd_journal": "/var/lib/ceph/osd/ceph-0/journal",
> >     "osd_objectstore": "filestore",
> >     "rotational": "1"
> > }
> > 
> > $ sudo ceph osd metadata 4
> > {
> >     "id": 4,
> >     "arch": "x86_64",
> >     "back_addr": "192.168.122.21:6802/1488",
> >     "bluefs": "1",
> >     "bluefs_db_access_mode": "blk",
> >     "bluefs_db_block_size": "4096",
> >     "bluefs_db_dev": "253:32",
> >     "bluefs_db_dev_node": "vdc",
> >     "bluefs_db_driver": "KernelDevice",
> >     "bluefs_db_model": "",
> >     "bluefs_db_partition_path": "/dev/vdc2",
> >     "bluefs_db_rotational": "1",
> >     "bluefs_db_size": "63244840960",
> >     "bluefs_db_type": "hdd",
> >     "bluefs_single_shared_device": "1",
> >     "bluestore_bdev_access_mode": "blk",
> >     "bluestore_bdev_block_size": "4096",
> >     "bluestore_bdev_dev": "253:32",
> >     "bluestore_bdev_dev_node": "vdc",
> >     "bluestore_bdev_driver": "KernelDevice",
> >     "bluestore_bdev_model": "",
> >     "bluestore_bdev_partition_path": "/dev/vdc2",
> >     "bluestore_bdev_rotational": "1",
> >     "bluestore_bdev_size": "63244840960",
> >     "bluestore_bdev_type": "hdd",
> >     "ceph_version": "ceph version 12.1.1
> > (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
> >     "cpu": "QEMU Virtual CPU version 1.7.0",
> >     "distro": "ubuntu",
> >     "distro_description": "Ubuntu 16.04.2 LTS",
> >     "distro_version": "16.04",
> >     "front_addr": "192.168.122.21:6801/1488",
> >     "hb_back_addr": "192.168.122.21:6803/1488",
> >     "hb_front_addr": "192.168.122.21:6804/1488",
> >     "hostname": "ceph1",
> >     "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
> >     "kernel_version": "4.4.0-83-generic",
> >     "mem_swap_kb": "1047548",
> >     "mem_total_kb": "2048188",
> >     "os": "Linux",
> >     "osd_data": "/var/lib/ceph/osd/ceph-4",
> >     "osd_journal": "/var/lib/ceph/osd/ceph-4/journal",
> >     "osd_objectstore": "bluestore",
> >     "rotational": "1"
> > }
> > 
> > 
> 
> I note that /sys/block/vdc/queue/rotational is 1 , so this looks like libvirt
> is being dense about the virtual disk...if I cat '0' into the file then the
> osd restarts *do not* blitz the crushmap anymore - so it looks like the previous
> behaviour is brought on by my use of VMs - I'll try mashing it with a udev
> rule to get the 0 in there :-)
> 
> It is possibly worthy of a doco note about how this detection works at the
> Ceph level, just in case there are some weird SSD firmwares out there that
> result in the flag being set wrong in bare metal environments.

Yeah.  Note that there is also a PR in flight streamlining some of the 
device class code that could potentially make the class more difficult to 
change, specifically to avoid situations like this.  For example, if the 
class is already set, the update on OSD start could be a no-op (do not 
change), and only set it if there is no class at all.  To change, you 
would (from the cli) run 'ceph osd crush rm-device-class osd.0' and then 
'ceph osd crush set-device-class osd.0' (or restart the osd, or perhaps 
pass a --force flag to set-device-class).  Does that seem like a reasonable 
path?
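
Concretely, the sequence would look roughly like

 ceph osd crush rm-device-class osd.0
 ceph osd crush set-device-class ssd osd.0

assuming set-device-class also takes the intended class name.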

I'm mostly worried about future changes to our auto-detect class logic.  
If we change anything (intentionally or not) we don't want to trigger a 
ton of data rebalancing on OSD restart because the class changes from, 
say, 'ssd' to 'nvme' due to improved detection logic.

sage

Mark Kirkwood July 20, 2017, 1:31 a.m. UTC | #8
On 20/07/17 13:14, Sage Weil wrote:

> On Thu, 20 Jul 2017, Mark Kirkwood wrote:
>> On 20/07/17 10:46, Mark Kirkwood wrote:
>>> On 20/07/17 02:53, Sage Weil wrote:
>>>> On Wed, 19 Jul 2017, Mark Kirkwood wrote:
>>>>
>>>>> One (I think) new thing compared to the 12.1.0 is that restarting the
>>>>> services
>>>>> blitzes the modified crushmap, and we get back to:
>>>>>
>>>>> $ sudo ceph osd tree
>>>>> ID CLASS WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRI-AFF
>>>>> -1       0.32996 root default
>>>>> -2       0.08199     host ceph1
>>>>>    0   hdd 0.02399         osd.0       up  1.00000 1.00000
>>>>>    4   hdd 0.05699         osd.4       up  1.00000 1.00000
>>>>> -3       0.08299     host ceph2
>>>>>    1   hdd 0.02399         osd.1       up  1.00000 1.00000
>>>>>    5   hdd 0.05899         osd.5       up  1.00000 1.00000
>>>>> -4       0.08199     host ceph3
>>>>>    2   hdd 0.02399         osd.2       up  1.00000 1.00000
>>>>>    6   hdd 0.05699         osd.6       up  1.00000 1.00000
>>>>> -5       0.08299     host ceph4
>>>>>    3   hdd 0.02399         osd.3       up  1.00000 1.00000
>>>>>    7   hdd 0.05899         osd.7       up  1.00000 1.00000
>>>>>
>>>>> ...and all the PG are remapped again. Now I might have just missed this
>>>>> happening with 12.1.0 - but I'm (moderately) confident that I did
>>>>> restart
>>>>> stuff and not see this happening. For now I've added:
>>>>>
>>>>> osd crush update on start = false
>>>>>
>>>>> to my ceph.conf to avoid being caught by this.
>>> Actually setting the above does *not* prevent the crushmap getting changed.
>>>
>>>> Can you share the output of 'ceph osd metadata 0' vs 'ceph osd metadata
>>>> 4'?  I'm not sure why it's getting the class wrong.  I haven't seen this
>>>> on my cluster (it's bluestore; maybe that's the difference).
>>>>
>>>>
>>> Yes, and it is quite interesting: osd 0 is filestore on hdd, osd 4 is
>>> bluestore on ssd but (see below) the metadata suggests ceph thinks it is hdd
>>> (the fact that the hosts are VMs might not be helping here):
>>>
>>> $ sudo ceph osd metadata 0
>>> {
>>>      "id": 0,
>>>      "arch": "x86_64",
>>>      "back_addr": "192.168.122.21:6806/1712",
>>>      "backend_filestore_dev_node": "unknown",
>>>      "backend_filestore_partition_path": "unknown",
>>>      "ceph_version": "ceph version 12.1.1
>>> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
>>>      "cpu": "QEMU Virtual CPU version 1.7.0",
>>>      "distro": "ubuntu",
>>>      "distro_description": "Ubuntu 16.04.2 LTS",
>>>      "distro_version": "16.04",
>>>      "filestore_backend": "xfs",
>>>      "filestore_f_type": "0x58465342",
>>>      "front_addr": "192.168.122.21:6805/1712",
>>>      "hb_back_addr": "192.168.122.21:6807/1712",
>>>      "hb_front_addr": "192.168.122.21:6808/1712",
>>>      "hostname": "ceph1",
>>>      "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
>>>      "kernel_version": "4.4.0-83-generic",
>>>      "mem_swap_kb": "1047548",
>>>      "mem_total_kb": "2048188",
>>>      "os": "Linux",
>>>      "osd_data": "/var/lib/ceph/osd/ceph-0",
>>>      "osd_journal": "/var/lib/ceph/osd/ceph-0/journal",
>>>      "osd_objectstore": "filestore",
>>>      "rotational": "1"
>>> }
>>>
>>> $ sudo ceph osd metadata 4
>>> {
>>>      "id": 4,
>>>      "arch": "x86_64",
>>>      "back_addr": "192.168.122.21:6802/1488",
>>>      "bluefs": "1",
>>>      "bluefs_db_access_mode": "blk",
>>>      "bluefs_db_block_size": "4096",
>>>      "bluefs_db_dev": "253:32",
>>>      "bluefs_db_dev_node": "vdc",
>>>      "bluefs_db_driver": "KernelDevice",
>>>      "bluefs_db_model": "",
>>>      "bluefs_db_partition_path": "/dev/vdc2",
>>>      "bluefs_db_rotational": "1",
>>>      "bluefs_db_size": "63244840960",
>>>      "bluefs_db_type": "hdd",
>>>      "bluefs_single_shared_device": "1",
>>>      "bluestore_bdev_access_mode": "blk",
>>>      "bluestore_bdev_block_size": "4096",
>>>      "bluestore_bdev_dev": "253:32",
>>>      "bluestore_bdev_dev_node": "vdc",
>>>      "bluestore_bdev_driver": "KernelDevice",
>>>      "bluestore_bdev_model": "",
>>>      "bluestore_bdev_partition_path": "/dev/vdc2",
>>>      "bluestore_bdev_rotational": "1",
>>>      "bluestore_bdev_size": "63244840960",
>>>      "bluestore_bdev_type": "hdd",
>>>      "ceph_version": "ceph version 12.1.1
>>> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)",
>>>      "cpu": "QEMU Virtual CPU version 1.7.0",
>>>      "distro": "ubuntu",
>>>      "distro_description": "Ubuntu 16.04.2 LTS",
>>>      "distro_version": "16.04",
>>>      "front_addr": "192.168.122.21:6801/1488",
>>>      "hb_back_addr": "192.168.122.21:6803/1488",
>>>      "hb_front_addr": "192.168.122.21:6804/1488",
>>>      "hostname": "ceph1",
>>>      "kernel_description": "#106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017",
>>>      "kernel_version": "4.4.0-83-generic",
>>>      "mem_swap_kb": "1047548",
>>>      "mem_total_kb": "2048188",
>>>      "os": "Linux",
>>>      "osd_data": "/var/lib/ceph/osd/ceph-4",
>>>      "osd_journal": "/var/lib/ceph/osd/ceph-4/journal",
>>>      "osd_objectstore": "bluestore",
>>>      "rotational": "1"
>>> }
>>>
>>>
>> I note that /sys/block/vdc/queue/rotational is 1 , so this looks like libvirt
>> is being dense about the virtual disk...if I cat '0' into the file then the
>> osd restarts *do not* blitz the crushmap anymore - so it looks like the previous
>> behaviour is brought on by my use of VMs - I'll try mashing it with a udev
>> rule to get the 0 in there :-)
>>
>> It is possibly worthy of a doco note about how this detection works at the
>> Ceph level, just in case there are some weird SSD firmwares out there that
>> result in the flag being set wrong in bare metal environments.
> Yeah.  Note that there is also a PR in flight streamlining some of the
> device class code that could potentially make the class more difficult to
> change, specifically to avoid situations like this.  For example, if the
> class is already set, the update on OSD start could be a no-op (do not
> change), and only set it if there is no class at all.  To change, you
> would (from the cli), 'ceph osd crush rm-device-class osd.0' and then
> 'ceph osd crush set-device-class osd.0' (or restart the osd, or perhaps
> pass a --force flag to set-device-class.  Does that seem like a reasonable
> path?
>
> I'm mostly worried about future changes to our auto-detect class logic.
> If we change anything (intentionally or not) we don't want to trigger a
> ton of data rebalancing on OSD restart because the class changes from,
> say, 'ssd' to 'nvme' due to improved detection logic.
>
>

Yeah, some resistance to change once a class is set sounds like a good plan!

regards

Mark


Patch

--- crush.txt.orig    2017-06-28 14:38:38.067669000 +1200
+++ crush.txt    2017-06-28 14:41:22.071669000 +1200
@@ -8,14 +8,14 @@ 
  tunable allowed_bucket_algs 54

  # devices
-device 0 osd.0
-device 1 osd.1
-device 2 osd.2
-device 3 osd.3
-device 4 osd.4
-device 5 osd.5
-device 6 osd.6
-device 7 osd.7
+device 0 osd.0 class hdd
+device 1 osd.1 class hdd
+device 2 osd.2 class hdd
+device 3 osd.3 class hdd
+device 4 osd.4 class ssd
+device 5 osd.5 class ssd
+device 6 osd.6 class ssd
+device 7 osd.7 class ssd

  # types
  type 0 osd
@@ -80,7 +80,7 @@ 
      type replicated
      min_size 1
      max_size 10
-    step take default
+    step take default class hdd
      step chooseleaf firstn 0 type host
      step emit
  }