
reinstate ceph cluster_snap support

Message ID ortxifi1cs.fsf@livre.home (mailing list archive)
State New, archived

Commit Message

Alexandre Oliva Aug. 24, 2013, 2:56 p.m. UTC
On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:

> FWIW Alexandre, this feature was never really complete.  For it to work, 
> we also need to snapshot the monitors, and roll them back as well.

That depends on what's expected from the feature, actually.

One use is to roll back a single osd, and for that, the feature works
just fine.  Of course, for that one doesn't need the multi-osd snapshots
to be mutually consistent, but it's still convenient to be able to take
a global snapshot with a single command.

Another use is to roll back the entire cluster to an earlier state, and
for that, you *probably* want to roll back the monitors too, although it
doesn't seem like this is strictly necessary, unless some significant
configuration changes occurred in the cluster since the snapshot was
taken, and you want to roll back those too.

In my experience, rolling back only osds has worked just fine, with the
exception of cases in which the snapshot is much too old, and the mons
have already expired osdmaps after the last one the osd got when the
snapshot was taken.  For this one case, I have a patch that enables the
osd to rejoin the cluster in spite of the expired osdmaps, which has
always worked for me, but I understand there may be exceptional cluster
reconfigurations in which this wouldn't have worked.


As for snapshotting monitors...  I suppose the way to go is to start a
store.db dump in the background, instead of taking a btrfs snapshot, since
the store.db is not created as a subvolume.  That said, it would make
some sense to create it as one, so that it would be trivially snapshottable.


Anyway, I found a problem in the earlier patch: when I added a new disk
to my cluster this morning, it tried to iterate over osdmaps that were
not available (e.g. the long-gone osdmap 1), and crashed.

Here's a fixed version that makes sure we don't start the iteration
before m->get_first().

Comments

Sage Weil Aug. 27, 2013, 10:21 p.m. UTC | #1
Hi,

On Sat, 24 Aug 2013, Alexandre Oliva wrote:
> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
> 
> > FWIW Alexandre, this feature was never really complete.  For it to work, 
> > we also need to snapshot the monitors, and roll them back as well.
> 
> That depends on what's expected from the feature, actually.
> 
> One use is to roll back a single osd, and for that, the feature works
> just fine.  Of course, for that one doesn't need the multi-osd snapshots
> to be mutually consistent, but it's still convenient to be able to take
> a global snapshot with a single command.
> 
> Another use is to roll back the entire cluster to an earlier state, and
> for that, you *probably* want to roll back the monitors too, although it
> doesn't seem like this is strictly necessary, unless some significant
> configuration changes occurred in the cluster since the snapshot was
> taken, and you want to roll back those too.
> 
> In my experience, rolling back only osds has worked just fine, with the
> exception of cases in which the snapshot is much too old, and the mons
> have already expired osdmaps after the last one the osd got when the
> snapshot was taken.  For this one case, I have a patch that enables the
> osd to rejoin the cluster in spite of the expired osdmaps, which has
> always worked for me, but I understand there may be exceptional cluster
> reconfigurations in which this wouldn't have worked.
> 
> 
> As for snapshotting monitors...  I suppose the way to go is to start a
> store.db dump in the background, instead of taking a btrfs snapshot, since
> the store.db is not created as a subvolume.  That said, it would make
> some sense to create it as one, so that it would be trivially snapshottable.
> 
> 
> Anyway, I found a problem in the earlier patch: when I added a new disk
> to my cluster this morning, it tried to iterate over osdmaps that were
> not available (e.g. the long-gone osdmap 1), and crashed.
> 
> Here's a fixed version that makes sure we don't start the iteration
> before m->get_first().

In principle, we can add this back in.  I think it needs a few changes, 
though.

First, FileStore::snapshot() needs to pause and drain the workqueue before 
taking the snapshot, similar to what is done with the sync sequence.  
Otherwise it isn't a transactionally consistent snapshot and may tear some 
update.  Because it is draining the work queue, it *might* also need to 
drop some locks, but I'm hopeful that that isn't necessary.
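
To make the pause-and-drain idea concrete, here is a minimal, self-contained
C++ sketch of the ordering being described (drain in-flight ops, snapshot the
quiesced subvolume, resume).  It is not the actual FileStore code; the
OpQueue class and the do_btrfs_snapshot helper are made-up names for
illustration only.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>

// Assumed helper wrapping the btrfs snapshot ioctl; declared but not defined here.
int do_btrfs_snapshot(const std::string& name);

class OpQueue {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;
  int in_flight = 0;
  bool paused = false;

public:
  void queue(std::function<void()> op) {
    std::lock_guard<std::mutex> l(m);
    q.push(std::move(op));
    cv.notify_all();
  }

  // Worker loop body: dequeue one op unless paused, run it outside the lock.
  void run_one() {
    std::function<void()> op;
    {
      std::unique_lock<std::mutex> l(m);
      cv.wait(l, [&] { return !paused && !q.empty(); });
      op = std::move(q.front());
      q.pop();
      ++in_flight;
    }
    op();
    std::lock_guard<std::mutex> l(m);
    --in_flight;
    cv.notify_all();
  }

  // Stop handing out new work and wait for in-flight ops to finish, so the
  // on-disk state reflects only whole transactions when we snapshot.
  void pause_and_drain() {
    std::unique_lock<std::mutex> l(m);
    paused = true;
    cv.wait(l, [&] { return in_flight == 0; });
  }

  void unpause() {
    std::lock_guard<std::mutex> l(m);
    paused = false;
    cv.notify_all();
  }
};

int snapshot_consistently(OpQueue& opq, const std::string& name) {
  opq.pause_and_drain();            // no update can be torn past this point
  int r = do_btrfs_snapshot(name);  // checkpoint the quiesced subvolume
  opq.unpause();
  return r;
}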

Second, the call in handle_osd_map() should probably go in the loop a bit 
further down that is consuming maps.  It probably won't matter most of the 
time, but I'm paranoid about corner conditions.  It also avoids iterating 
over the new OSDMaps multiple times in the common case where there is no 
cluster_snap (a minor win).

Finally, eventually we should make this do a checkpoint on the mons too.  
We can add the osd snapping back in first, but before this can/should 
really be used the mons need to be snapshotted as well.  Probably that's 
just adding in a snapshot() method to MonitorStore.h and doing either a 
leveldb snap or making a full copy of store.db... I forget what leveldb is 
capable of here.
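
For reference, leveldb's snapshots are only in-memory read views, so a
persistent checkpoint of store.db would amount to copying its contents out
under such a view (or copying the database files wholesale).  Below is a
rough sketch of the copy approach using the plain leveldb API; it is not
Ceph's MonitorStore code, and the checkpoint_leveldb name and destination
path handling are made up for illustration.

#include <string>
#include <leveldb/db.h>

// Illustrative only: copy every key visible in a point-in-time view of
// 'db' into a freshly created database at 'dest_path'.
int checkpoint_leveldb(leveldb::DB* db, const std::string& dest_path) {
  const leveldb::Snapshot* snap = db->GetSnapshot();  // point-in-time view

  leveldb::DB* out = nullptr;
  leveldb::Options opts;
  opts.create_if_missing = true;
  opts.error_if_exists = true;
  leveldb::Status s = leveldb::DB::Open(opts, dest_path, &out);
  if (!s.ok()) {
    db->ReleaseSnapshot(snap);
    return -1;
  }

  leveldb::ReadOptions ro;
  ro.snapshot = snap;  // iterate the snapshot, not the live state
  leveldb::Iterator* it = db->NewIterator(ro);
  leveldb::Status ps;
  for (it->SeekToFirst(); ps.ok() && it->Valid(); it->Next())
    ps = out->Put(leveldb::WriteOptions(), it->key(), it->value());

  bool ok = ps.ok() && it->status().ok();
  delete it;
  delete out;
  db->ReleaseSnapshot(snap);
  return ok ? 0 : -1;
}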

sage
Yan, Zheng Aug. 28, 2013, 12:54 a.m. UTC | #2
On Wed, Aug 28, 2013 at 6:21 AM, Sage Weil <sage@inktank.com> wrote:
> Hi,
>
> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
>>
>> > FWIW Alexandre, this feature was never really complete.  For it to work,
>> > we also need to snapshot the monitors, and roll them back as well.
>>
>> That depends on what's expected from the feature, actually.
>>
>> One use is to roll back a single osd, and for that, the feature works
>> just fine.  Of course, for that one doesn't need the multi-osd snapshots
>> to be mutually consistent, but it's still convenient to be able to take
>> a global snapshot with a single command.
>>
>> Another use is to roll back the entire cluster to an earlier state, and
>> for that, you *probably* want to roll back the monitors too, although it
>> doesn't seem like this is strictly necessary, unless some significant
>> configuration changes occurred in the cluster since the snapshot was
>> taken, and you want to roll back those too.
>>
>> In my experience, rolling back only osds has worked just fine, with the
>> exception of cases in which the snapshot is much too old, and the mons
>> have already expired osdmaps after the last one the osd got when the
>> snapshot was taken.  For this one case, I have a patch that enables the
>> osd to rejoin the cluster in spite of the expired osdmaps, which has
>> always worked for me, but I understand there may be exceptional cluster
>> reconfigurations in which this wouldn't have worked.
>>
>>
>> As for snapshotting monitors...  I suppose the way to go is to start a
>> store.db dump in the background, instead of taking a btrfs snapshot, since
>> the store.db is not created as a subvolume.  That said, it would make
>> some sense to create it as one, so that it would be trivially snapshottable.
>>
>>
>> Anyway, I found a problem in the earlier patch: when I added a new disk
>> to my cluster this morning, it tried to iterate over osdmaps that were
>> not available (e.g. the long-gone osdmap 1), and crashed.
>>
>> Here's a fixed version that makes sure we don't start the iteration
>> before m->get_first().
>
> In principle, we can add this back in.  I think it needs a few changes,
> though.
>
> First, FileStore::snapshot() needs to pause and drain the workqueue before
> taking the snapshot, similar to what is done with the sync sequence.
> Otherwise it isn't a transactionally consistent snapshot and may tear some
> update.  Because it is draining the work queue, it *might* also need to
> drop some locks, but I'm hopeful that that isn't necessary.
>
> Second, the call in handle_osd_map() should probably go in the loop a bit
> further down that is consuming maps.  It probably won't matter most of the
> time, but I'm paranoid about corner conditions.  It also avoids iterating
> over the new OSDMaps multiple times in the common case where there is no
> cluster_snap (a minor win).
>
> Finally, eventually we should make this do a checkpoint on the mons too.
> We can add the osd snapping back in first, but before this can/should
> really be used the mons need to be snapshotted as well.  Probably that's
> just adding in a snapshot() method to MonitorStore.h and doing either a
> leveldb snap or making a full copy of store.db... I forget what leveldb is
> capable of here.
>

I think we also need to snapshot the osd journal.

Regards
Yan, Zheng
Sage Weil Aug. 28, 2013, 4:34 a.m. UTC | #3
On Wed, 28 Aug 2013, Yan, Zheng wrote:
> On Wed, Aug 28, 2013 at 6:21 AM, Sage Weil <sage@inktank.com> wrote:
> > Hi,
> >
> > On Sat, 24 Aug 2013, Alexandre Oliva wrote:
> >> On Aug 23, 2013, Sage Weil <sage@inktank.com> wrote:
> >>
> >> > FWIW Alexandre, this feature was never really complete.  For it to work,
> >> > we also need to snapshot the monitors, and roll them back as well.
> >>
> >> That depends on what's expected from the feature, actually.
> >>
> >> One use is to roll back a single osd, and for that, the feature works
> >> just fine.  Of course, for that one doesn't need the multi-osd snapshots
> >> to be mutually consistent, but it's still convenient to be able to take
> >> a global snapshot with a single command.
> >>
> >> Another use is to roll back the entire cluster to an earlier state, and
> >> for that, you *probably* want to roll back the monitors too, although it
> >> doesn't seem like this is strictly necessary, unless some significant
> >> configuration changes occurred in the cluster since the snapshot was
> >> taken, and you want to roll back those too.
> >>
> >> In my experience, rolling back only osds has worked just fine, with the
> >> exception of cases in which the snapshot is much too old, and the mons
> >> have already expired osdmaps after the last one the osd got when the
> >> snapshot was taken.  For this one case, I have a patch that enables the
> >> osd to rejoin the cluster in spite of the expired osdmaps, which has
> >> always worked for me, but I understand there may be exceptional cluster
> >> reconfigurations in which this wouldn't have worked.
> >>
> >>
> >> As for snapshotting monitors...  I suppose the way to go is to start a
> >> store.db dump in the background, instead of taking a btrfs snapshot, since
> >> the store.db is not created as a subvolume.  That said, it would make
> >> some sense to create it as one, so that it would be trivially snapshottable.
> >>
> >>
> >> Anyway, I found a problem in the earlier patch: when I added a new disk
> >> to my cluster this morning, it tried to iterate over osdmaps that were
> >> not available (e.g. the long-gone osdmap 1), and crashed.
> >>
> >> Here's a fixed version that makes sure we don't start the iteration
> >> before m->get_first().
> >
> > In principle, we can add this back in.  I think it needs a few changes,
> > though.
> >
> > First, FileStore::snapshot() needs to pause and drain the workqueue before
> > taking the snapshot, similar to what is done with the sync sequence.
> > Otherwise it isn't a transactionally consistent snapshot and may tear some
> > update.  Because it is draining the work queue, it *might* also need to
> > drop some locks, but I'm hopeful that that isn't necessary.
> >
> > Second, the call in handle_osd_map() should probably go in the loop a bit
> > further down that is consuming maps.  It probably won't matter most of the
> > time, but I'm paranoid about corner conditions.  It also avoids iterating
> > over the new OSDMaps multiple times in the common case where there is no
> > cluster_snap (a minor win).
> >
> > Finally, eventually we should make this do a checkpoint on the mons too.
> > We can add the osd snapping back in first, but before this can/should
> > really be used the mons need to be snapshotted as well.  Probably that's
> > just adding in a snapshot() method to MonitorStore.h and doing either a
> > leveldb snap or making a full copy of store.db... I forget what leveldb is
> > capable of here.
> >
> 
> I think we also need to snapshot the osd journal.

If the snapshot does a sync (draining op_tp before taking the snap), that puts
the file subvol in a consistent state.  To actually use it, ceph-osd rolls
back to that point on startup.  I didn't check that code, but I think what
it should do then is ignore/reset the journal.
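
A tiny sketch of that startup decision, with illustrative types and names
only (this is not the actual ceph-osd mount/replay path); it just shows
"ignore/reset the journal" versus the ordinary replay case.

#include <cstdint>

struct Journal {
  uint64_t newest_seq = 0;
  void reset_to(uint64_t seq) { newest_seq = seq; }   // discard entries, restart here
  void replay_after(uint64_t seq) { (void)seq; }      // normal WAL replay (stub)
};

struct Store {
  uint64_t committed_seq = 0;  // op seq recorded in the (possibly rolled-back) subvol
};

void mount_object_store(Store& store, Journal& journal,
                        bool rolled_back_to_cluster_snap) {
  if (rolled_back_to_cluster_snap) {
    // The snapshot was taken after a full sync/drain, so the subvol alone is
    // consistent; journal entries past its committed seq describe updates we
    // just rolled away from, so ignore them rather than replaying.
    journal.reset_to(store.committed_seq);
  } else {
    journal.replay_after(store.committed_seq);  // ordinary startup: replay newer entries
  }
}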

This is annoying code to test, unfortunately.

sage

Patch

reinstate ceph cluster_snap support

From: Alexandre Oliva <oliva@gnu.org>

This patch brings back and updates (for dumpling) the code originally
introduced to support “ceph osd cluster_snap <snap>”, which was
disabled and partially removed before cuttlefish.

Some minimal testing appears to indicate this even works: the modified
mon actually generated an osdmap with the cluster_snap request, and
starting a modified osd that was down and letting it catch up caused
the osd to take the requested snapshot.  I see no reason why it
wouldn't have taken it if it was up and running, so...  Why was this
feature disabled in the first place?

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
---
 src/mon/MonCommands.h |    6 ++++--
 src/mon/OSDMonitor.cc |   11 +++++++----
 src/osd/OSD.cc        |   14 ++++++++++++++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 8e9c2bb..225c687 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -431,8 +431,10 @@  COMMAND("osd set " \
 COMMAND("osd unset " \
 	"name=key,type=CephChoices,strings=pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub", \
 	"unset <key>", "osd", "rw", "cli,rest")
-COMMAND("osd cluster_snap", "take cluster snapshot (disabled)", \
-	"osd", "r", "")
+COMMAND("osd cluster_snap " \
+	"name=snap,type=CephString", \
+	"take cluster snapshot",	\
+	"osd", "r", "cli")
 COMMAND("osd down " \
 	"type=CephString,name=ids,n=N", \
 	"set osd(s) <id> [<id>...] down", "osd", "rw", "cli,rest")
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 07022ae..9bf9511 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -3099,10 +3099,13 @@  bool OSDMonitor::prepare_command(MMonCommand *m)
       return prepare_unset_flag(m, CEPH_OSDMAP_NODEEP_SCRUB);
 
   } else if (prefix == "osd cluster_snap") {
-    // ** DISABLE THIS FOR NOW **
-    ss << "cluster snapshot currently disabled (broken implementation)";
-    // ** DISABLE THIS FOR NOW **
-
+    string snap;
+    cmd_getval(g_ceph_context, cmdmap, "snap", snap);
+    pending_inc.cluster_snapshot = snap;
+    ss << "creating cluster snap " << snap;
+    getline(ss, rs);
+    wait_for_finished_proposal(new Monitor::C_Command(mon, m, 0, rs, get_last_committed()));
+    return true;
   } else if (prefix == "osd down" ||
 	     prefix == "osd out" ||
 	     prefix == "osd in" ||
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 1a77dae..01527a6 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -5022,6 +5022,20 @@  void OSD::handle_osd_map(MOSDMap *m)
     assert(0 == "MOSDMap lied about what maps it had?");
   }
 
+  // check for cluster snapshots
+  for (epoch_t cur = MAX(superblock.current_epoch + 1, m->get_first());
+       cur <= m->get_last(); cur++) {
+    OSDMapRef newmap = get_map(cur);
+    string cluster_snap = newmap->get_cluster_snapshot();
+    if (cluster_snap.length() == 0)
+      continue;
+
+    dout(0) << "creating cluster snapshot '" << cluster_snap << "'" << dendl;
+    int r = store->snapshot(cluster_snap);
+    if (r)
+      dout(0) << "failed to create cluster snapshot: " << cpp_strerror(r) << dendl;
+  }
+
   if (superblock.oldest_map) {
     int num = 0;
     epoch_t min(