From patchwork Thu Dec 1 14:17:15 2016
X-Patchwork-Submitter: David Disseldorp
X-Patchwork-Id: 9456293
Date: Thu, 1 Dec 2016 15:17:15 +0100
From: David Disseldorp
To: Samba Technical
Cc: "ceph-devel@vger.kernel.org", Martin Schwenke
Subject: [PATCH] Ceph RADOS cluster mutex helper for Samba CTDB
Message-ID: <20161201151715.019228c1@suse.de>
X-Mailing-List: ceph-devel@vger.kernel.org

Hi,

The attached patch-set implements a cluster mutex helper for Samba CTDB
using Ceph librados.

ctdb_mutex_ceph_rados_helper can be used as a recovery lock provider
for CTDB. When configured, split brain avoidance during CTDB recovery
will be handled using locks against an object located in a Ceph RADOS
pool.

I've also attached a standalone test script - @Martin: does this belong
in the ctdb test suite, or can I just commit it as a standalone test?
It has a few non-standard dependencies: a running Ceph cluster, the
rados and jq binaries.

Feedback appreciated.

Cheers, David

---
 ctdb/doc/Makefile                           |   3 +-
 ctdb/doc/ctdb_mutex_ceph_rados_helper.7.xml |  90 ++++++
 ctdb/tools/ctdb_mutex_ceph_rados_helper.c   | 334 ++++++++++++++++++++
 ctdb/wscript                                |  19 ++
 4 files changed, 445 insertions(+), 1 deletion(-)

From 47e653373073f1564844465eb416cdebb1dc4aa6 Mon Sep 17 00:00:00 2001
From: David Disseldorp
Date: Thu, 1 Dec 2016 13:33:22 +0100
Subject: [PATCH 1/2] ctdb: cluster mutex helper using Ceph RADOS

ctdb_mutex_ceph_rados_helper implements the cluster mutex helper API
atop Ceph using the librados rados_lock_exclusive()/rados_unlock()
functionality.

Once configured, split brain avoidance during CTDB recovery will be
handled using locks against an object located in a Ceph RADOS pool.

Signed-off-by: David Disseldorp
---
 ctdb/tools/ctdb_mutex_ceph_rados_helper.c | 334 ++++++++++++++++++++++++++++++
 ctdb/wscript                              |  19 ++
 2 files changed, 353 insertions(+)
 create mode 100644 ctdb/tools/ctdb_mutex_ceph_rados_helper.c

diff --git a/ctdb/tools/ctdb_mutex_ceph_rados_helper.c b/ctdb/tools/ctdb_mutex_ceph_rados_helper.c
new file mode 100644
index 0000000..8d19965
--- /dev/null
+++ b/ctdb/tools/ctdb_mutex_ceph_rados_helper.c
@@ -0,0 +1,334 @@
+/*
+   CTDB mutex helper using Ceph librados locks
+
+   Copyright (C) David Disseldorp 2016
+
+   Based on ctdb_mutex_fcntl_helper.c, which is:
+   Copyright (C) Martin Schwenke 2015
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, see .
+*/
+
+#include "replace.h"
+#include "system/filesys.h"
+#include "system/network.h"
+
+/* protocol.h is just needed for ctdb_sock_addr, which is used in system.h */
+#include "protocol/protocol.h"
+#include "common/system.h"
+#include "lib/util/time.h"
+#include "tevent.h"
+#include "talloc.h"
+#include "rados/librados.h"
+
+#define CTDB_MUTEX_CEPH_LOCK_NAME	"ctdb_reclock_mutex"
+#define CTDB_MUTEX_CEPH_LOCK_COOKIE	CTDB_MUTEX_CEPH_LOCK_NAME
+#define CTDB_MUTEX_CEPH_LOCK_DESC	"CTDB recovery lock"
+
+#define CTDB_MUTEX_STATUS_HOLDING "0"
+#define CTDB_MUTEX_STATUS_CONTENDED "1"
+#define CTDB_MUTEX_STATUS_TIMEOUT "2"
+#define CTDB_MUTEX_STATUS_ERROR "3"
+
+static char *progname = NULL;
+
+static int ctdb_mutex_rados_ctx_create(const char *ceph_cluster_name,
+				       const char *ceph_auth_name,
+				       const char *pool_name,
+				       rados_t *_ceph_cluster,
+				       rados_ioctx_t *_ioctx)
+{
+	rados_t ceph_cluster = NULL;
+	rados_ioctx_t ioctx = NULL;
+	int ret;
+
+	ret = rados_create2(&ceph_cluster, ceph_cluster_name, ceph_auth_name, 0);
+	if (ret < 0) {
+		fprintf(stderr, "%s: failed to initialise Ceph cluster %s as %s"
+			" - (%s)\n", progname, ceph_cluster_name, ceph_auth_name,
+			strerror(-ret));
+		return ret;
+	}
+
+	/* path=NULL tells librados to use default locations */
+	ret = rados_conf_read_file(ceph_cluster, NULL);
+	if (ret < 0) {
+		fprintf(stderr, "%s: failed to parse Ceph cluster config"
+			" - (%s)\n", progname, strerror(-ret));
+		rados_shutdown(ceph_cluster);
+		return ret;
+	}
+
+	ret = rados_connect(ceph_cluster);
+	if (ret < 0) {
+		fprintf(stderr, "%s: failed to connect to Ceph cluster %s as %s"
+			" - (%s)\n", progname, ceph_cluster_name, ceph_auth_name,
+			strerror(-ret));
+		rados_shutdown(ceph_cluster);
+		return ret;
+	}
+
+
+	ret = rados_ioctx_create(ceph_cluster, pool_name, &ioctx);
+	if (ret < 0) {
+		fprintf(stderr, "%s: failed to create Ceph ioctx for pool %s"
+			" - (%s)\n", progname, pool_name, strerror(-ret));
+		rados_shutdown(ceph_cluster);
+		return ret;
+	}
+
+	*_ceph_cluster = ceph_cluster;
+	*_ioctx = ioctx;
+
+	return 0;
+}
+
+static void ctdb_mutex_rados_ctx_destroy(rados_t ceph_cluster,
+					 rados_ioctx_t ioctx)
+{
+	rados_ioctx_destroy(ioctx);
+	rados_shutdown(ceph_cluster);
+}
+
+static int ctdb_mutex_rados_lock(rados_ioctx_t ioctx,
+				 const char *oid)
+{
+	int ret;
+
+	ret = rados_lock_exclusive(ioctx, oid,
+				   CTDB_MUTEX_CEPH_LOCK_NAME,
+				   CTDB_MUTEX_CEPH_LOCK_COOKIE,
+				   CTDB_MUTEX_CEPH_LOCK_DESC,
+				   NULL, /* infinite duration */
+				   0);
+	if ((ret == -EEXIST) || (ret == -EBUSY)) {
+		/* lock contention */
+		return ret;
+	} else if (ret < 0) {
+		/* unexpected failure */
+		fprintf(stderr,
+			"%s: Failed to get lock on RADOS object '%s' - (%s)\n",
+			progname, oid, strerror(-ret));
+		return ret;
+	}
+
+	/* lock obtained */
+	return 0;
+}
+
+static int ctdb_mutex_rados_unlock(rados_ioctx_t ioctx,
+				   const char *oid)
+{
+	int ret;
+
+	ret = rados_unlock(ioctx, oid,
+			   CTDB_MUTEX_CEPH_LOCK_NAME,
+			   CTDB_MUTEX_CEPH_LOCK_COOKIE);
+	if (ret < 0) {
+		fprintf(stderr,
+			"%s: Failed to drop lock on RADOS object '%s' - (%s)\n",
+			progname, oid, strerror(-ret));
+		return ret;
+	}
+
+	return 0;
+}
+
+struct ctdb_mutex_rados_state {
+	bool holding_mutex;
+	const char *ceph_cluster_name;
+	const char *ceph_auth_name;
+	const char *pool_name;
+	const char *object;
+	int ppid;
+	struct tevent_context *ev;
+	struct tevent_signal *sig_ev;
+	struct tevent_timer *timer_ev;
+	rados_t ceph_cluster;
+	rados_ioctx_t ioctx;
+};
+
+static void ctdb_mutex_rados_sigterm_cb(struct tevent_context *ev,
+					struct tevent_signal *se,
+					int signum,
+					int count,
+					void *siginfo,
+					void *private_data)
+{
+	struct ctdb_mutex_rados_state *cmr_state = private_data;
+	int ret;
+
+	if (!cmr_state->holding_mutex) {
+		fprintf(stderr, "Sigterm callback invoked without mutex!\n");
+		ret = -EINVAL;
+		goto err_ctx_cleanup;
+	}
+
+	ret = ctdb_mutex_rados_unlock(cmr_state->ioctx, cmr_state->object);
+err_ctx_cleanup:
+	ctdb_mutex_rados_ctx_destroy(cmr_state->ceph_cluster,
+				     cmr_state->ioctx);
+	talloc_free(cmr_state);
+	exit(ret ? 1 : 0);
+}
+
+static void ctdb_mutex_rados_timer_cb(struct tevent_context *ev,
+				      struct tevent_timer *te,
+				      struct timeval current_time,
+				      void *private_data)
+{
+	struct ctdb_mutex_rados_state *cmr_state = private_data;
+	int ret;
+
+	if (!cmr_state->holding_mutex) {
+		fprintf(stderr, "Timer callback invoked without mutex!\n");
+		ret = -EINVAL;
+		goto err_ctx_cleanup;
+	}
+
+	if ((kill(cmr_state->ppid, 0) == 0) || (errno != ESRCH)) {
+		/* parent still around, keep waiting */
+		cmr_state->timer_ev = tevent_add_timer(cmr_state->ev, cmr_state,
+						       timeval_current_ofs(5,0),
+						       ctdb_mutex_rados_timer_cb,
+						       cmr_state);
+		if (cmr_state->timer_ev == NULL) {
+			fprintf(stderr, "Failed to create timer event\n");
+			/* rely on signal cb */
+		}
+		return;
+	}
+
+	/* parent ended, drop lock and exit */
+	ret = ctdb_mutex_rados_unlock(cmr_state->ioctx, cmr_state->object);
+err_ctx_cleanup:
+	ctdb_mutex_rados_ctx_destroy(cmr_state->ceph_cluster,
+				     cmr_state->ioctx);
+	talloc_free(cmr_state);
+	exit(ret ? 1 : 0);
+}
+
+int main(int argc, char *argv[])
+{
+	int ret;
+	struct ctdb_mutex_rados_state *cmr_state;
+
+	progname = argv[0];
+
+	if (argc != 5) {
+		fprintf(stderr, "Usage: %s "
+				" \n",
+			progname);
+		ret = -EINVAL;
+		goto err_out;
+	}
+
+	ret = setvbuf(stdout, NULL, _IONBF, 0);
+	if (ret != 0) {
+		fprintf(stderr, "Failed to configure unbuffered stdout I/O\n");
+	}
+
+	cmr_state = talloc_zero(NULL, struct ctdb_mutex_rados_state);
+	if (cmr_state == NULL) {
+		fprintf(stdout, CTDB_MUTEX_STATUS_ERROR);
+		ret = -ENOMEM;
+		goto err_out;
+	}
+
+	cmr_state->ceph_cluster_name = argv[1];
+	cmr_state->ceph_auth_name = argv[2];
+	cmr_state->pool_name = argv[3];
+	cmr_state->object = argv[4];
+
+	cmr_state->ppid = getppid();
+	if (cmr_state->ppid == 1) {
+		/*
+		 * The original parent is gone and the process has
+		 * been reparented to init.  This can happen if the
+		 * helper is started just as the parent is killed
+		 * during shutdown.  The error message doesn't need to
+		 * be stellar, since there won't be anything around to
+		 * capture and log it...
+		 */
+		fprintf(stderr, "%s: PPID == 1\n", progname);
+		ret = -EPIPE;
+		goto err_state_free;
+	}
+
+	cmr_state->ev = tevent_context_init(cmr_state);
+	if (cmr_state->ev == NULL) {
+		fprintf(stderr, "tevent_context_init failed\n");
+		fprintf(stdout, CTDB_MUTEX_STATUS_ERROR);
+		ret = -ENOMEM;
+		goto err_state_free;
+	}
+
+	/* wait for sigterm */
+	cmr_state->sig_ev = tevent_add_signal(cmr_state->ev, cmr_state, SIGTERM, 0,
+					      ctdb_mutex_rados_sigterm_cb,
+					      cmr_state);
+	if (cmr_state->sig_ev == NULL) {
+		fprintf(stderr, "Failed to create signal event\n");
+		fprintf(stdout, CTDB_MUTEX_STATUS_ERROR);
+		ret = -ENOMEM;
+		goto err_state_free;
+	}
+
+	/* periodically check parent */
+	cmr_state->timer_ev = tevent_add_timer(cmr_state->ev, cmr_state,
+					       timeval_current_ofs(5,0),
+					       ctdb_mutex_rados_timer_cb,
+					       cmr_state);
+	if (cmr_state->timer_ev == NULL) {
+		fprintf(stderr, "Failed to create timer event\n");
+		fprintf(stdout, CTDB_MUTEX_STATUS_ERROR);
+		ret = -ENOMEM;
+		goto err_state_free;
+	}
+
+	ret = ctdb_mutex_rados_ctx_create(cmr_state->ceph_cluster_name,
+					  cmr_state->ceph_auth_name,
+					  cmr_state->pool_name,
+					  &cmr_state->ceph_cluster,
+					  &cmr_state->ioctx);
+	if (ret < 0) {
+		fprintf(stdout, CTDB_MUTEX_STATUS_ERROR);
+		goto err_state_free;
+	}
+
+	ret = ctdb_mutex_rados_lock(cmr_state->ioctx, cmr_state->object);
+	if ((ret == -EEXIST) || (ret == -EBUSY)) {
+		fprintf(stdout, CTDB_MUTEX_STATUS_CONTENDED);
+		goto err_ctx_cleanup;
+	} else if (ret < 0) {
+		fprintf(stdout, CTDB_MUTEX_STATUS_ERROR);
+		goto err_ctx_cleanup;
+	}
+
+	cmr_state->holding_mutex = true;
+	fprintf(stdout, CTDB_MUTEX_STATUS_HOLDING);
+
+	/* wait for the signal / timer events to do their work */
+	ret = tevent_loop_wait(cmr_state->ev);
+	if (ret < 0) {
+		goto err_ctx_cleanup;
+	}
+err_ctx_cleanup:
+	ctdb_mutex_rados_ctx_destroy(cmr_state->ceph_cluster,
+				     cmr_state->ioctx);
+err_state_free:
+	talloc_free(cmr_state);
+err_out:
+	return ret ? 1 : 0;
+}
diff --git a/ctdb/wscript b/ctdb/wscript
index f4bccef..75ddee2 100644
--- a/ctdb/wscript
+++ b/ctdb/wscript
@@ -69,6 +69,9 @@ def set_options(opt):
     opt.add_option('--enable-pmda',
                    help=("Turn on PCP pmda support (default=no)"),
                    action="store_true", dest='ctdb_pmda', default=False)
+    opt.add_option('--enable-ceph-reclock',
+                   help=("Enable Ceph CTDB recovery lock helper (default=no)"),
+                   action="store_true", dest='ctdb_ceph_reclock', default=False)
 
     opt.add_option('--with-logdir',
                    help=("Path to log directory"),
@@ -159,6 +162,15 @@ def configure(conf):
         conf.env.CTDB_PMDADIR = os.path.join(conf.env.LOCALSTATEDIR,
                                              'lib/pcp/pmdas/ctdb')
 
+    if Options.options.ctdb_ceph_reclock:
+        if (conf.CHECK_HEADERS('rados/librados.h', False, False, 'rados') and
+            conf.CHECK_LIB('rados', shlib=True)):
+            Logs.info('Building with Ceph librados recovery lock support')
+            conf.define('HAVE_LIBRADOS', 1)
+        else:
+            Logs.error("Missing librados for Ceph recovery lock support")
+            sys.exit(1)
+
     have_infiniband = False
     if Options.options.ctdb_infiniband:
         ib_support = True
@@ -517,6 +529,13 @@ def build(bld):
         bld.INSTALL_FILES('${CTDB_PMDADIR}', 'utils/pmda/README',
                           destname='README')
 
+    if bld.env.HAVE_LIBRADOS:
+        bld.SAMBA_BINARY('ctdb_mutex_ceph_rados_helper',
+                         source='tools/ctdb_mutex_ceph_rados_helper.c',
+                         deps='ctdb-system rados',
+                         includes='include',
+                         install_path='${CTDB_HELPER_BINDIR}')
+
     sed_expr1 = 's|/usr/local/var/lib/ctdb|%s|g' % (bld.env.CTDB_VARDIR)
     sed_expr2 = 's|/usr/local/etc/ctdb|%s|g' % (bld.env.CTDB_ETCDIR)
     sed_expr3 = 's|/usr/local/var/log|%s|g' % (bld.env.CTDB_LOGDIR)
--
2.10.2

From fb53f42d1e02e0f33d1a1304c7d708203d75a14d Mon Sep 17 00:00:00 2001
From: David Disseldorp
Date: Thu, 1 Dec 2016 14:22:45 +0100
Subject: [PATCH 2/2] ctdb/doc: man page for Ceph RADOS cluster mutex helper

Signed-off-by: David Disseldorp
---
 ctdb/doc/Makefile                           |  3 +-
 ctdb/doc/ctdb_mutex_ceph_rados_helper.7.xml | 90 +++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 ctdb/doc/ctdb_mutex_ceph_rados_helper.7.xml

diff --git a/ctdb/doc/Makefile b/ctdb/doc/Makefile
index f0f8215..5bbd748 100644
--- a/ctdb/doc/Makefile
+++ b/ctdb/doc/Makefile
@@ -8,7 +8,8 @@ DOCS = ctdb.1 ctdb.1.html \
 	ctdbd.conf.5 ctdbd.conf.5.html \
 	ctdb.7 ctdb.7.html \
 	ctdb-statistics.7 ctdb-statistics.7.html \
-	ctdb-tunables.7 ctdb-tunables.7.html
+	ctdb-tunables.7 ctdb-tunables.7.html \
+	ctdb_mutex_ceph_rados_helper.7 ctdb_mutex_ceph_rados_helper.7.html
 
 all: $(DOCS)
 
diff --git a/ctdb/doc/ctdb_mutex_ceph_rados_helper.7.xml b/ctdb/doc/ctdb_mutex_ceph_rados_helper.7.xml
new file mode 100644
index 0000000..e5dedc7
--- /dev/null
+++ b/ctdb/doc/ctdb_mutex_ceph_rados_helper.7.xml
@@ -0,0 +1,90 @@
+
+
+
+
+
+    Ceph RADOS Mutex
+    7
+    ctdb
+    CTDB - clustered TDB database
+
+
+
+    ctdb_mutex_ceph_rados_helper
+    Ceph RADOS cluster mutex helper
+
+
+
+    DESCRIPTION
+
+      ctdb_mutex_ceph_rados_helper can be used as a recovery lock provider
+      for CTDB.  When configured, split brain avoidance during CTDB recovery
+      will be handled using locks against an object located in a Ceph RADOS
+      pool.
+      To enable this functionality, include the following line in your CTDB
+      config file:
+
+
+CTDB_RECOVERY_LOCK="!ctdb_mutex_ceph_rados_helper [Cluster] [User] [Pool] [Object]"
+
+Cluster: Ceph cluster name (e.g. ceph)
+User: Ceph cluster user name (e.g. client.admin)
+Pool: Ceph RADOS pool name
+Object: Ceph RADOS object name
+
+
+      The Ceph cluster Cluster must be up and running,
+      with a configuration, and keyring file for User
+      located in a librados default search path (e.g. /etc/ceph/).
+      Pool must already exist.
+
+
+
+
+    SEE ALSO
+
+      ctdb
+      7,
+
+      ctdbd
+      1,
+
+
+
+
+
+
+
+
+      This documentation was written by David Disseldorp
+
+
+
+
+      2016
+      David Disseldorp
+
+
+
+      This program is free software; you can redistribute it and/or
+      modify it under the terms of the GNU General Public License as
+      published by the Free Software Foundation; either version 3 of
+      the License, or (at your option) any later version.
+
+
+      This program is distributed in the hope that it will be
+      useful, but WITHOUT ANY WARRANTY; without even the implied
+      warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
+      PURPOSE.  See the GNU General Public License for more details.
+
+
+      You should have received a copy of the GNU General Public
+      License along with this program; if not, see
+      .
+
+
+
+
+
+--
+2.10.2
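
A note on the status protocol used by the helper in patch 1: it reports
its result as a single character on stdout, "0" for holding, "1" for
contended, "2" for timeout and "3" for error, per the
CTDB_MUTEX_STATUS_* defines. The following is a minimal sketch of how a
caller might decode that byte. The enum and describe_mutex_status() are
hypothetical names used for illustration only; they are not part of
CTDB or of this patch-set.

```c
#include <stddef.h>

/*
 * Status bytes written to stdout by the helper. The values mirror the
 * CTDB_MUTEX_STATUS_* defines in the patch; the enum itself is an
 * illustrative assumption.
 */
enum mutex_status {
	MUTEX_STATUS_HOLDING   = '0',
	MUTEX_STATUS_CONTENDED = '1',
	MUTEX_STATUS_TIMEOUT   = '2',
	MUTEX_STATUS_ERROR     = '3',
};

/*
 * Hypothetical decoder: map the single status byte read from the
 * helper's stdout to a short description. Anything outside the four
 * defined values is reported as "unknown".
 */
const char *describe_mutex_status(char status)
{
	switch (status) {
	case MUTEX_STATUS_HOLDING:
		return "holding";
	case MUTEX_STATUS_CONTENDED:
		return "contended";
	case MUTEX_STATUS_TIMEOUT:
		return "timeout";
	case MUTEX_STATUS_ERROR:
		return "error";
	default:
		return "unknown";
	}
}
```

A caller would read one byte from the helper's stdout and pass it to
describe_mutex_status(); only "holding" means the recovery lock was
taken. Note that the helper in this patch never emits "2" itself, the
define is merely carried over with the others.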