| Message ID | 1475092892-8230-1-git-send-email-jbacik@fb.com (mailing list archive) |
|---|---|
| State | New, archived |
Hi Josef,

On Wed, Sep 28, 2016 at 04:01:32PM -0400, Josef Bacik wrote:
> NBD can become contended on its single connection. We have to serialize all
> writes and we can only process one read response at a time. Fix this by
> allowing userspace to provide multiple connections to a single nbd device.
> This coupled with block-mq drastically increases performance in
> multi-process cases. Thanks,

This reminds me: I've been pondering this for a while, and I think there is
no way we can guarantee the correct ordering of FLUSH replies in the face of
multiple connections, since a WRITE reply on one connection may arrive
before a FLUSH reply on another which it does not cover, even if the server
has no cache coherency issues otherwise.

Having said that, there can certainly be cases where that is not a problem,
and where performance considerations are more important than reliability
guarantees; so once this patch lands in the kernel (and the necessary
support patch lands in the userland utilities), I think I'll just update the
documentation to mention the problems that might ensue, and be done with it.

I can see only a few ways in which to potentially solve this problem:
- Kernel-side nbd-client could send a FLUSH command over every channel, and
  only report successful completion once all replies have been received.
  This might negate some of the performance benefits, however.
- Multiplexing commands over a single connection (perhaps an SCTP one,
  rather than TCP); this would require some effort though, as you said, and
  would probably complicate the protocol significantly.

Regards,
On 09/29/2016 05:52 AM, Wouter Verhelst wrote:
> This reminds me: I've been pondering this for a while, and I think there
> is no way we can guarantee the correct ordering of FLUSH replies in the
> face of multiple connections, since a WRITE reply on one connection may
> arrive before a FLUSH reply on another which it does not cover, even if
> the server has no cache coherency issues otherwise.
[...]
> I can see only a few ways in which to potentially solve this problem:
> - Kernel-side nbd-client could send a FLUSH command over every channel,
>   and only report successful completion once all replies have been
>   received. This might negate some of the performance benefits, however.
> - Multiplexing commands over a single connection (perhaps an SCTP one,
>   rather than TCP); this would require some effort though, as you said,
>   and would probably complicate the protocol significantly.

So think of it like normal disks with multiple channels. We don't send
flushes down all the hwq's to make sure they are clear, we leave that
decision up to the application (usually a FS of course).
So what we're doing here is no worse than what every real disk on the
planet does; our hw queues just have a lot longer transfer times and are
more error prone ;). I definitely think documenting the behavior is
important so that people don't expect magic to happen, and perhaps we could
later add a flag that says to send all the flushes down all the connections
for the paranoid; it should be relatively straightforward to do. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
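[Editorial note: the "paranoid" flag Josef proposes amounts to fanning a
flush out to every connection and completing only when all replies are in.
A minimal sketch of that completion rule, in Python; the `Connection` class
and its methods are purely illustrative stand-ins, not code from nbd-client
or the kernel driver:]

```python
# Sketch of the proposed "send flushes down all the connections" rule:
# a flush completes successfully only when every channel has replied.
# All names here are hypothetical; this models semantics only.

class Connection:
    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails

    def flush(self):
        # Stand-in for sending NBD_CMD_FLUSH on this socket and
        # waiting for its reply.
        return not self.fails

def flush_all(connections):
    """Report success only if every channel's flush reply arrived OK."""
    return all(conn.flush() for conn in connections)

conns = [Connection("sock0"), Connection("sock1"), Connection("sock2")]
assert flush_all(conns)                # every channel flushed

conns.append(Connection("sock3", fails=True))
assert not flush_all(conns)            # one failed channel fails the flush
print("fan-out flush semantics hold")
```

The cost Wouter mentions is visible in the model: the flush's latency is
now that of the slowest channel, which may negate some of the
multi-connection performance win.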
On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote:
> So think of it like normal disks with multiple channels. We don't send
> flushes down all the hwq's to make sure they are clear, we leave that
> decision up to the application (usually a FS of course).

Well, when I asked earlier, Christoph said[1] that blk-mq assumes that when
a FLUSH is sent over one channel, and the reply comes in, that all commands
which have been received, regardless of which channel they were received
over, have reached disk.

[1] Message-ID: <20160915122304.GA15501@infradead.org>

It is impossible for nbd to make such a guarantee, due to head-of-line
blocking on TCP.

[...]
> perhaps we could add a flag later that says send all the flushes down
> all the connections for the paranoid, it should be relatively
> straightforward to do. Thanks,

That's not an unreasonable approach, I guess.
On 09/29/2016 12:41 PM, Wouter Verhelst wrote:
> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
> when a FLUSH is sent over one channel, and the reply comes in, that all
> commands which have been received, regardless of which channel they were
> received over, have reached disk.
>
> [1] Message-ID: <20160915122304.GA15501@infradead.org>
>
> It is impossible for nbd to make such a guarantee, due to head-of-line
> blocking on TCP.

Huh, I missed that. Yeah, that's not possible for us for sure; I think my
option idea is the less awful way forward if we want to address that
limitation. Thanks,

Josef
> On 29 Sep 2016, at 17:59, Josef Bacik <jbacik@fb.com> wrote:
>
> Huh, I missed that. Yeah, that's not possible for us for sure; I think
> my option idea is the less awful way forward if we want to address that
> limitation. Thanks,

I think if the server supports flush (which you can tell), sending flush on
all channels is the only safe thing to do, without substantial protocol
changes (which I'm not sure how one would do, given flush is in a sense a
synchronisation point). I think it's thus imperative this gets fixed before
the change gets merged.
> On Oct 2, 2016, at 12:17 PM, Alex Bligh <alex@alex.org.uk> wrote:
>
> I think if the server supports flush (which you can tell), sending flush
> on all channels is the only safe thing to do, without substantial
> protocol changes (which I'm not sure how one would do, given flush is in
> a sense a synchronisation point). I think it's thus imperative this gets
> fixed before the change gets merged.

It's not "broken", it's working as designed, and any fs on top of this
patch will be perfectly safe, because they all wait for their io to
complete before issuing the FLUSH. If somebody wants to address the
paranoid case later then all the power to them, but this works for my use
case and isn't inherently broken. If it doesn't work for yours then don't
use the feature; it's that simple.
Thanks,

Josef
On Mon, Oct 03, 2016 at 01:47:06AM +0000, Josef Bacik wrote:
> It's not "broken", it's working as designed, and any fs on top of this
> patch will be perfectly safe because they all wait for their io to
> complete before issuing the FLUSH. If somebody wants to address the
> paranoid case later then all the power to them, but this works for my
> use case and isn't inherently broken. If it doesn't work for yours then
> don't use the feature, it's that simple. Thanks,
Let's take one step back here. I agree with Josef that sending
one single flush is perfectly fine for all usual cases. The issue
that was brought up last time we had this discussion was that some
(I think mostly theoretical) backends could not be coherent and
this would be an issue.
So maybe the right way is to simply not support the current odd flush
definition in the kernel then, and require a new properly defined flush
version instead.
On Sun, Oct 02, 2016 at 05:17:14PM +0100, Alex Bligh wrote:
> I think if the server supports flush (which you can tell), sending flush
> on all channels is the only safe thing to do, without substantial
> protocol changes (which I'm not sure how one would do, given flush is in
> a sense a synchronisation point). I think it's thus imperative this gets
> fixed before the change gets merged.

Whoa there, Alex. I don't think this should be a blocker. There is a
theoretical problem, yes, but I believe it to be limited to the case where
the client and the server are not in the same broadcast domain, which is
not the common case (most NBD connections run either over the localhost
iface, or to a machine nearby). In the case where the client and server are
on the same LAN, random packet drop is highly unlikely, so TCP
communication will not be delayed and so the replies will, with high
certainty, arrive in the same order that they were sent.

Obviously the documentation for the "spawn multiple connections" option in
nbd-client needs to clearly state that it will decrease reliability in this
edge case, but I don't think that blocking this feature until a solution
for this problem is implemented is the right way forward. There are valid
use cases where using multiple connections is preferable, even with the
current state of affairs, and they do not all involve "switch off flush".

Regards,
On Mon, Oct 03, 2016 at 12:20:49AM -0700, Christoph Hellwig wrote:
> Let's take one step back here. I agree with Josef that sending one
> single flush is perfectly fine for all usual cases. The issue that was
> brought up last time we had this discussion was that some (I think
> mostly theoretical) backends could not be coherent and this would be an
> issue.

Actually, I was pointing out the TCP head-of-line issue, where a delay on
the socket that contains the flush reply would result in the arrival in the
kernel block layer of a write reply before the said flush reply, resulting
in a write being considered part of the flush when in fact it was not. This
is an edge case, and one highly unlikely to result in problems in the
common case (as per my other mail), but it is something to consider.

> So maybe the right way is to simply not support the current odd flush
> definition in the kernel then and require a new properly defined flush
> version instead.

Can you clarify what you mean by that? Why is it an "odd flush definition",
and how would you "properly" define it? Thanks,
On Mon, Oct 03, 2016 at 09:51:49AM +0200, Wouter Verhelst wrote:
> Actually, I was pointing out the TCP head-of-line issue, where a delay
> on the socket that contains the flush reply would result in the arrival
> in the kernel block layer of a write reply before the said flush reply,
> resulting in a write being considered part of the flush when in fact it
> was not.

The kernel (or any other user of SCSI/ATA/NVMe-like cache flushes) will
explicitly wait for all I/O that needs to be in the cache, so this is not a
problem.

> Can you clarify what you mean by that? Why is it an "odd flush
> definition", and how would you "properly" define it?

E.g. take the definition from NVMe, which also supports multiple queues:

"The Flush command shall commit data and metadata associated with the
specified namespace(s) to non-volatile media. The flush applies to all
commands completed prior to the submission of the Flush command. The
controller may also flush additional data and/or metadata from any
namespace."

The focus is "completed" - we need to get a reply to the host first before
we can send the flush command, so anything that we require to be flushed
needs to explicitly be completed first.
> On 3 Oct 2016, at 08:57, Christoph Hellwig <hch@infradead.org> wrote:
>
>> Can you clarify what you mean by that? Why is it an "odd flush
>> definition", and how would you "properly" define it?
>
> E.g. take the definition from NVMe which also supports multiple queues:
>
> "The Flush command shall commit data and metadata associated with the
> specified namespace(s) to non-volatile media. The flush applies to all
> commands completed prior to the submission of the Flush command.
> The controller may also flush additional data and/or metadata from any
> namespace."
>
> The focus is completed - we need to get a reply to the host first
> before we can send the flush command, so anything that we require
> to be flushed needs to explicitly be completed first.

I think there are two separate issues here:

a) What's described as the "HOL blocking issue".

This comes down to what Wouter said here:

> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
> when a FLUSH is sent over one channel, and the reply comes in, that all
> commands which have been received, regardless of which channel they were
> received over, have reached disk.
>
> [1] Message-ID: <20160915122304.GA15501@infradead.org>
>
> It is impossible for nbd to make such a guarantee, due to head-of-line
> blocking on TCP.

this is perfectly accurate as far as it goes, but this isn't the current
NBD definition of 'flush'.

That is (from the docs):

> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that
> the server completes (i.e. replies to) prior to processing to a
> NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying
> to that NBD_CMD_FLUSH. This paragraph only applies if
> NBD_FLAG_SEND_FLUSH is set within the transmission flags, as otherwise
> NBD_CMD_FLUSH will never be sent by the client to the server.

I don't think HOL blocking is an issue here by that definition, because
all FLUSH requires is that commands that are actually completed are
flushed to disk. If there is head-of-line blocking which delays the
arrival of a write issued before a flush, then the sender cannot be
relying on whether that write is actually completed or not (or it would
have waited for the result). The flush requires only that those commands
COMPLETED are flushed to disk, not that those commands RECEIVED have been
flushed to disk (and a fortiori not that those commands SENT FIRST have
been flushed to disk). From the point of view of the client, the flush can
therefore only guarantee that the data associated with those commands for
which it's actually received a reply prior to issuing the flush will be
flushed, because the replies can be disordered too.

I don't think there is actually a problem here - Wouter, if I'm wrong
about this, I'd like to understand your argument better.

b) What I'm describing - which is the lack of synchronisation between
channels.

Suppose you have a simple forking NBD server which uses (e.g.) a Ceph
backend. Each process (i.e. each NBD channel) will have a separate
connection to something with its own cache and buffering. Issuing a flush
in Ceph requires waiting until a quorum of backends (OSDs) has been
written to, and with a number of backends substantially greater than the
quorum, it is not unlikely that a flush on one channel will not wait for
writes on what Ceph considers a completely independent channel to have
fully written (assuming the write completes before the flush is done).

The same would happen pretty trivially with a forking server that uses a
process-space write-back cache.

This is because when the spec says:

"All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that
the server completes (i.e. replies to) prior to processing to a
NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to
that NBD_CMD_FLUSH."

what it currently means is actually "All write commands (that includes
NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH THAT CLIENT*** that
the server completes (i.e. replies to) prior to processing to a
NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to
that NBD_CMD_FLUSH".

So what we would need the spec to mean is "All write commands (that
includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CHANNEL
OF THAT CLIENT*** that the server completes (i.e. replies to) prior to
processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage
prior to replying to that NBD_CMD_FLUSH". And as we have no way to
associate different channels of the same client, for servers that can't
rely on the OS to synchronise flushing across different clients relating
to the same file, in practice that means "All write commands (that
includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CLIENT
AT ALL*** that the server completes (i.e. replies to) prior to processing
to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to
replying to that NBD_CMD_FLUSH" - i.e. a flush on any channel of any
client must flush every channel of every client, because we have no easy
way to tell which clients are in fact two channels. I have concerns over
the scalability of this.

Now, in the reference server, NBD_CMD_FLUSH is implemented through an
fdatasync(). Each client (and therefore each channel) runs in a different
process.

Earlier in this thread, someone suggested that if this happens:

    Process A               Process B
    =========               =========
    fd1=open("file123")
                            fd2=open("file123")
    write(fd1, ...)
                            fdatasync(fd2)

then the fdatasync() is guaranteed to sync the write that Process A has
written. This may or may not be the case under Linux (wiser minds than me
will know). Is it guaranteed to be the case with (e.g.) the file on NFS?
On all POSIX platforms?

Looking at
http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html
I'd say it was a little ambiguous as to whether it will ALWAYS flush all
data associated with the file, even if it is being written by a different
process (and a different FD).

If fdatasync() is always guaranteed to flush data associated with writes
by a different process (and separately opened fd), then as it happens
there won't be a problem on the reference server, just on servers that
don't happen to use fdatasync() or similar to implement flushes, and
which don't maintain their own caches. If fdatasync() is not so
guaranteed, we have a problem with the reference server too, at least on
some platforms and filing systems.

What I'm therefore asking for is either:
a) that the server can say 'if you are multichannel, you will need to
   send flush on each channel' (best); OR
b) that the server can say 'don't go multichannel'

as part of the negotiation stage. Of course as this is dependent on the
backend, this is going to be something that is per-target (i.e. needs to
come as a transmission flag or similar).
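[Editorial note: the fdatasync() question above can at least be exercised
mechanically. The Python sketch below writes through one file descriptor
and calls fdatasync() on a second, independently opened descriptor on the
same file, standing in for the two server processes. It only demonstrates
the calls involved on the platform where it runs; it cannot answer the
crash-durability question, which would need power-failure testing:]

```python
import os
import tempfile

# "Process A" writes via fd1; "Process B" issues fdatasync() via fd2,
# a separately opened descriptor on the same file. On Linux both fds
# reference the same inode and shared page cache, which is why the
# fdatasync is generally expected to cover fd1's write; POSIX only
# speaks of "the file", which is the ambiguity under discussion.

fd1, path = tempfile.mkstemp()           # stand-in for Process A's fd
fd2 = os.open(path, os.O_RDONLY)         # stand-in for Process B's fd

os.write(fd1, b"data written via fd1")
os.fdatasync(fd2)                        # flush requested via the OTHER fd

# Both descriptors name the same inode.
assert os.fstat(fd1).st_ino == os.fstat(fd2).st_ino

os.close(fd1)
os.close(fd2)
os.unlink(path)
print("fdatasync on a second fd completed without error")
```

Note that the script completing without error on one platform says nothing
about NFS or other POSIX systems, which is exactly Alex's point.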
On 10/03/2016 07:34 AM, Alex Bligh wrote:
[... full quote of Alex's mail trimmed ...]
> What I'm therefore asking for is either:
> a) that the server can say 'if you are multichannel, you will need to
>    send flush on each channel' (best); OR
> b) that the server can say 'don't go multichannel'
>
> as part of the negotiation stage. Of course as this is dependent on the
> backend, this is going to be something that is per-target (i.e. needs to
> come as a transmission flag or similar).

Ok, I understand your objections now. You aren't arguing that we are
unsafe by default, only that we are unsafe with servers that do something
special beyond simply writing to a single disk or file. I agree this is
problematic, but you simply don't use this feature if your server can't
deal with it well. Thanks,

Josef
Josef,

> On 3 Oct 2016, at 15:32, Josef Bacik <jbacik@fb.com> wrote:
>
> Ok I understand your objections now. You aren't arguing that we are
> unsafe by default, only that we are unsafe with servers that do
> something special beyond simply writing to a single disk or file. I
> agree this is problematic, but you simply don't use this feature if
> your server can't deal with it well.

Not quite. I am arguing that:

1. Parallel channels as currently implemented are inherently unsafe on
   some servers - I have given an example of such servers.

2. Parallel channels as currently implemented may be unsafe on the
   reference server on some or all platforms (depending on the behaviour
   of fdatasync(), which might vary between platforms).

3. Either the parallel channels stuff should issue flush requests on all
   channels, or the protocol should protect the unwitting user by
   negotiating an option to do so (or refusing multi-channel connects).
   It is not reasonable for an nbd client to have to know the intimate
   details of the server and its implementation of synchronisation
   primitives - saying 'the user should disable multiple channels' is
   not good enough, as this is what we have a protocol for.

"Unsafe" here means 'do not conform to the kernel's requirements for flush
semantics, whereas they did without multiple channels'.

So from my point of view, either we need to have an additional connection
flag ("don't use multiple channels unless you are going to issue a flush
on all channels") OR flush on all channels should be the default. Either
way, SOME more work needs to be done on your patch, because it currently
doesn't issue flush on all channels, and it currently does not have any
way to prevent use of multiple channels.

I'd really like someone with Linux / POSIX knowledge to confirm the Linux
and POSIX position on fdatasync() of an fd opened by one process on writes
made to the same file by another process. If on all POSIX platforms and
all filing systems this is guaranteed to flush the other process's writes,
then I think we could argue "OK, there may be some weird servers which
might not support multiple channels and they just need a way of signalling
that". If on the other hand there is no such cross-platform guarantee, I
think this means in essence that even with the reference server this patch
is unsafe, and it needs adapting to send flushes on all channels - yes, it
might theoretically be possible to introduce IPC to the current server,
but you'd still need some way of tying together channels from a single
client.
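[Editorial note: the failure mode Alex describes - a forking server where
each channel keeps a private write-back cache - can be modelled in a few
lines. The toy `Channel` class below is purely illustrative and is not
taken from any real NBD server; it shows how a write completed on one
channel can remain volatile after a flush on another:]

```python
# Toy model of a forking server in which each channel has its own
# process-space write-back cache. A flush drains only that channel's
# cache, so a write that was already COMPLETED (replied to) on channel B
# is still volatile after channel A's flush returns -- violating the
# per-client reading of the NBD flush requirement.

class Channel:
    def __init__(self):
        self.cache = {}              # offset -> data, not yet durable

    def write(self, off, data):
        self.cache[off] = data
        return "done"                # reply sent: the write is "completed"

    def flush(self, disk):
        disk.update(self.cache)      # drains only THIS channel's cache
        self.cache.clear()

disk = {}                            # stand-in for non-volatile storage
a, b = Channel(), Channel()

b.write(0, b"completed-but-cached")  # client has already seen this reply
a.flush(disk)                        # flush issued on the OTHER channel

assert 0 not in disk                 # the completed write was NOT flushed
b.flush(disk)                        # only b's own flush makes it durable
assert disk[0] == b"completed-but-cached"
print("flush on one channel missed the other channel's completed write")
```

This is exactly why flush-on-all-channels (or a negotiation flag refusing
multi-channel) is being requested for such servers.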
Alex, Christoph, On Mon, Oct 03, 2016 at 12:34:33PM +0100, Alex Bligh wrote: > On 3 Oct 2016, at 08:57, Christoph Hellwig <hch@infradead.org> wrote: > >> Can you clarify what you mean by that? Why is it an "odd flush > >> definition", and how would you "properly" define it? > > > > E.g. take the definition from NVMe which also supports multiple queues: > > > > "The Flush command shall commit data and metadata associated with the > > specified namespace(s) to non-volatile media. The flush applies to all > > commands completed prior to the submission of the Flush command. Oh, prior to submission? Okay, that means I must have misunderstood what you meant before; in that case there shouldn't be a problem. > > The controller may also flush additional data and/or metadata from any > > namespace." > > > > The focus is completed - we need to get a reply to the host first > > before we can send the flush command, so anything that we require > > to be flushed needs to explicitly be completed first. > > I think there are two separate issues here: > > a) What's described as the "HOL blocking issue". > > This comes down to what Wouter said here: > > > Well, when I asked earlier, Christoph said[1] that blk-mq assumes that > > when a FLUSH is sent over one channel, and the reply comes in, that all > > commands which have been received, regardless of which channel they were > > received over, have reached disk. > > > > [1] Message-ID: <20160915122304.GA15501@infradead.org> > > > > It is impossible for nbd to make such a guarantee, due to head-of-line > > blocking on TCP. > > this is perfectly accurate as far as it goes, but this isn't the current > NBD definition of 'flush'. I didn't read it that way. > That is (from the docs): > > > All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) > > that the server completes (i.e. replies to) prior to processing to a > > NBD_CMD_FLUSH MUST be written to non-volatile storage prior to > > replying to that NBD_CMD_FLUSH. 
This is somewhat ambiguous, in that (IMO) it doesn't clearly state the point where the cutoff of "may not be on disk yet" is. What is "processing"? We don't define that, and therefore it could be any point between "receipt of the request message" and "sending the reply message". I had interpreted it closer to the latter than was apparently intended, but that isn't very useful; I see now that it should be closer to the former; a more useful definition is probably something along the following lines: All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM) for which a reply was received on the client side prior to the transmission of the NBD_CMD_FLUSH message MUST be written to non-volatile storage prior to replying to that NBD_CMD_FLUSH. A server MAY process this command in ways that result in committing more data to non-volatile storage than is strictly required. [...] > I don't think there is actually a problem here - Wouter if I'm wrong > about this, I'd like to understand your argument better. No, I now see that there isn't, and I misunderstood things. However, I do think we should update the spec to clarify this. > b) What I'm describing - which is the lack of synchronisation between > channels. [... long explanation snipped...] Yes, and I acknowledge that. However, I think that should not be a blocker. It's fine to mark this feature as experimental; it will not ever be required to use multiple connections to connect to a server. When this feature lands in nbd-client, I plan to ensure that the man page and -help output says something along the following lines: use N connections to connect to the NBD server, improving performance at the cost of a possible loss of reliability. 
The interactions between multiple connections and the NBD_CMD_FLUSH command, especially when the actual storage and the NBD server are not physically on the same machine, are not currently well defined and not completely understood, and therefore the use of multiple connections to the same server could theoretically lead to data corruption and/or loss. Use with caution. This probably needs some better wording, but you get the idea. (also, this will interact really really badly with the reference implementation's idea of copy-on-write, I suppose, since that implements COW on a per-socket basis with a per-IP diff file...) Anyway, the point is that this is a feature which may still cause problems in a number of edge cases and which should therefore not be the default yet, but which can be useful in a number of common situations for which NBD is used today. [...] > Now, in the reference server, NBD_CMD_FLUSH is implemented through an > fdatasync(). Actually, no, the reference server uses fsync() for reasons that I've forgotten (side note: you wrote it that way ;-) [...]
Wouter, >>> It is impossible for nbd to make such a guarantee, due to head-of-line >>> blocking on TCP. >> >> this is perfectly accurate as far as it goes, but this isn't the current >> NBD definition of 'flush'. > > I didn't read it that way. > >> That is (from the docs): >> >>> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) >>> that the server completes (i.e. replies to) prior to processing to a >>> NBD_CMD_FLUSH MUST be written to non-volatile storage prior to >>> replying to that NBD_CMD_FLUSH. > > This is somewhat ambiguous, in that (IMO) it doesn't clearly state the > point where the cutoff of "may not be on disk yet" is. What is > "processing"? OK. I now get the problem. There are actually two types of HOL blocking, server to client and client to server. Before, we allowed the server to issue replies out of order with requests. However, the protocol did guarantee that the server saw requests in the order presented by clients. With the proposed multi-connection support, this changes. Whilst the client needed to be prepared for things to be disordered by the server, the server did not previously need to be prepared for things being disordered by the client. And (more subtly) the client could assume that the server got its own requests in the order it sent them, which is important for flush as currently written. Here's an actual illustration of the problem. Currently we have:

    Client                          Server
    ======                          ======
    TX: WRITE
    TX: FLUSH
                                    RX: WRITE
                                    RX: FLUSH
                                    Process write
                                    Process flush including write
                                    TX: write reply
                                    TX: flush reply
    RX: write reply
    RX: flush reply

Currently the RX statements cannot be disordered. However, the server can process the requests in a different order. 
If it does, the flush need not include the write, like this:

    Client                          Server
    ======                          ======
    TX: WRITE
    TX: FLUSH
                                    RX: WRITE
                                    RX: FLUSH
                                    Process flush not including write
                                    Process write
                                    TX: flush reply
                                    TX: write reply
    RX: flush reply
    RX: write reply

and the client gets to know of the fact, because the flush reply comes before the write reply. It can know its data has not been flushed. It could send another flush in this case, or simply change its code to not send the flush until the write has been received. However, with the multi-connection support, both the replies and the requests can be disordered. So the client can ONLY know a flush has been completed if it has received a reply to the write before it sends the flush. This is in my opinion problematic, as what you want to do as a client is stream requests (write, write, write, flush, write, write, write). If those go down different channels, AND you don't wait for a reply, you can no longer safely stream requests at all. Now you need to wait for the flush request to respond before sending another write (if write ordering to the platter is important), which seems to defeat the object of streaming commands. An 'in extremis' example would be a sequence of write / flush requests sent down two channels, where the write requests all end up on one channel, and the flush requests on the other, and the write channel is serviced immediately and the flush requests delayed indefinitely. > We don't define that, and therefore it could be any point > between "receipt of the request message" and "sending the reply > message". I had interpreted it closer to the latter than was apparently > intended, but that isn't very useful; The thing is the server doesn't know what replies the client has received, only the replies it has sent. Equally the server doesn't know what commands the client has sent, only what commands it has received. 
As currently written, it's a simple rule, NBD_CMD_FLUSH means "Mr Server: you must make sure that any write you have sent a reply to must now be persisted on disk. If you haven't yet sent a reply to a write - perhaps because due to HOL blocking you haven't received it, or perhaps it's still in progress, or perhaps it's finished but you haven't sent the reply - don't worry". The promise to the client is that all the writes to which the server has sent a reply are now on disk. But the client doesn't know which writes the server has sent a reply to. It only knows which replies it has received (which will be a subset of those). So to the client it means that the server has persisted to disk all those commands to which it has received a reply. However, to the server, the 'MUST' condition needs to refer to the replies it sent, not the replies the client receives. I think "before processing" here really just means "before sending a reply to the NBD_CMD_FLUSH". I believe the 'before processing' phrase came from the kernel wording. > I see now that it should be closer > to the former; a more useful definition is probably something along the > following lines: > > All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM) > for which a reply was received on the client side prior to the No, that's wrong as the server has no knowledge of whether the client has actually received them so no way of knowing to which writes that would reply. It thus has to ensure it covers a potentially larger set of writes - those for which a reply has been sent, as all those MIGHT have been received by the client. > transmission of the NBD_CMD_FLUSH message MUST be written to no that's wrong because the server has no idea when the client transmitted the NBD_CMD_FLUSH message. It can only deal with events it knows about. And the delimiter is (in essence) those write commands that were replied to prior to the sending of the reply to the NBD_CMD_FLUSH - of course it can flush others too. 
> non-volatile storage prior to replying to that NBD_CMD_FLUSH. A > server MAY process this command in ways that result in committing more > data to non-volatile storage than is strictly required. I think the wording is basically right for the current semantic, but here's a slight improvement: The server MUST NOT send a non-error reply to NBD_CMD_FLUSH until it has ensured that the contents of all writes that it has already completed (i.e. replied to) have been persisted to non-volatile storage. However, given that the replies can subsequently be reordered, I now think we do have a problem, as the client can't tell which writes those refer to. > [...] >> I don't think there is actually a problem here - Wouter if I'm wrong >> about this, I'd like to understand your argument better. > > No, I now see that there isn't, and I misunderstood things. However, I > do think we should update the spec to clarify this. Haha - I now think there is. You accidentally convinced me! >> b) What I'm describing - which is the lack of synchronisation between >> channels. > [... long explanation snipped...] > > Yes, and I acknowledge that. However, I think that should not be a > blocker. It's fine to mark this feature as experimental; it will not > ever be required to use multiple connections to connect to a server. > > When this feature lands in nbd-client, I plan to ensure that the man > page and -help output says something along the following lines: > > use N connections to connect to the NBD server, improving performance > at the cost of a possible loss of reliability. So in essence we are relying on (userspace) nbd-client not to open more connections if it's unsafe? IE we can sort out all the negotiation of whether it's safe or unsafe within userspace and not bother Josef about it? 
I suppose that's fine in that we can at least shorten the CC: line, but I still think it would be helpful if the protocol >> Now, in the reference server, NBD_CMD_FLUSH is implemented through an >> fdatasync(). > > Actually, no, the reference server uses fsync() for reasons that I've > forgotten (side note: you wrote it that way ;-) > > [...] I vaguely remember why - something to do with files expanding when holes were written to. However, I don't think that makes much difference to the question I asked, or at most s/fdatasync/fsync/g
Hi Alex, On Tue, Oct 04, 2016 at 10:35:03AM +0100, Alex Bligh wrote: > Wouter, > > I see now that it should be closer > > to the former; a more useful definition is probably something along the > > following lines: > > > > All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM) > > for which a reply was received on the client side prior to the > > No, that's wrong as the server has no knowledge of whether the client > has actually received them so no way of knowing to which writes that > would reply. I realise that, but I don't think it's a problem. In the current situation, a client could opportunistically send a number of write requests immediately followed by a flush and hope for the best. However, in that case there is no guarantee that for the write requests that the client actually cares about to have hit the disk, a reply arrives on the client side before the flush reply arrives. If that doesn't happen, that would then mean the client would have to issue another flush request, probably at a performance hit. As I understand Christoph's explanations, currently the Linux kernel *doesn't* issue flush requests unless and until the necessary writes have already completed (i.e., the reply has been received and processed on the client side). Given that, given the issue in the previous paragraph, and given the uncertainty introduced with multiple connections, I think it is reasonable to say that a client should just not assume a flush touches anything except for the writes for which it has already received a reply by the time the flush request is sent out. Those are semantics that are actually useful and can be guaranteed in the face of multiple connections. Other semantics can not. It is indeed impossible for a server to know what has been received by the client by the time it (the client) sent out the flush request. However, the server doesn't need that information, at all. 
The flush request's semantics do not say that any request not covered by the flush request itself MUST NOT have hit disk; instead, it just says that there is no guarantee on whether or not that is the case. That's fine; all a server needs to know is that when it receives a flush, it needs to fsync() or some such, and then send the reply. All a *client* needs to know is which requests have most definitely hit the disk. In my proposal, those are the requests that finished before the flush request was sent, and not the requests that finished between that and when the flush reply is received. Those are *likely* to also be covered (especially on single-connection NBD setups), but in my proposal, they're no longer *guaranteed* to be. Christoph: just to double-check: would such semantics be incompatible with the semantics that the Linux kernel expects of block devices? If so, we'll have to review. Otherwise, I think we should go with that. [...] > >> b) What I'm describing - which is the lack of synchronisation between > >> channels. > > [... long explanation snipped...] > > > > Yes, and I acknowledge that. However, I think that should not be a > > blocker. It's fine to mark this feature as experimental; it will not > > ever be required to use multiple connections to connect to a server. > > > > When this feature lands in nbd-client, I plan to ensure that the man > > page and -help output says something along the following lines: > > > > use N connections to connect to the NBD server, improving performance > > at the cost of a possible loss of reliability. > > So in essence we are relying on (userspace) nbd-client not to open > more connections if it's unsafe? IE we can sort out all the negotiation > of whether it's safe or unsafe within userspace and not bother Josef > about it? Yes, exactly. > I suppose that's fine in that we can at least shorten the CC: line, > but I still think it would be helpful if the protocol unfinished sentence here...
Wouter, > On 6 Oct 2016, at 10:04, Wouter Verhelst <w@uter.be> wrote: > > Hi Alex, > > On Tue, Oct 04, 2016 at 10:35:03AM +0100, Alex Bligh wrote: >> Wouter, >>> I see now that it should be closer >>> to the former; a more useful definition is probably something along the >>> following lines: >>> >>> All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM) >>> for which a reply was received on the client side prior to the >> >> No, that's wrong as the server has no knowledge of whether the client >> has actually received them so no way of knowing to which writes that >> would reply. > > I realise that, but I don't think it's a problem. > > In the current situation, a client could opportunistically send a number > of write requests immediately followed by a flush and hope for the best. > However, in that case there is no guarantee that for the write requests > that the client actually cares about to have hit the disk, a reply > arrives on the client side before the flush reply arrives. If that > doesn't happen, that would then mean the client would have to issue > another flush request, probably at a performance hit. Sure, but the client knows (currently) that any write request which it has a reply to before it receives the reply from the flush request has been written to disk. Such a client might simply note whether it has issued any subsequent write requests. > As I understand Christoph's explanations, currently the Linux kernel > *doesn't* issue flush requests unless and until the necessary writes > have already completed (i.e., the reply has been received and processed > on the client side). Sure, but it is not the only client. > Given that, given the issue in the previous > paragraph, and given the uncertainty introduced with multiple > connections, I think it is reasonable to say that a client should just > not assume a flush touches anything except for the writes for which it > has already received a reply by the time the flush request is sent out. 
OK. So you are proposing weakening the semantic for flush (saying that it is only guaranteed to cover those writes for which the client has actually received a reply prior to sending the flush, as opposed to prior to receiving the flush reply). This is based on the view that the Linux kernel client wouldn't be affected, and if other clients were affected, their behaviour would be 'somewhat unusual'. We do have one significant other client out there that uses flush which is Qemu. I think we should get a view on whether they would be affected. > Those are semantics that are actually useful and can be guaranteed in > the face of multiple connections. Other semantics can not. Well there is another semantic which would work just fine, and also cures the other problem (synchronisation between channels) which would be simply that flush is only guaranteed to affect writes issued on the same channel. Then flush would do the natural thing, i.e. flush all the writes that had been done *on that channel*. > It is indeed impossible for a server to know what has been received by > the client by the time it (the client) sent out the flush request. > However, the server doesn't need that information, at all. The flush > request's semantics do not say that any request not covered by the flush > request itself MUST NOT have hit disk; instead, it just says that there > is no guarantee on whether or not that is the case. That's fine; all a > server needs to know is that when it receives a flush, it needs to > fsync() or some such, and then send the reply. All a *client* needs to > know is which requests have most definitely hit the disk. In my > proposal, those are the requests that finished before the flush request > was sent, and not the requests that finished between that and when the > flush reply is received. Those are *likely* to also be covered > (especially on single-connection NBD setups), but in my proposal, > they're no longer *guaranteed* to be. 
I think my objection was more that you were writing mandatory language for a server's behaviour based on what the client perceives. What you are saying from the client's point of view is that under your proposed change it can only rely on the fact that writes in respect of which the reply has been received prior to issuing the flush are persisted to disk (more might be persisted, but the client can't rely on it). So far so good. However, I don't think you can usefully make the guarantee weaker from the SERVER'S point of view, because it doesn't know how things got reordered. IE it still needs to persist to disk any write that it has completed when it processes the flush. Yes, the client doesn't get the same guarantee, but the server can't know whether it can be slacker about a particular write which it has completed but for which the client didn't receive the reply prior to issuing the flush - it must just assume that if it sent the write reply prior to replying to the flush (or even queued it to be sent), then it MIGHT have arrived prior to the flush being issued. IE I don't actually think the wording from the server side needs changing now I see what you are trying to do. Just we need a new paragraph saying what the client can and cannot rely on. > Christoph: just to double-check: would such semantics be incompatible > with the semantics that the Linux kernel expects of block devices? If > so, we'll have to review. Otherwise, I think we should go with that. It would also really be nice to know whether there is any way the flushes could be linked to the channel(s) containing the writes to which they belong - this would solve the issues with coherency between channels. Equally no one has answered the question as to whether fsync/fdatasync is guaranteed (especially when not on Linux, not on a block FS) to give synchronisation when different processes have different FDs open on the same file. Is there some way to detect when this is safe? > > [...] 
>>>> b) What I'm describing - which is the lack of synchronisation between >>>> channels. >>> [... long explanation snipped...] >>> >>> Yes, and I acknowledge that. However, I think that should not be a >>> blocker. It's fine to mark this feature as experimental; it will not >>> ever be required to use multiple connections to connect to a server. >>> >>> When this feature lands in nbd-client, I plan to ensure that the man >>> page and -help output says something along the following lines: >>> >>> use N connections to connect to the NBD server, improving performance >>> at the cost of a possible loss of reliability. >> >> So in essence we are relying on (userspace) nbd-client not to open >> more connections if it's unsafe? IE we can sort out all the negotiation >> of whether it's safe or unsafe within userspace and not bother Josef >> about it? > > Yes, exactly. > >> I suppose that's fine in that we can at least shorten the CC: line, >> but I still think it would be helpful if the protocol > > unfinished sentence here... ... but I still think it would be helpful if the protocol helped out the end user of the client and refused to negotiate multichannel connections when they are unsafe. How is the end client meant to know whether the back end is not on Linux, not on a block device, done via a Ceph driver etc? I still think it's pretty damn awkward that with a ceph back end (for instance) which would be one of the backends to benefit the most from multichannel connections (as it's inherently parallel), no one has explained how flush could be done safely.
On Thu, Oct 06, 2016 at 10:41:36AM +0100, Alex Bligh wrote: > Wouter, [...] > > Given that, given the issue in the previous > > paragraph, and given the uncertainty introduced with multiple > > connections, I think it is reasonable to say that a client should just > > not assume a flush touches anything except for the writes for which it > > has already received a reply by the time the flush request is sent out. > > OK. So you are proposing weakening the semantic for flush (saying that > it is only guaranteed to cover those writes for which the client has > actually received a reply prior to sending the flush, as opposed to > prior to receiving the flush reply). This is based on the view that > the Linux kernel client wouldn't be affected, and if other clients > were affected, their behaviour would be 'somewhat unusual'. Right. > We do have one significant other client out there that uses flush > which is Qemu. I think we should get a view on whether they would be > affected. That's certainly something to consider, yes. > > Those are semantics that are actually useful and can be guaranteed in > > the face of multiple connections. Other semantics can not. > > Well there is another semantic which would work just fine, and also > cures the other problem (synchronisation between channels) which would > be simply that flush is only guaranteed to affect writes issued on the > same channel. Then flush would do the natural thing, i.e. flush > all the writes that had been done *on that channel*. That is an option, yes, but the natural result will be that you issue N flush requests, rather than one, which I'm guessing will kill performance. Therefore, I'd prefer not to go down that route. [...] > > It is indeed impossible for a server to know what has been received by > > the client by the time it (the client) sent out the flush request. > > However, the server doesn't need that information, at all. 
The flush > > request's semantics do not say that any request not covered by the flush > > request itself MUST NOT have hit disk; instead, it just says that there > > is no guarantee on whether or not that is the case. That's fine; all a > > server needs to know is that when it receives a flush, it needs to > > fsync() or some such, and then send the reply. All a *client* needs to > > know is which requests have most definitely hit the disk. In my > > proposal, those are the requests that finished before the flush request > > was sent, and not the requests that finished between that and when the > > flush reply is received. Those are *likely* to also be covered > > (especially on single-connection NBD setups), but in my proposal, > > they're no longer *guaranteed* to be. > > I think my objection was more that you were writing mandatory language > for a server's behaviour based on what the client perceives. > > What you are saying from the client's point of view is that under > your proposed change it can only rely on the fact that writes in respect of > which the reply has been received prior to issuing the flush are persisted > to disk (more might be persisted, but the client can't rely on it). Exactly. [...] > IE I don't actually think the wording from the server side needs changing > now I see what you are trying to do. Just we need a new paragraph saying > what the client can and cannot rely on. That's obviously also a valid option. I'm looking forward to your proposed wording then :-) [...] > >> I suppose that's fine in that we can at least shorten the CC: line, > >> but I still think it would be helpful if the protocol > > > > unfinished sentence here... > > .... but I still think it would be helpful if the protocol helped out > the end user of the client and refused to negotiate multichannel > connections when they are unsafe. How is the end client meant to know > whether the back end is not on Linux, not on a block device, done > via a Ceph driver etc? 
Well, it isn't. The server, if it provides certain functionality, should also provide particular guarantees. If it can't provide those guarantees, it should not provide that functionality. e.g., if a server runs on a backend with cache coherency issues, it should not allow multiple connections to the same device, etc. > I still think it's pretty damn awkward that with a ceph back end > (for instance) which would be one of the backends to benefit the > most from multichannel connections (as it's inherently parallel), > no one has explained how flush could be done safely. If ceph doesn't have any way to guarantee that a write is available to all readers of a particular device, then it *cannot* be used to map block device semantics with multiple channels. Therefore, it should not allow writing to the device from multiple clients, period, unless the filesystem (or other thing) making use of the nbd device above the ceph layer actually understands how things may go wrong and can take care of it. As such, I don't think that the problems inherent in using multiple connections to a ceph device (which I do not deny) have any place in a discussion on how NBD should work in the face of multiple channels with a sane/regular backend.
On Thu, Oct 06, 2016 at 11:04:15AM +0200, Wouter Verhelst wrote: > In the current situation, a client could opportunistically send a number > of write requests immediately followed by a flush and hope for the best. > However, in that case there is no guarantee that for the write requests > that the client actually cares about to have hit the disk, a reply > arrives on the client side before the flush reply arrives. If that > doesn't happen, that would then mean the client would have to issue > another flush request, probably at a performance hit. There is also no guarantee that the server would receive them in order. Note that people looked into schemes like this multiple times using a SCSI feature called ordered tags which should provide this sort of ordering, but no one managed to make it work reliably. > As I understand Christoph's explanations, currently the Linux kernel > *doesn't* issue flush requests unless and until the necessary writes > have already completed (i.e., the reply has been received and processed > on the client side). Given that, given the issue in the previous > paragraph, and given the uncertainty introduced with multiple > connections, I think it is reasonable to say that a client should just > not assume a flush touches anything except for the writes for which it > has already received a reply by the time the flush request is sent out. Exactly. That's the wording in other protocol specifications, and the semantics Linux (and Windows) rely on. > Christoph: just to double-check: would such semantics be incompatible > with the semantics that the Linux kernel expects of block devices? If > so, we'll have to review. Otherwise, I think we should go with that. No, they match the cache flush semantics in every other storage protocol known to me, and they match the expectations of both the Linux kernel and any other OS or consumer I know about perfectly. 
-- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> On 6 Oct 2016, at 11:15, Wouter Verhelst <w@uter.be> wrote: > >> >> >> .... but I still think it would be helpful if the protocol helped out >> the end user of the client and refused to negotiate multichannel >> connections when they are unsafe. How is the end client meant to know >> whether the back end is not on Linux, not on a block device, done >> via a Ceph driver etc? > > Well, it isn't. The server, if it provides certain functionality, should > also provide particular guarantees. If it can't provide those > guarantees, it should not provide that functionality. > > e.g., if a server runs on a backend with cache coherency issues, it > should not allow multiple connections to the same device, etc. Sure. I'm simply saying that the connection flags should say "I can't support multiple connections to this device" (available at NBD_OPT_INFO time) rather than erroring out. This is a userspace protocol issue. >> I still think it's pretty damn awkward that with a ceph back end >> (for instance) which would be one of the backends to benefit the >> most from multichannel connections (as it's inherently parallel), >> no one has explained how flush could be done safely. > > If ceph doesn't have any way to guarantee that a write is available to > all readers of a particular device, then it *cannot* be used to map > block device semantics with multiple channels. Thinking about it I believe Ceph actually may be able to do that, it's just harder than a straightforward flush. > Therefore, it should not > allow writing to the device from multiple clients, period, unless the > filesystem (or other thing) making use of the nbd device above the ceph > layer actually understands how things may go wrong and can take care of > it. > > As such, I don't think that the problems inherent in using multiple > connections to a ceph device (which I do not deny) have any place in a > discussion on how NBD should work in the face of multiple channels with > a sane/regular backend. 
On which note, I am still not convinced that fsync() provides such
semantics on all operating systems, or on Linux on non-block devices.
I'm not sure all those backends are 'insane'!

However, if the server could signal lack of support for multiple
connections (see above), my concerns would be significantly reduced.
Note this requires no kernel change (as you pointed out).
On Thu, Oct 06, 2016 at 03:31:55AM -0700, Christoph Hellwig wrote:
> No, they match the cache flush semantics in every other storage protocol
> known to me, and they match the expectations of both the Linux kernel
> and any other OS or consumer I know about perfectly.

Okay, I've updated the proto.md file then, to clarify that in the case
of multiple connections, a client MUST NOT send a flush request until it
has seen the replies to the write requests that it cares about. That
should be enough for now.

Thanks,
On Thu, Oct 06, 2016 at 03:09:49PM +0200, Wouter Verhelst wrote:
> Okay, I've updated the proto.md file then, to clarify that in the case
> of multiple connections, a client MUST NOT send a flush request until it
> has seen the replies to the write requests that it cares about. That
> should be enough for now.

How do you guarantee that nothing has been reordered or even lost even for
a single connection?
On Thu, Oct 06, 2016 at 06:16:30AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 06, 2016 at 03:09:49PM +0200, Wouter Verhelst wrote:
> > Okay, I've updated the proto.md file then, to clarify that in the case
> > of multiple connections, a client MUST NOT send a flush request until it
> > has seen the replies to the write requests that it cares about. That
> > should be enough for now.
>
> How do you guarantee that nothing has been reordered or even lost even for
> a single connection?

In the case of a single connection, we already stated that the flush
covers the write requests for which a reply has already been sent out by
the time the flush reply is sent out. On a single connection, there is no
way an implementation can comply with the old requirement but not the new
one.

We do not guarantee any ordering beyond that; and lost requests would be
a bug in the server.
> NBD can become contended on its single connection.  We have to serialize all
> writes and we can only process one read response at a time.  Fix this by
> allowing userspace to provide multiple connections to a single nbd device.  This
> coupled with block-mq drastically increases performance in multi-process cases.
> Thanks,

Hey Josef,

I gave this patch a tryout and I'm getting a kernel paging request BUG when
running a multi-threaded write workload [1].

I have 2 VMs on my laptop, each assigned 2 cpus. I connected the client to
the server via 2 connections and ran:

fio --group_reporting --rw=randwrite --bs=4k --numjobs=2 --iodepth=128 \
    --runtime=60 --time_based --loops=1 --ioengine=libaio --direct=1 \
    --invalidate=1 --randrepeat=1 --norandommap --exitall \
    --name task_nbd0 --filename=/dev/nbd0

The server backend is null_blk btw:
./nbd-server 1022 /dev/nullb0

nbd-client:
./nbd-client -C 2 192.168.100.3 1022 /dev/nbd0

[1]:
[  171.813649] BUG: unable to handle kernel paging request at 0000000235363130
[  171.816015] IP: [<ffffffffc0645e39>] nbd_queue_rq+0x319/0x580 [nbd]
[  171.816015] PGD 7a080067 PUD 0
[  171.816015] Oops: 0000 [#1] SMP
[  171.816015] Modules linked in: nbd(O) rpcsec_gss_krb5 nfsv4 ib_iser iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_generic ppdev kvm_intel cirrus snd_hda_intel ttm kvm irqbypass drm_kms_helper snd_hda_codec drm snd_hda_core snd_hwdep joydev input_leds fb_sys_fops snd_pcm serio_raw syscopyarea snd_timer sysfillrect snd sysimgblt soundcore i2c_piix4 nfsd ib_umad parport_pc auth_rpcgss nfs_acl rdma_ucm nfs rdma_cm iw_cm lockd grace ib_cm configfs sunrpc ib_uverbs mac_hid fscache ib_core lp parport psmouse floppy e1000 pata_acpi [last unloaded: nbd]
[  171.816015] CPU: 0 PID: 196 Comm: kworker/0:1H Tainted: G           O    4.8.0-rc4+ #61
[  171.816015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  171.816015] Workqueue: kblockd blk_mq_run_work_fn
[  171.816015] task: ffff8f0b37b23280 task.stack: ffff8f0b37bf0000
[  171.816015] RIP: 0010:[<ffffffffc0645e39>]  [<ffffffffc0645e39>] nbd_queue_rq+0x319/0x580 [nbd]
[  171.816015] RSP: 0018:ffff8f0b37bf3c20  EFLAGS: 00010206
[  171.816015] RAX: 0000000235363130 RBX: 0000000000000000 RCX: 0000000000000200
[  171.816015] RDX: 0000000000000200 RSI: ffff8f0b37b23b48 RDI: ffff8f0b37b23280
[  171.816015] RBP: ffff8f0b37bf3cc8 R08: 0000000000000001 R09: 0000000000000000
[  171.816015] R10: 0000000000000000 R11: ffff8f0b37f21000 R12: 0000000023536303
[  171.816015] R13: 0000000000000000 R14: 0000000023536313 R15: ffff8f0b37f21000
[  171.816015] FS:  0000000000000000(0000) GS:ffff8f0b3d200000(0000) knlGS:0000000000000000
[  171.816015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.816015] CR2: 0000000235363130 CR3: 00000000789b7000 CR4: 00000000000006f0
[  171.816015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  171.816015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  171.816015] Stack:
[  171.816015]  ffff8f0b00000000 ffff8f0b37a79480 ffff8f0b378513c8 0000000000000282
[  171.816015]  ffff8f0b37b28428 ffff8f0b37a795f0 ffff8f0b37f21500 00000a0023536313
[  171.816015]  ffffea0001c69080 0000000000000000 ffff8f0b37b28280 1395602537b23280
[  171.816015] Call Trace:
[  171.816015]  [<ffffffffb8426840>] __blk_mq_run_hw_queue+0x260/0x390
[  171.816015]  [<ffffffffb84269b2>] blk_mq_run_work_fn+0x12/0x20
[  171.816015]  [<ffffffffb80aae21>] process_one_work+0x1f1/0x6b0
[  171.816015]  [<ffffffffb80aada2>] ? process_one_work+0x172/0x6b0
[  171.816015]  [<ffffffffb80ab32e>] worker_thread+0x4e/0x490
[  171.816015]  [<ffffffffb80ab2e0>] ? process_one_work+0x6b0/0x6b0
[  171.816015]  [<ffffffffb80ab2e0>] ? process_one_work+0x6b0/0x6b0
[  171.816015]  [<ffffffffb80b1f41>] kthread+0x101/0x120
[  171.816015]  [<ffffffffb88d4ecf>] ret_from_fork+0x1f/0x40
[  171.816015]  [<ffffffffb80b1e40>] ? kthread_create_on_node+0x250/0x250
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index ccfcfc1..30f4f58 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -41,26 +41,36 @@
 #include <linux/nbd.h>
 
+struct nbd_sock {
+        struct socket *sock;
+        struct mutex tx_lock;
+};
+
 #define NBD_TIMEDOUT                    0
 #define NBD_DISCONNECT_REQUESTED        1
+#define NBD_DISCONNECTED                2
+#define NBD_RUNNING                     3
 
 struct nbd_device {
         u32 flags;
         unsigned long runtime_flags;
-        struct socket * sock;   /* If == NULL, device is not ready, yet */
+        struct nbd_sock **socks;
         int magic;
 
         struct blk_mq_tag_set tag_set;
 
-        struct mutex tx_lock;
+        struct mutex config_lock;
         struct gendisk *disk;
+        int num_connections;
+        atomic_t recv_threads;
+        wait_queue_head_t recv_wq;
         int blksize;
         loff_t bytesize;
 
         /* protects initialization and shutdown of the socket */
         spinlock_t sock_lock;
         struct task_struct *task_recv;
-        struct task_struct *task_send;
+        struct task_struct *task_setup;
 
 #if IS_ENABLED(CONFIG_DEBUG_FS)
         struct dentry *dbg_dir;
@@ -69,7 +79,6 @@ struct nbd_device {
 
 struct nbd_cmd {
         struct nbd_device *nbd;
-        struct list_head list;
 };
 
 #if IS_ENABLED(CONFIG_DEBUG_FS)
@@ -159,22 +168,20 @@ static void nbd_end_request(struct nbd_cmd *cmd)
  */
 static void sock_shutdown(struct nbd_device *nbd)
 {
-        struct socket *sock;
-
-        spin_lock(&nbd->sock_lock);
+        int i;
 
-        if (!nbd->sock) {
-                spin_unlock_irq(&nbd->sock_lock);
+        if (nbd->num_connections == 0)
+                return;
+        if (test_and_set_bit(NBD_DISCONNECTED, &nbd->runtime_flags))
                 return;
-        }
-
-        sock = nbd->sock;
-        dev_warn(disk_to_dev(nbd->disk), "shutting down socket\n");
-        nbd->sock = NULL;
-        spin_unlock(&nbd->sock_lock);
 
-        kernel_sock_shutdown(sock, SHUT_RDWR);
-        sockfd_put(sock);
+        for (i = 0; i < nbd->num_connections; i++) {
+                struct nbd_sock *nsock = nbd->socks[i];
+                mutex_lock(&nsock->tx_lock);
+                kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+                mutex_unlock(&nsock->tx_lock);
+        }
+        dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
 }
 
 static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
@@ -182,35 +189,31 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 {
         struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
         struct nbd_device *nbd = cmd->nbd;
-        struct socket *sock = NULL;
-
-        spin_lock(&nbd->sock_lock);
 
+        dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down connection\n");
         set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
-
-        if (nbd->sock) {
-                sock = nbd->sock;
-                get_file(sock->file);
-        }
-
-        spin_unlock(&nbd->sock_lock);
-        if (sock) {
-                kernel_sock_shutdown(sock, SHUT_RDWR);
-                sockfd_put(sock);
-        }
-
         req->errors++;
-        dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down connection\n");
+
+        /*
+         * If our disconnect packet times out then we're already holding the
+         * config_lock and could deadlock here, so just set an error and return,
+         * we'll handle shutting everything down later.
+         */
+        if (req->cmd_type == REQ_TYPE_DRV_PRIV)
+                return BLK_EH_HANDLED;
+        mutex_lock(&nbd->config_lock);
+        sock_shutdown(nbd);
+        mutex_unlock(&nbd->config_lock);
         return BLK_EH_HANDLED;
 }
 
 /*
  * Send or receive packet.
  */
-static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
-                int msg_flags)
+static int sock_xmit(struct nbd_device *nbd, int index, int send, void *buf,
+                     int size, int msg_flags)
 {
-        struct socket *sock = nbd->sock;
+        struct socket *sock = nbd->socks[index]->sock;
         int result;
         struct msghdr msg;
         struct kvec iov;
@@ -254,29 +257,28 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
         return result;
 }
 
-static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
-                                 int flags)
+static inline int sock_send_bvec(struct nbd_device *nbd, int index,
+                                 struct bio_vec *bvec, int flags)
 {
         int result;
         void *kaddr = kmap(bvec->bv_page);
-        result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
+        result = sock_xmit(nbd, index, 1, kaddr + bvec->bv_offset,
                            bvec->bv_len, flags);
         kunmap(bvec->bv_page);
         return result;
 }
 
 /* always call with the tx_lock held */
-static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd)
+static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 {
         struct request *req = blk_mq_rq_from_pdu(cmd);
         int result, flags;
         struct nbd_request request;
         unsigned long size = blk_rq_bytes(req);
         u32 type;
+        u32 tag = blk_mq_unique_tag(req);
 
-        if (req->cmd_type == REQ_TYPE_DRV_PRIV)
-                type = NBD_CMD_DISC;
-        else if (req_op(req) == REQ_OP_DISCARD)
+        if (req_op(req) == REQ_OP_DISCARD)
                 type = NBD_CMD_TRIM;
         else if (req_op(req) == REQ_OP_FLUSH)
                 type = NBD_CMD_FLUSH;
@@ -288,16 +290,16 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd)
         memset(&request, 0, sizeof(request));
         request.magic = htonl(NBD_REQUEST_MAGIC);
         request.type = htonl(type);
-        if (type != NBD_CMD_FLUSH && type != NBD_CMD_DISC) {
+        if (type != NBD_CMD_FLUSH) {
                 request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9);
                 request.len = htonl(size);
         }
-        memcpy(request.handle, &req->tag, sizeof(req->tag));
+        memcpy(request.handle, &tag, sizeof(tag));
 
         dev_dbg(nbd_to_dev(nbd), "request %p: sending control (%s@%llu,%uB)\n",
                 cmd, nbdcmd_to_ascii(type),
                 (unsigned long long)blk_rq_pos(req) << 9, blk_rq_bytes(req));
-        result = sock_xmit(nbd, 1, &request, sizeof(request),
+        result = sock_xmit(nbd, index, 1, &request, sizeof(request),
                         (type == NBD_CMD_WRITE) ? MSG_MORE : 0);
         if (result <= 0) {
                 dev_err(disk_to_dev(nbd->disk),
@@ -318,7 +320,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd)
                                 flags = MSG_MORE;
                         dev_dbg(nbd_to_dev(nbd), "request %p: sending %d bytes data\n",
                                 cmd, bvec.bv_len);
-                        result = sock_send_bvec(nbd, &bvec, flags);
+                        result = sock_send_bvec(nbd, index, &bvec, flags);
                         if (result <= 0) {
                                 dev_err(disk_to_dev(nbd->disk),
                                         "Send data failed (result %d)\n",
@@ -330,31 +332,34 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd)
         return 0;
 }
 
-static inline int sock_recv_bvec(struct nbd_device *nbd, struct bio_vec *bvec)
+static inline int sock_recv_bvec(struct nbd_device *nbd, int index,
+                                 struct bio_vec *bvec)
 {
         int result;
         void *kaddr = kmap(bvec->bv_page);
-        result = sock_xmit(nbd, 0, kaddr + bvec->bv_offset, bvec->bv_len,
-                           MSG_WAITALL);
+        result = sock_xmit(nbd, index, 0, kaddr + bvec->bv_offset,
+                           bvec->bv_len, MSG_WAITALL);
         kunmap(bvec->bv_page);
         return result;
 }
 
 /* NULL returned = something went wrong, inform userspace */
-static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd)
+static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 {
         int result;
         struct nbd_reply reply;
         struct nbd_cmd *cmd;
         struct request *req = NULL;
         u16 hwq;
-        int tag;
+        u32 tag;
 
         reply.magic = 0;
-        result = sock_xmit(nbd, 0, &reply, sizeof(reply), MSG_WAITALL);
+        result = sock_xmit(nbd, index, 0, &reply, sizeof(reply), MSG_WAITALL);
         if (result <= 0) {
-                dev_err(disk_to_dev(nbd->disk),
-                        "Receive control failed (result %d)\n", result);
+                if (!test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) &&
+                    !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
+                        dev_err(disk_to_dev(nbd->disk),
+                                "Receive control failed (result %d)\n", result);
                 return ERR_PTR(result);
         }
 
@@ -364,7 +369,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd)
                 return ERR_PTR(-EPROTO);
         }
 
-        memcpy(&tag, reply.handle, sizeof(int));
+        memcpy(&tag, reply.handle, sizeof(u32));
 
         hwq = blk_mq_unique_tag_to_hwq(tag);
         if (hwq < nbd->tag_set.nr_hw_queues)
@@ -390,7 +395,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd)
                 struct bio_vec bvec;
 
                 rq_for_each_segment(bvec, req, iter) {
-                        result = sock_recv_bvec(nbd, &bvec);
+                        result = sock_recv_bvec(nbd, index, &bvec);
                         if (result <= 0) {
                                 dev_err(disk_to_dev(nbd->disk),
                                         "Receive data failed (result %d)\n",
                                         result);
@@ -418,25 +423,24 @@ static struct device_attribute pid_attr = {
         .show = pid_show,
 };
 
-static int nbd_thread_recv(struct nbd_device *nbd, struct block_device *bdev)
+struct recv_thread_args {
+        struct work_struct work;
+        struct nbd_device *nbd;
+        int index;
+};
+
+static void recv_work(struct work_struct *work)
 {
+        struct recv_thread_args *args = container_of(work,
+                                                     struct recv_thread_args,
+                                                     work);
+        struct nbd_device *nbd = args->nbd;
         struct nbd_cmd *cmd;
-        int ret;
+        int ret = 0;
 
         BUG_ON(nbd->magic != NBD_MAGIC);
-
-        sk_set_memalloc(nbd->sock->sk);
-
-        ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
-        if (ret) {
-                dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
-                return ret;
-        }
-
-        nbd_size_update(nbd, bdev);
-
         while (1) {
-                cmd = nbd_read_stat(nbd);
+                cmd = nbd_read_stat(nbd, args->index);
                 if (IS_ERR(cmd)) {
                         ret = PTR_ERR(cmd);
                         break;
@@ -445,10 +449,14 @@ static int nbd_thread_recv(struct nbd_device *nbd, struct block_device *bdev)
                 nbd_end_request(cmd);
         }
 
-        nbd_size_clear(nbd, bdev);
-
-        device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
-        return ret;
+        /*
+         * We got an error, shut everybody down if this wasn't the result of a
+         * disconnect request.
+         */
+        if (ret && !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
+                sock_shutdown(nbd);
+        atomic_dec(&nbd->recv_threads);
+        wake_up(&nbd->recv_wq);
 }
 
 static void nbd_clear_req(struct request *req, void *data, bool reserved)
@@ -466,26 +474,35 @@ static void nbd_clear_que(struct nbd_device *nbd)
 {
         BUG_ON(nbd->magic != NBD_MAGIC);
 
-        /*
-         * Because we have set nbd->sock to NULL under the tx_lock, all
-         * modifications to the list must have completed by now.
-         */
-        BUG_ON(nbd->sock);
-
         blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
         dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
 }
 
-static void nbd_handle_cmd(struct nbd_cmd *cmd)
+static void nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 {
         struct request *req = blk_mq_rq_from_pdu(cmd);
         struct nbd_device *nbd = cmd->nbd;
+        struct nbd_sock *nsock;
 
-        if (req->cmd_type != REQ_TYPE_FS)
+        if (index >= nbd->num_connections) {
+                dev_err(disk_to_dev(nbd->disk),
+                        "Attempted send on invalid socket\n");
                 goto error_out;
+        }
+
+        if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags)) {
+                dev_err(disk_to_dev(nbd->disk),
+                        "Attempted send on closed socket\n");
+                goto error_out;
+        }
 
-        if (rq_data_dir(req) == WRITE &&
+        if (req->cmd_type != REQ_TYPE_FS &&
+            req->cmd_type != REQ_TYPE_DRV_PRIV)
+                goto error_out;
+
+        if (req->cmd_type == REQ_TYPE_FS &&
+            rq_data_dir(req) == WRITE &&
             (nbd->flags & NBD_FLAG_READ_ONLY)) {
                 dev_err(disk_to_dev(nbd->disk),
                         "Write on read-only\n");
@@ -494,23 +511,22 @@ static void nbd_handle_cmd(struct nbd_cmd *cmd)
 
         req->errors = 0;
 
-        mutex_lock(&nbd->tx_lock);
-        nbd->task_send = current;
-        if (unlikely(!nbd->sock)) {
-                mutex_unlock(&nbd->tx_lock);
+        nsock = nbd->socks[index];
+        mutex_lock(&nsock->tx_lock);
+        if (unlikely(!nsock->sock)) {
+                mutex_unlock(&nsock->tx_lock);
                 dev_err(disk_to_dev(nbd->disk),
                         "Attempted send on closed socket\n");
                 goto error_out;
         }
 
-        if (nbd_send_cmd(nbd, cmd) != 0) {
+        if (nbd_send_cmd(nbd, cmd, index) != 0) {
                 dev_err(disk_to_dev(nbd->disk), "Request send failed\n");
                 req->errors++;
                 nbd_end_request(cmd);
         }
 
-        nbd->task_send = NULL;
-        mutex_unlock(&nbd->tx_lock);
+        mutex_unlock(&nsock->tx_lock);
 
         return;
 
@@ -525,38 +541,57 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
         struct nbd_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
 
         blk_mq_start_request(bd->rq);
-        nbd_handle_cmd(cmd);
+        nbd_handle_cmd(cmd, hctx->queue_num);
         return BLK_MQ_RQ_QUEUE_OK;
 }
 
-static int nbd_set_socket(struct nbd_device *nbd, struct socket *sock)
+static int nbd_add_socket(struct nbd_device *nbd, struct socket *sock)
 {
-        int ret = 0;
+        struct nbd_sock **socks;
+        struct nbd_sock *nsock;
 
-        spin_lock_irq(&nbd->sock_lock);
-
-        if (nbd->sock) {
-                ret = -EBUSY;
-                goto out;
+        if (!nbd->task_setup)
+                nbd->task_setup = current;
+        if (nbd->task_setup != current) {
+                dev_err(disk_to_dev(nbd->disk),
+                        "Device being setup by another task");
+                return -EINVAL;
         }
 
-        nbd->sock = sock;
+        socks = krealloc(nbd->socks, (nbd->num_connections + 1) *
+                         sizeof(struct nbd_sock *), GFP_KERNEL);
+        if (!socks)
+                return -ENOMEM;
+        nsock = kzalloc(sizeof(struct nbd_sock), GFP_KERNEL);
+        if (!nsock)
+                return -ENOMEM;
+
+        nbd->socks = socks;
 
-out:
-        spin_unlock_irq(&nbd->sock_lock);
+        mutex_init(&nsock->tx_lock);
+        nsock->sock = sock;
+        socks[nbd->num_connections++] = nsock;
 
-        return ret;
+        return 0;
 }
 
 /* Reset all properties of an NBD device */
 static void nbd_reset(struct nbd_device *nbd)
 {
+        int i;
+
+        for (i = 0; i < nbd->num_connections; i++)
+                kfree(nbd->socks[i]);
+        kfree(nbd->socks);
+        nbd->socks = NULL;
         nbd->runtime_flags = 0;
         nbd->blksize = 1024;
         nbd->bytesize = 0;
         set_capacity(nbd->disk, 0);
         nbd->flags = 0;
         nbd->tag_set.timeout = 0;
+        nbd->num_connections = 0;
+        nbd->task_setup = NULL;
         queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
 }
 
@@ -582,48 +617,67 @@ static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
                 blk_queue_write_cache(nbd->disk->queue, false, false);
 }
 
+static void send_disconnects(struct nbd_device *nbd)
+{
+        struct nbd_request request = {};
+        int i, ret;
+
+        request.magic = htonl(NBD_REQUEST_MAGIC);
+        request.type = htonl(NBD_CMD_DISC);
+
+        for (i = 0; i < nbd->num_connections; i++) {
+                ret = sock_xmit(nbd, i, 1, &request, sizeof(request), 0);
+                if (ret <= 0)
+                        dev_err(disk_to_dev(nbd->disk),
+                                "Send disconnect failed %d\n", ret);
+        }
+}
+
 static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
 
-/* Must be called with tx_lock held */
-
+/* Must be called with config_lock held */
 static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
                        unsigned int cmd, unsigned long arg)
 {
         switch (cmd) {
         case NBD_DISCONNECT: {
-                struct request *sreq;
-
                 dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
-                if (!nbd->sock)
+                if (!nbd->socks)
                         return -EINVAL;
 
-                sreq = blk_mq_alloc_request(bdev_get_queue(bdev), WRITE, 0);
-                if (!sreq)
-                        return -ENOMEM;
-
-                mutex_unlock(&nbd->tx_lock);
+                mutex_unlock(&nbd->config_lock);
                 fsync_bdev(bdev);
-                mutex_lock(&nbd->tx_lock);
-                sreq->cmd_type = REQ_TYPE_DRV_PRIV;
+                mutex_lock(&nbd->config_lock);
 
                 /* Check again after getting mutex back.  */
-                if (!nbd->sock) {
-                        blk_mq_free_request(sreq);
+                if (!nbd->socks)
                         return -EINVAL;
-                }
-
-                set_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags);
-                nbd_send_cmd(nbd, blk_mq_rq_to_pdu(sreq));
-                blk_mq_free_request(sreq);
+
+                if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED,
+                                      &nbd->runtime_flags))
+                        send_disconnects(nbd);
                 return 0;
         }
-
+
         case NBD_CLEAR_SOCK:
                 sock_shutdown(nbd);
                 nbd_clear_que(nbd);
                 kill_bdev(bdev);
+                nbd_bdev_reset(bdev);
+                /*
+                 * We want to give the run thread a chance to wait for everybody
+                 * to clean up and then do it's own cleanup.
+                 */
+                if (!test_bit(NBD_RUNNING, &nbd->runtime_flags)) {
+                        int i;
+
+                        for (i = 0; i < nbd->num_connections; i++)
+                                kfree(nbd->socks[i]);
+                        kfree(nbd->socks);
+                        nbd->socks = NULL;
+                        nbd->num_connections = 0;
+                }
                 return 0;
 
         case NBD_SET_SOCK: {
@@ -633,7 +687,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
                 if (!sock)
                         return err;
 
-                err = nbd_set_socket(nbd, sock);
+                err = nbd_add_socket(nbd, sock);
                 if (!err && max_part)
                         bdev->bd_invalidated = 1;
 
@@ -662,26 +716,53 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
                 return 0;
 
         case NBD_DO_IT: {
-                int error;
+                struct recv_thread_args *args;
+                int num_connections = nbd->num_connections;
+                int error, i;
 
                 if (nbd->task_recv)
                         return -EBUSY;
-                if (!nbd->sock)
+                if (!nbd->socks)
                         return -EINVAL;
 
-                /* We have to claim the device under the lock */
+                set_bit(NBD_RUNNING, &nbd->runtime_flags);
+                blk_mq_update_nr_hw_queues(&nbd->tag_set, nbd->num_connections);
+                args = kcalloc(num_connections, sizeof(*args), GFP_KERNEL);
+                if (!args)
+                        goto out_err;
                 nbd->task_recv = current;
-                mutex_unlock(&nbd->tx_lock);
+                mutex_unlock(&nbd->config_lock);
 
                 nbd_parse_flags(nbd, bdev);
 
+                error = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
+                if (error) {
+                        dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
+                        goto out_recv;
+                }
+
+                nbd_size_update(nbd, bdev);
+
                 nbd_dev_dbg_init(nbd);
-                error = nbd_thread_recv(nbd, bdev);
+                for (i = 0; i < num_connections; i++) {
+                        sk_set_memalloc(nbd->socks[i]->sock->sk);
+                        atomic_inc(&nbd->recv_threads);
+                        INIT_WORK(&args[i].work, recv_work);
+                        args[i].nbd = nbd;
+                        args[i].index = i;
+                        queue_work(system_long_wq, &args[i].work);
+                }
+                wait_event_interruptible(nbd->recv_wq,
+                                         atomic_read(&nbd->recv_threads) == 0);
+                for (i = 0; i < num_connections; i++)
+                        flush_work(&args[i].work);
                 nbd_dev_dbg_close(nbd);
-
-                mutex_lock(&nbd->tx_lock);
+                nbd_size_clear(nbd, bdev);
+                device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
+out_recv:
+                mutex_lock(&nbd->config_lock);
                 nbd->task_recv = NULL;
-
+out_err:
                 sock_shutdown(nbd);
                 nbd_clear_que(nbd);
                 kill_bdev(bdev);
@@ -694,7 +775,6 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
                         error = -ETIMEDOUT;
 
                 nbd_reset(nbd);
-
                 return error;
         }
 
@@ -726,9 +806,9 @@ static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
 
         BUG_ON(nbd->magic != NBD_MAGIC);
 
-        mutex_lock(&nbd->tx_lock);
+        mutex_lock(&nbd->config_lock);
         error = __nbd_ioctl(bdev, nbd, cmd, arg);
-        mutex_unlock(&nbd->tx_lock);
+        mutex_unlock(&nbd->config_lock);
 
         return error;
 }
@@ -748,8 +828,6 @@ static int nbd_dbg_tasks_show(struct seq_file *s, void *unused)
 
         if (nbd->task_recv)
                 seq_printf(s, "recv: %d\n", task_pid_nr(nbd->task_recv));
-        if (nbd->task_send)
-                seq_printf(s, "send: %d\n", task_pid_nr(nbd->task_send));
 
         return 0;
 }
@@ -873,9 +951,7 @@ static int nbd_init_request(void *data, struct request *rq,
                             unsigned int numa_node)
 {
         struct nbd_cmd *cmd = blk_mq_rq_to_pdu(rq);
-
         cmd->nbd = data;
-        INIT_LIST_HEAD(&cmd->list);
         return 0;
 }
 
@@ -986,13 +1062,13 @@ static int __init nbd_init(void)
         for (i = 0; i < nbds_max; i++) {
                 struct gendisk *disk = nbd_dev[i].disk;
                 nbd_dev[i].magic = NBD_MAGIC;
-                spin_lock_init(&nbd_dev[i].sock_lock);
-                mutex_init(&nbd_dev[i].tx_lock);
+                mutex_init(&nbd_dev[i].config_lock);
                 disk->major = NBD_MAJOR;
                 disk->first_minor = i << part_shift;
                 disk->fops = &nbd_fops;
                 disk->private_data = &nbd_dev[i];
                 sprintf(disk->disk_name, "nbd%d", i);
+                init_waitqueue_head(&nbd_dev[i].recv_wq);
                 nbd_reset(&nbd_dev[i]);
                 add_disk(disk);
         }
NBD can become contended on its single connection.  We have to serialize all
writes and we can only process one read response at a time.  Fix this by
allowing userspace to provide multiple connections to a single nbd device.  This
coupled with block-mq drastically increases performance in multi-process cases.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
V2->V3:
-Fixed a problem with the tag used for the requests.
-Rebased onto the patch that enables async submit.

V1->V2:
-Dropped the index from nbd_cmd and just used the hctx->queue_num as HCH
 suggested.
-Added the pid attribute back to the /sys/block/nbd*/ directory for the
 recv pid.
-Reworked the disconnect to simply send the command on all connections
 instead of sending a special command through the block layer.
-Fixed some of the disconnect handling to be less verbose when we
 specifically request a disconnect.

 drivers/block/nbd.c | 358 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 217 insertions(+), 141 deletions(-)