From patchwork Fri Jul 10 13:26:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 11656661 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2FD51618 for ; Fri, 10 Jul 2020 13:26:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1EA56207D0 for ; Fri, 10 Jul 2020 13:26:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727033AbgGJN0V (ORCPT ); Fri, 10 Jul 2020 09:26:21 -0400 Received: from mx2.suse.de ([195.135.220.15]:50054 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726774AbgGJN0V (ORCPT ); Fri, 10 Jul 2020 09:26:21 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 08C05ACBF; Fri, 10 Jul 2020 13:26:20 +0000 (UTC) From: Coly Li To: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, linux-bcache@vger.kernel.org Cc: Coly Li , Chaitanya Kulkarni , Christoph Hellwig , Hannes Reinecke , Jan Kara , Jens Axboe , Mikhail Skorzhinskii , Philipp Reisner , Sagi Grimberg , Vlastimil Babka , stable@vger.kernel.org Subject: [PATCH 1/2] nvme-tpc: don't use sendpage for pages not taking reference counter Date: Fri, 10 Jul 2020 21:26:09 +0800 Message-Id: <20200710132610.11756-1-colyli@suse.de> X-Mailer: git-send-email 2.26.2 MIME-Version: 1.0 Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to send slab pages. But for pages allocated by __get_free_pages() without __GFP_COMP, which also have refcount as 0, they are still sent by kernel_sendpage() to remote end, this is problematic. When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache device, writing meta data e.g. cache_set->disk_buckets to remote SSD may trigger a kernel panic due to the above problem. Bcause the meta data pages for cache_set->disk_buckets are allocated by __get_free_pages() without __GFP_COMP. This problem should be fixed both in upper layer driver (bcache) and nvme-over-tcp code. This patch fixes the nvme-over-tcp code by checking whether the page refcount is 0, if yes then don't use kernel_sendpage() and call sock_no_sendpage() to send the page into network stack. The code comments in this patch is copied and modified from drbd where the similar problem already gets solved by Philipp Reisner. This is the best code comment including my own version. Signed-off-by: Coly Li Cc: Chaitanya Kulkarni Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Jan Kara Cc: Jens Axboe Cc: Mikhail Skorzhinskii Cc: Philipp Reisner Cc: Sagi Grimberg Cc: Vlastimil Babka Cc: stable@vger.kernel.org --- drivers/nvme/host/tcp.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 79ef2b8e2b3c..faa71db7522a 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) else flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST; - /* can't zcopy slab pages */ - if (unlikely(PageSlab(page))) { + /* + * e.g. XFS meta- & log-data is in slab pages, or bcache meta + * data pages, or other high order pages allocated by + * __get_free_pages() without __GFP_COMP, which have a page_count + * of 0 and/or have PageSlab() set. We cannot use send_page for + * those, as that does get_page(); put_page(); and would cause + * either a VM_BUG directly, or __page_cache_release a page that + * would actually still be referenced by someone, leading to some + * obscure delayed Oops somewhere else. + */ + if (unlikely(PageSlab(page) || page_count(page) < 1)) { ret = sock_no_sendpage(queue->sock, page, offset, len, flags); } else { From patchwork Fri Jul 10 13:26:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Coly Li X-Patchwork-Id: 11656665 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A04F7618 for ; Fri, 10 Jul 2020 13:26:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 90F2C20842 for ; Fri, 10 Jul 2020 13:26:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727082AbgGJN01 (ORCPT ); Fri, 10 Jul 2020 09:26:27 -0400 Received: from mx2.suse.de ([195.135.220.15]:50192 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726774AbgGJN00 (ORCPT ); Fri, 10 Jul 2020 09:26:26 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 8E585AD1E; Fri, 10 Jul 2020 13:26:25 +0000 (UTC) From: Coly Li To: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, linux-bcache@vger.kernel.org Cc: Coly Li , Chaitanya Kulkarni , Christoph Hellwig , Hannes Reinecke , Jan Kara , Jens Axboe , Mikhail Skorzhinskii , Philipp Reisner , Sagi Grimberg , Vlastimil Babka , stable@vger.kernel.org Subject: [PATCH 2/2] bcache: allocate meta data pages as compound pages Date: Fri, 10 Jul 2020 21:26:10 +0800 Message-Id: <20200710132610.11756-2-colyli@suse.de> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20200710132610.11756-1-colyli@suse.de> References: <20200710132610.11756-1-colyli@suse.de> MIME-Version: 1.0 Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org There are some meta data of bcache are allocated by multiple pages, and they are used as bio bv_page for I/Os to the cache device. for example cache_set->uuids, cache->disk_buckets, journal_write->data, bset_tree->data. For such meta data memory, all the allocated pages should be treated as a single memory block. Then the memory management and underlying I/O code can treat them more clearly. This patch adds __GFP_COMP flag to all the location allocating >0 order pages for the above mentioned meta data. Then their pages are treated as compound pages now. Signed-off-by: Coly Li Cc: Chaitanya Kulkarni Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Jan Kara Cc: Jens Axboe Cc: Mikhail Skorzhinskii Cc: Philipp Reisner Cc: Sagi Grimberg Cc: Vlastimil Babka Cc: stable@vger.kernel.org --- drivers/md/bcache/bset.c | 2 +- drivers/md/bcache/btree.c | 2 +- drivers/md/bcache/journal.c | 4 ++-- drivers/md/bcache/super.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index 4995fcaefe29..67a2c47f4201 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -322,7 +322,7 @@ int bch_btree_keys_alloc(struct btree_keys *b, b->page_order = page_order; - t->data = (void *) __get_free_pages(gfp, b->page_order); + t->data = (void *) __get_free_pages(__GFP_COMP|gfp, b->page_order); if (!t->data) goto err; diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 6548a601edf0..dd116c83de80 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -785,7 +785,7 @@ int bch_btree_cache_alloc(struct cache_set *c) mutex_init(&c->verify_lock); c->verify_ondisk = (void *) - __get_free_pages(GFP_KERNEL, ilog2(bucket_pages(c))); + __get_free_pages(GFP_KERNEL|__GFP_COMP, ilog2(bucket_pages(c))); c->verify_data = mca_bucket_alloc(c, &ZERO_KEY, GFP_KERNEL); diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 90aac4e2333f..d8586b6ccb76 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -999,8 +999,8 @@ int bch_journal_alloc(struct cache_set *c) j->w[1].c = c; if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) || - !(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)) || - !(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS))) + !(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL|__GFP_COMP, JSET_BITS)) || + !(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL|__GFP_COMP, JSET_BITS))) return -ENOMEM; return 0; diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 2014016f9a60..daa4626024a2 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1776,7 +1776,7 @@ void bch_cache_set_unregister(struct cache_set *c) } #define alloc_bucket_pages(gfp, c) \ - ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c)))) + ((void *) __get_free_pages(__GFP_ZERO|__GFP_COMP|gfp, ilog2(bucket_pages(c)))) struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) {