diff mbox series

bcache: Revert "bcache: use bvec_virt"

Message ID 20211103151041.70516-1-colyli@suse.de (mailing list archive)
State New, archived
Headers show
Series bcache: Revert "bcache: use bvec_virt" | expand

Commit Message

Coly Li Nov. 3, 2021, 3:10 p.m. UTC
This reverts commit 2fd3e5efe791946be0957c8e1eed9560b541fe46.

The above commit replaces page_address(bv->bv_page) by bvec_virt(bv) to
avoid directly access to bv->bv_page, but in situation bv->bv_offset is
not zero and page_address(bv->bv_page) is not equal to bvec_virt(bv). In
such case a memory corruption may happen because memory in next page is
tainted by following line in do_btree_node_write(),
	memcpy(bvec_virt(bv), addr, PAGE_SIZE);

This patch reverts the mentioned commit to avoid the memory corruption.

Fixes: 2fd3e5efe791 ("bcache: use bvec_virt")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 5.15
---
 drivers/md/bcache/btree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christoph Hellwig Nov. 3, 2021, 3:46 p.m. UTC | #1
On Wed, Nov 03, 2021 at 11:10:41PM +0800, Coly Li wrote:
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 93b67b8d31c3..88c573eeb598 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -378,7 +378,7 @@ static void do_btree_node_write(struct btree *b)
>  		struct bvec_iter_all iter_all;
>  
>  		bio_for_each_segment_all(bv, b->bio, iter_all) {
> -			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
> +			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);

How could there be an offset?  bch_bio_alloc_pages allocates a
fresh page for each vec, and bio_for_each_segment_all iterates page
by page.  IFF there is an offset there is proble in the surrounding
code as bch_bio_alloc_pages assumes that it is called on a freshly
allocate and initialized bio.
Coly Li Nov. 3, 2021, 4:11 p.m. UTC | #2
On 11/3/21 11:46 PM, Christoph Hellwig wrote:
> On Wed, Nov 03, 2021 at 11:10:41PM +0800, Coly Li wrote:
>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
>> index 93b67b8d31c3..88c573eeb598 100644
>> --- a/drivers/md/bcache/btree.c
>> +++ b/drivers/md/bcache/btree.c
>> @@ -378,7 +378,7 @@ static void do_btree_node_write(struct btree *b)
>>   		struct bvec_iter_all iter_all;
>>   
>>   		bio_for_each_segment_all(bv, b->bio, iter_all) {
>> -			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
>> +			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
> How could there be an offset?  bch_bio_alloc_pages allocates a
> fresh page for each vec, and bio_for_each_segment_all iterates page
> by page.  IFF there is an offset there is proble in the surrounding
> code as bch_bio_alloc_pages assumes that it is called on a freshly
> allocate and initialized bio.

Yes, the offset is modified in bch_bio_alloc_pages(). Normally the 
bcache defined block size is 4KB so the issue was not triggered 
frequently. I found it during testing my nvdimm enabling code for 
bcache, where I happen to make the bcache defined block size to non-4KB. 
The offset is from the previous written bkey set, which the minimized 
unit size is 1 bcache-defined-block-size.

Coly Li
Christoph Hellwig Nov. 3, 2021, 4:19 p.m. UTC | #3
On Thu, Nov 04, 2021 at 12:11:45AM +0800, Coly Li wrote:
>> fresh page for each vec, and bio_for_each_segment_all iterates page
>> by page.  IFF there is an offset there is proble in the surrounding
>> code as bch_bio_alloc_pages assumes that it is called on a freshly
>> allocate and initialized bio.
>
> Yes, the offset is modified in bch_bio_alloc_pages().

Where?   In my upstream copy of bch_bio_alloc_pages there is no bv_offset
manipulation, and I could not see how such a manipulation would make
sense.

> Normally the bcache 
> defined block size is 4KB so the issue was not triggered frequently. I 
> found it during testing my nvdimm enabling code for bcache, where I happen 
> to make the bcache defined block size to non-4KB. The offset is from the 
> previous written bkey set, which the minimized unit size is 1 
> bcache-defined-block-size.

So you have some out of tree changes here?  Copying a PAGE_SIZE into
a 'segment' bvec just does not make any sense if there is an offset,
as segments are defined as bvecs that do not span page boundaries.

I suspect the best thing to do in do_btree_node_write would be something
like the patch below instead of poking into the internals here, but I'd
also really like to understand the root cause as it does point to a bug
somewhere else.


diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 93b67b8d31c3d..f69914848f32f 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -378,8 +378,8 @@ static void do_btree_node_write(struct btree *b)
 		struct bvec_iter_all iter_all;
 
 		bio_for_each_segment_all(bv, b->bio, iter_all) {
-			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
-			addr += PAGE_SIZE;
+			memcpy_to_bvec(bvec_virt(bv), addr);
+			addr += bv->bv_len;
 		}
 
 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
Coly Li Nov. 4, 2021, 4:25 a.m. UTC | #4
On 11/4/21 12:19 AM, Christoph Hellwig wrote:
> On Thu, Nov 04, 2021 at 12:11:45AM +0800, Coly Li wrote:
>>> fresh page for each vec, and bio_for_each_segment_all iterates page
>>> by page.  IFF there is an offset there is proble in the surrounding
>>> code as bch_bio_alloc_pages assumes that it is called on a freshly
>>> allocate and initialized bio.
>> Yes, the offset is modified in bch_bio_alloc_pages().
> Where?   In my upstream copy of bch_bio_alloc_pages there is no bv_offset
> manipulation, and I could not see how such a manipulation would make
> sense.

Forgive my typo. It was bch_bio_map() before bch_bio_alloc_pages(), both 
in do_btree_node_write() and in util.c, bv->bv_offset is set by,
     bv->bv_offset = base ? offset_in_page(base) : 0;

Here base is bset *i which is initialized in do_btree_node_write() as,
     struct bset *i = btree_bset_last(b);

The unit size of bset is 1 bcache-defined-block size, minimized value is 
512.

>> Normally the bcache
>> defined block size is 4KB so the issue was not triggered frequently. I
>> found it during testing my nvdimm enabling code for bcache, where I happen
>> to make the bcache defined block size to non-4KB. The offset is from the
>> previous written bkey set, which the minimized unit size is 1
>> bcache-defined-block-size.
> So you have some out of tree changes here?  Copying a PAGE_SIZE into
> a 'segment' bvec just does not make any sense if there is an offset,
> as segments are defined as bvecs that do not span page boundaries.
>
> I suspect the best thing to do in do_btree_node_write would be something
> like the patch below instead of poking into the internals here, but I'd
> also really like to understand the root cause as it does point to a bug
> somewhere else.
>
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 93b67b8d31c3d..f69914848f32f 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -378,8 +378,8 @@ static void do_btree_node_write(struct btree *b)
>   		struct bvec_iter_all iter_all;
>   
>   		bio_for_each_segment_all(bv, b->bio, iter_all) {
> -			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
> -			addr += PAGE_SIZE;
> +			memcpy_to_bvec(bvec_virt(bv), addr);
> +			addr += bv->bv_len;
>   		}
>   
>   		bch_submit_bbio(b->bio, b->c, &k.key, 0);

The above change doesn't work, still observe panic[1].

Before calling bio_for_each_segment_all(), void *addr is calculated by,
     void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
which is a page size aligned address.  When writing down a dirty btree 
node, it requires whole page to be written. Your original patch works 
fine when there is not previously unwirtten keys in the page as previous 
bkey set (and corrupts memory when bv->bv_offset is non-zero). The above 
change seems missing the part in offset [0, bv->bv_offset] inside the 
dirty page, I am not sure how the bellowed panic happens by the above 
change, it seems like wild pointer from the missing part of btree node 
when iterating the btree.

If you don't want to directly access members inside struct bio_vec, is 
there something like page_base(vb) which returns bv->bv_page ?

Thanks.

Coly Li



[1] the panic message starts like,

[ 1926.350362] ------------[ cut here ]------------
[ 1926.405603] kernel BUG at ./include/linux/highmem.h:316!
[ 1926.469172] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[ 1926.540006] CPU: 10 PID: 477 Comm: kworker/10:2 Kdump: loaded 
Tainted: G            E     5.15.0-59.27-default+ #57
[ 1926.350362 1926.540009] Hardware name: Lenovo ThinkSystem SR650 
-[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE164L-2.80]- 10/23/2020
[ 1926.540010] Workqueue: bcache bch_data_insert_keys [bcache]
[ 1926.806482] RIP: 0010:__bch_btree_node_write+0x662/0x690 [bcache]
<snipped>
Christoph Hellwig Nov. 4, 2021, 8:23 p.m. UTC | #5
Ok, because it takes the offset away we're not aligned any more.
I think the right fix is to fix this properly and stop poking into
the bio internals.  The patch below has surived and xfstests quick
run on two bcache devices:

---
From 3bf1068e2dc6a75442513e8f9f64055740c0b507 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 4 Nov 2021 10:40:37 +0100
Subject: bcache: lift bvec mapping into bch_bio_alloc_pages

Use the proper block APIs to add pages to the bio and remove the need
for the extra call to bch_bio_map and the open coded data copy for the
write case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/bcache/btree.c     | 31 ++++++++------------------
 drivers/md/bcache/debug.c     |  4 +---
 drivers/md/bcache/movinggc.c  |  7 +++---
 drivers/md/bcache/request.c   |  5 ++---
 drivers/md/bcache/util.c      | 41 +++++++++++++++++++++++------------
 drivers/md/bcache/util.h      |  2 +-
 drivers/md/bcache/writeback.c |  7 +++---
 7 files changed, 48 insertions(+), 49 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 93b67b8d31c3d..c435b8f2bca04 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -339,6 +339,7 @@ static void do_btree_node_write(struct btree *b)
 {
 	struct closure *cl = &b->io;
 	struct bset *i = btree_bset_last(b);
+	size_t size;
 	BKEY_PADDED(key) k;
 
 	i->version	= BCACHE_BSET_VERSION;
@@ -349,9 +350,7 @@ static void do_btree_node_write(struct btree *b)
 
 	b->bio->bi_end_io	= btree_node_write_endio;
 	b->bio->bi_private	= cl;
-	b->bio->bi_iter.bi_size	= roundup(set_bytes(i), block_bytes(b->c->cache));
 	b->bio->bi_opf		= REQ_OP_WRITE | REQ_META | REQ_FUA;
-	bch_bio_map(b->bio, i);
 
 	/*
 	 * If we're appending to a leaf node, we don't technically need FUA -
@@ -372,32 +371,20 @@ static void do_btree_node_write(struct btree *b)
 	SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
 		       bset_sector_offset(&b->keys, i));
 
-	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
-		struct bio_vec *bv;
-		void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
-		struct bvec_iter_all iter_all;
-
-		bio_for_each_segment_all(bv, b->bio, iter_all) {
-			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
-			addr += PAGE_SIZE;
-		}
-
-		bch_submit_bbio(b->bio, b->c, &k.key, 0);
-
-		continue_at(cl, btree_node_write_done, NULL);
-	} else {
-		/*
-		 * No problem for multipage bvec since the bio is
-		 * just allocated
-		 */
-		b->bio->bi_vcnt = 0;
+	size = roundup(set_bytes(i), block_bytes(b->c->cache));
+	if (bch_bio_alloc_pages(b->bio, i, size,
+			__GFP_NOWARN | GFP_NOWAIT) < 0) {
+		b->bio->bi_iter.bi_size	= size;
 		bch_bio_map(b->bio, i);
-
 		bch_submit_bbio(b->bio, b->c, &k.key, 0);
 
 		closure_sync(cl);
 		continue_at_nobarrier(cl, __btree_node_write_done, NULL);
+		return;
 	}
+
+	bch_submit_bbio(b->bio, b->c, &k.key, 0);
+	continue_at(cl, btree_node_write_done, NULL);
 }
 
 void __bch_btree_node_write(struct btree *b, struct closure *parent)
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 6230dfdd9286e..ef8ec8090d579 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -117,10 +117,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	bio_set_dev(check, bio->bi_bdev);
 	check->bi_opf = REQ_OP_READ;
 	check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
-	check->bi_iter.bi_size = bio->bi_iter.bi_size;
 
-	bch_bio_map(check, NULL);
-	if (bch_bio_alloc_pages(check, GFP_NOIO))
+	if (bch_bio_alloc_pages(check, NULL, bio->bi_iter.bi_size, GFP_NOIO))
 		goto out_put;
 
 	submit_bio_wait(check);
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index b9c3d27ec093a..7d94e2ec562a4 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -84,9 +84,7 @@ static void moving_init(struct moving_io *io)
 	bio_get(bio);
 	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 
-	bio->bi_iter.bi_size	= KEY_SIZE(&io->w->key) << 9;
 	bio->bi_private		= &io->cl;
-	bch_bio_map(bio, NULL);
 }
 
 static void write_moving(struct closure *cl)
@@ -97,6 +95,8 @@ static void write_moving(struct closure *cl)
 	if (!op->status) {
 		moving_init(io);
 
+		io->bio.bio.bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
+		bch_bio_map(&io->bio.bio, NULL);
 		io->bio.bio.bi_iter.bi_sector = KEY_START(&io->w->key);
 		op->write_prio		= 1;
 		op->bio			= &io->bio.bio;
@@ -163,7 +163,8 @@ static void read_moving(struct cache_set *c)
 		bio_set_op_attrs(bio, REQ_OP_READ, 0);
 		bio->bi_end_io	= read_moving_endio;
 
-		if (bch_bio_alloc_pages(bio, GFP_KERNEL))
+		if (bch_bio_alloc_pages(bio, NULL, KEY_SIZE(&io->w->key) << 9,
+				GFP_KERNEL))
 			goto err;
 
 		trace_bcache_gc_copy(&w->key);
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index d15aae6c51c13..faa89410c7470 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -921,13 +921,12 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
 
 	cache_bio->bi_iter.bi_sector	= miss->bi_iter.bi_sector;
 	bio_copy_dev(cache_bio, miss);
-	cache_bio->bi_iter.bi_size	= s->insert_bio_sectors << 9;
 
 	cache_bio->bi_end_io	= backing_request_endio;
 	cache_bio->bi_private	= &s->cl;
 
-	bch_bio_map(cache_bio, NULL);
-	if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
+	if (bch_bio_alloc_pages(cache_bio, NULL, s->insert_bio_sectors << 9,
+			__GFP_NOWARN | GFP_NOIO))
 		goto out_put;
 
 	s->cache_miss	= miss;
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index ae380bc3992e3..cf627e7eadc10 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -258,6 +258,8 @@ start:		bv->bv_len	= min_t(size_t, PAGE_SIZE - bv->bv_offset,
 /**
  * bch_bio_alloc_pages - allocates a single page for each bvec in a bio
  * @bio: bio to allocate pages for
+ * @data: if non-NULL copy data from here into the newly allocate pages
+ * @size: size to allocate
  * @gfp_mask: flags for allocation
  *
  * Allocates pages up to @bio->bi_vcnt.
@@ -265,23 +267,34 @@ start:		bv->bv_len	= min_t(size_t, PAGE_SIZE - bv->bv_offset,
  * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are
  * freed.
  */
-int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
+int bch_bio_alloc_pages(struct bio *bio, void *data, size_t size, gfp_t gfp)
 {
-	int i;
-	struct bio_vec *bv;
+	struct bvec_iter iter;
+	struct bio_vec bv;
 
-	/*
-	 * This is called on freshly new bio, so it is safe to access the
-	 * bvec table directly.
-	 */
-	for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
-		bv->bv_page = alloc_page(gfp_mask);
-		if (!bv->bv_page) {
-			while (--bv >= bio->bi_io_vec)
-				__free_page(bv->bv_page);
-			return -ENOMEM;
-		}
+	BUG_ON(bio->bi_vcnt);
+
+	while (size) {
+		struct page *page = alloc_page(gfp);
+		unsigned int offset = offset_in_page(page);
+		size_t len = min(size, PAGE_SIZE - offset);
+
+		if (!page)
+			goto unwind;
+		if (data)
+			memcpy_to_page(page, offset, data, len);
+		if (bio_add_page(bio, page, offset, len) != len)
+			goto unwind;
+
+		size -= len;
+		data += len;
 	}
 
 	return 0;
+
+unwind:
+	bio_for_each_segment(bv, bio, iter)
+		__free_page(bv.bv_page);
+	bio->bi_vcnt = 0;
+	return -ENOMEM;
 }
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 6f3cb7c921303..131d5e874231f 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -557,6 +557,6 @@ static inline unsigned int fract_exp_two(unsigned int x,
 }
 
 void bch_bio_map(struct bio *bio, void *base);
-int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask);
+int bch_bio_alloc_pages(struct bio *bio, void *data, size_t size, gfp_t gfp);
 
 #endif /* _BCACHE_UTIL_H */
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index c7560f66dca88..a153e3a1b3b87 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -297,9 +297,7 @@ static void dirty_init(struct keybuf_key *w)
 	if (!io->dc->writeback_percent)
 		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
 
-	bio->bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
 	bio->bi_private		= w;
-	bch_bio_map(bio, NULL);
 }
 
 static void dirty_io_destructor(struct closure *cl)
@@ -395,6 +393,8 @@ static void write_dirty(struct closure *cl)
 	 */
 	if (KEY_DIRTY(&w->key)) {
 		dirty_init(w);
+		io->bio.bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
+		bch_bio_map(&io->bio, NULL);
 		bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
 		io->bio.bi_iter.bi_sector = KEY_START(&w->key);
 		bio_set_dev(&io->bio, io->dc->bdev);
@@ -513,7 +513,8 @@ static void read_dirty(struct cached_dev *dc)
 			bio_set_dev(&io->bio, dc->disk.c->cache->bdev);
 			io->bio.bi_end_io	= read_dirty_endio;
 
-			if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
+			if (bch_bio_alloc_pages(&io->bio, NULL,
+					KEY_SIZE(&w->key) << 9, GFP_KERNEL))
 				goto err_free;
 
 			trace_bcache_writeback(&w->key);
Coly Li Nov. 6, 2021, 1:36 p.m. UTC | #6
On 11/5/21 4:23 AM, Christoph Hellwig wrote:
> Ok, because it takes the offset away we're not aligned any more.
> I think the right fix is to fix this properly and stop poking into
> the bio internals.  The patch below has surived and xfstests quick
> run on two bcache devices:
>
> ---
>  From 3bf1068e2dc6a75442513e8f9f64055740c0b507 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 4 Nov 2021 10:40:37 +0100
> Subject: bcache: lift bvec mapping into bch_bio_alloc_pages
>
> Use the proper block APIs to add pages to the bio and remove the need
> for the extra call to bch_bio_map and the open coded data copy for the
> write case.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Hmm, I apply this patch on Linux v5.15 (commit 8bb7eca972ad), still 
happen to observe panic on my system.

How about let's revert the original patch firstly, then I'd like to test 
your further patch for the 5.17 merge window? I have report from 
community user for the regression already.

There is the panic message I observed,

[  655.974283] BUG: kernel NULL pointer dereference, address: 
0000000000000800
[  656.057593] #PF: supervisor read access in kernel mode
[  656.119062] #PF: error_code(0x0000) - not-present page
[  656.180532] PGD 0 P4D 0
[  656.210804] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[  656.271231] CPU: 31 PID: 7728 Comm: systemd-udevd Kdump: loaded 
Tainted: G            E     5.15.0-59.27-default+ #14
[  656.398220] Hardware name: Lenovo ThinkSystem SR650 
-[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE164L-2.80]- 10/23/2020
[  656.523129] RIP: 0010:bch_bio_alloc_pages+0xd0/0x1e0 [bcache]
[  656.591900] Code: 89 c6 48 89 ef e8 90 96 ec df 48 98 48 39 d8 75 61 
49 01 df 49 29 dc 0f 85 72 ff ff ff 5b 31 c0 5d 41 5c 41 5d 41 5e 41 5f 
c3 <49> 8b 37 48 89 31 89 de 49 8b 7c 37 f8 48 89 7c 31 f8 48 8d 79 08
[  656.816646] RSP: 0018:ffffacc78728b730 EFLAGS: 00010216
[  656.879156] RAX: fffff4d02d1aee40 RBX: 00000000000001c0 RCX: 
ffffa120c6bb9e40
[  656.964546] RDX: 0000000000000e40 RSI: 0000000000000e40 RDI: 
0000000000001000
[  657.049932] RBP: ffffa122417e2d60 R08: 0000000000000001 R09: 
0000000000000000
[  657.135322] R10: ffffacc78728b528 R11: 0000000000000000 R12: 
0000000000000800
[  657.220713] R13: 0000000000002c00 R14: 0000000000001000 R15: 
0000000000000800
[  657.306102] FS:  00007f140222d980(0000) GS:ffffa1305fc00000(0000) 
knlGS:0000000000000000
[  657.402931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  657.471680] CR2: 0000000000000800 CR3: 0000004ba1188002 CR4: 
00000000007706e0
[  657.557070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  657.642457] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  657.727848] PKRU: 55555554
[  657.760197] Call Trace:
[  657.789431]  ? detached_dev_end_io+0x60/0x60 [bcache]
[  657.849870]  cached_dev_cache_miss+0x1a1/0x290 [bcache]
[  657.912389]  ? request_endio+0x30/0x30 [bcache]
[  657.966584]  cache_lookup_fn+0x99/0x310 [bcache]
[  658.021820]  ? request_endio+0x30/0x30 [bcache]
[  658.076016]  ? bch_btree_iter_next_filter+0x1b3/0x320 [bcache]
[  658.145811]  ? request_endio+0x30/0x30 [bcache]
[  658.200007]  bch_btree_map_keys_recurse+0xf1/0x180 [bcache]
[  658.266685]  ? lock_acquire+0x1df/0x300
[  658.312554]  ? rcu_read_lock_sched_held+0x23/0x80
[  658.368825]  ? lock_acquired+0x183/0x350
[  658.415734]  bch_btree_map_keys+0xf4/0x150 [bcache]
[  658.474088]  ? request_endio+0x30/0x30 [bcache]
[  658.528286]  cache_lookup+0xb9/0x1a0 [bcache]
[  658.580403]  cached_dev_submit_bio+0x7e2/0x1030 [bcache]
[  658.643959]  ? submit_bio_checks+0x596/0x670
[  658.695037]  __submit_bio+0x1bf/0x320
[  658.738830]  ? do_mpage_readpage+0x58f/0x800
[  658.789900]  submit_bio_noacct+0xfe/0x2d0
[  658.837851]  ? submit_bio+0x42/0x130
[  658.880599]  submit_bio+0x42/0x130
[  658.921269]  mpage_readahead+0x163/0x1b0
[  658.968182]  ? blkdev_direct_IO+0x4c0/0x4c0
[  659.018217]  read_pages+0x9f/0x2f0
[  659.058894]  ? page_cache_ra_unbounded+0x162/0x240
[  659.116203]  page_cache_ra_unbounded+0x162/0x240
[  659.171434]  filemap_get_pages+0xd3/0x640
[  659.219389]  ? atime_needs_update+0x89/0xf0
[  659.269419]  filemap_read+0xc2/0x370
[  659.312166]  ? rcu_read_lock_held_common+0xe/0x40
[  659.368438]  ? rcu_read_lock_sched_held+0x23/0x80
[  659.424708]  ? print_usage_bug+0x190/0x1d0
[  659.473697]  ? aa_file_perm+0x1ba/0x7a0
[  659.519567]  ? rcu_read_lock_sched_held+0x23/0x80
[  659.575837]  blkdev_read_iter+0x41/0x50
[  659.621707]  new_sync_read+0x11e/0x1b0
[  659.666539]  vfs_read+0x1a2/0x1d0
[  659.706167]  ksys_read+0xa7/0xe0
[  659.744756]  do_syscall_64+0x3a/0x80
[  659.787507]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  659.847935] RIP: 0033:0x7f1401328ffe
[  659.890684] Code: 48 8b 15 d5 7f 20 00 f7 d8 64 89 02 48 c7 c0 ff ff 
ff ff eb b6 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 
05 <48> 3d 00 f0 ff ff 77 5a f3 c3 0f 1f 84 00 00 00 00 00 41 54 55 49
[  660.115432] RSP: 002b:00007ffda20256c8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000000
[  660.206023] RAX: ffffffffffffffda RBX: 000055bd9e8816d0 RCX: 
00007f1401328ffe
[  660.291409] RDX: 0000000000000200 RSI: 000055bd9e700c08 RDI: 
000000000000000f
[  660.376800] RBP: 0000000000200000 R08: 000055bd9e700be0 R09: 
0000000000000000
[  660.462190] R10: 000055bd9e6b4010 R11: 0000000000000246 R12: 
0000000000000200
[  660.547577] R13: 000055bd9e881720 R14: 000055bd9e700bf8 R15: 
000055bd9e700be0


Coly Li

> ---
>   drivers/md/bcache/btree.c     | 31 ++++++++------------------
>   drivers/md/bcache/debug.c     |  4 +---
>   drivers/md/bcache/movinggc.c  |  7 +++---
>   drivers/md/bcache/request.c   |  5 ++---
>   drivers/md/bcache/util.c      | 41 +++++++++++++++++++++++------------
>   drivers/md/bcache/util.h      |  2 +-
>   drivers/md/bcache/writeback.c |  7 +++---
>   7 files changed, 48 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 93b67b8d31c3d..c435b8f2bca04 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -339,6 +339,7 @@ static void do_btree_node_write(struct btree *b)
>   {
>   	struct closure *cl = &b->io;
>   	struct bset *i = btree_bset_last(b);
> +	size_t size;
>   	BKEY_PADDED(key) k;
>   
>   	i->version	= BCACHE_BSET_VERSION;
> @@ -349,9 +350,7 @@ static void do_btree_node_write(struct btree *b)
>   
>   	b->bio->bi_end_io	= btree_node_write_endio;
>   	b->bio->bi_private	= cl;
> -	b->bio->bi_iter.bi_size	= roundup(set_bytes(i), block_bytes(b->c->cache));
>   	b->bio->bi_opf		= REQ_OP_WRITE | REQ_META | REQ_FUA;
> -	bch_bio_map(b->bio, i);
>   
>   	/*
>   	 * If we're appending to a leaf node, we don't technically need FUA -
> @@ -372,32 +371,20 @@ static void do_btree_node_write(struct btree *b)
>   	SET_PTR_OFFSET(&k.key, 0, PTR_OFFSET(&k.key, 0) +
>   		       bset_sector_offset(&b->keys, i));
>   
> -	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
> -		struct bio_vec *bv;
> -		void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
> -		struct bvec_iter_all iter_all;
> -
> -		bio_for_each_segment_all(bv, b->bio, iter_all) {
> -			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
> -			addr += PAGE_SIZE;
> -		}
> -
> -		bch_submit_bbio(b->bio, b->c, &k.key, 0);
> -
> -		continue_at(cl, btree_node_write_done, NULL);
> -	} else {
> -		/*
> -		 * No problem for multipage bvec since the bio is
> -		 * just allocated
> -		 */
> -		b->bio->bi_vcnt = 0;
> +	size = roundup(set_bytes(i), block_bytes(b->c->cache));
> +	if (bch_bio_alloc_pages(b->bio, i, size,
> +			__GFP_NOWARN | GFP_NOWAIT) < 0) {
> +		b->bio->bi_iter.bi_size	= size;
>   		bch_bio_map(b->bio, i);
> -
>   		bch_submit_bbio(b->bio, b->c, &k.key, 0);
>   
>   		closure_sync(cl);
>   		continue_at_nobarrier(cl, __btree_node_write_done, NULL);
> +		return;
>   	}
> +
> +	bch_submit_bbio(b->bio, b->c, &k.key, 0);
> +	continue_at(cl, btree_node_write_done, NULL);
>   }
>   
>   void __bch_btree_node_write(struct btree *b, struct closure *parent)
> diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
> index 6230dfdd9286e..ef8ec8090d579 100644
> --- a/drivers/md/bcache/debug.c
> +++ b/drivers/md/bcache/debug.c
> @@ -117,10 +117,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
>   	bio_set_dev(check, bio->bi_bdev);
>   	check->bi_opf = REQ_OP_READ;
>   	check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
> -	check->bi_iter.bi_size = bio->bi_iter.bi_size;
>   
> -	bch_bio_map(check, NULL);
> -	if (bch_bio_alloc_pages(check, GFP_NOIO))
> +	if (bch_bio_alloc_pages(check, NULL, bio->bi_iter.bi_size, GFP_NOIO))
>   		goto out_put;
>   
>   	submit_bio_wait(check);
> diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
> index b9c3d27ec093a..7d94e2ec562a4 100644
> --- a/drivers/md/bcache/movinggc.c
> +++ b/drivers/md/bcache/movinggc.c
> @@ -84,9 +84,7 @@ static void moving_init(struct moving_io *io)
>   	bio_get(bio);
>   	bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
>   
> -	bio->bi_iter.bi_size	= KEY_SIZE(&io->w->key) << 9;
>   	bio->bi_private		= &io->cl;
> -	bch_bio_map(bio, NULL);
>   }
>   
>   static void write_moving(struct closure *cl)
> @@ -97,6 +95,8 @@ static void write_moving(struct closure *cl)
>   	if (!op->status) {
>   		moving_init(io);
>   
> +		io->bio.bio.bi_iter.bi_size = KEY_SIZE(&io->w->key) << 9;
> +		bch_bio_map(&io->bio.bio, NULL);
>   		io->bio.bio.bi_iter.bi_sector = KEY_START(&io->w->key);
>   		op->write_prio		= 1;
>   		op->bio			= &io->bio.bio;
> @@ -163,7 +163,8 @@ static void read_moving(struct cache_set *c)
>   		bio_set_op_attrs(bio, REQ_OP_READ, 0);
>   		bio->bi_end_io	= read_moving_endio;
>   
> -		if (bch_bio_alloc_pages(bio, GFP_KERNEL))
> +		if (bch_bio_alloc_pages(bio, NULL, KEY_SIZE(&io->w->key) << 9,
> +				GFP_KERNEL))
>   			goto err;
>   
>   		trace_bcache_gc_copy(&w->key);
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index d15aae6c51c13..faa89410c7470 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -921,13 +921,12 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
>   
>   	cache_bio->bi_iter.bi_sector	= miss->bi_iter.bi_sector;
>   	bio_copy_dev(cache_bio, miss);
> -	cache_bio->bi_iter.bi_size	= s->insert_bio_sectors << 9;
>   
>   	cache_bio->bi_end_io	= backing_request_endio;
>   	cache_bio->bi_private	= &s->cl;
>   
> -	bch_bio_map(cache_bio, NULL);
> -	if (bch_bio_alloc_pages(cache_bio, __GFP_NOWARN|GFP_NOIO))
> +	if (bch_bio_alloc_pages(cache_bio, NULL, s->insert_bio_sectors << 9,
> +			__GFP_NOWARN | GFP_NOIO))
>   		goto out_put;
>   
>   	s->cache_miss	= miss;
> diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> index ae380bc3992e3..cf627e7eadc10 100644
> --- a/drivers/md/bcache/util.c
> +++ b/drivers/md/bcache/util.c
> @@ -258,6 +258,8 @@ start:		bv->bv_len	= min_t(size_t, PAGE_SIZE - bv->bv_offset,
>   /**
>    * bch_bio_alloc_pages - allocates a single page for each bvec in a bio
>    * @bio: bio to allocate pages for
> + * @data: if non-NULL copy data from here into the newly allocate pages
> + * @size: size to allocate
>    * @gfp_mask: flags for allocation
>    *
>    * Allocates pages up to @bio->bi_vcnt.
> @@ -265,23 +267,34 @@ start:		bv->bv_len	= min_t(size_t, PAGE_SIZE - bv->bv_offset,
>    * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are
>    * freed.
>    */
> -int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
> +int bch_bio_alloc_pages(struct bio *bio, void *data, size_t size, gfp_t gfp)
>   {
> -	int i;
> -	struct bio_vec *bv;
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
>   
> -	/*
> -	 * This is called on freshly new bio, so it is safe to access the
> -	 * bvec table directly.
> -	 */
> -	for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
> -		bv->bv_page = alloc_page(gfp_mask);
> -		if (!bv->bv_page) {
> -			while (--bv >= bio->bi_io_vec)
> -				__free_page(bv->bv_page);
> -			return -ENOMEM;
> -		}
> +	BUG_ON(bio->bi_vcnt);
> +
> +	while (size) {
> +		struct page *page = alloc_page(gfp);
> +		unsigned int offset = offset_in_page(page);
> +		size_t len = min(size, PAGE_SIZE - offset);
> +
> +		if (!page)
> +			goto unwind;
> +		if (data)
> +			memcpy_to_page(page, offset, data, len);
> +		if (bio_add_page(bio, page, offset, len) != len)
> +			goto unwind;
> +
> +		size -= len;
> +		data += len;
>   	}
>   
>   	return 0;
> +
> +unwind:
> +	bio_for_each_segment(bv, bio, iter)
> +		__free_page(bv.bv_page);
> +	bio->bi_vcnt = 0;
> +	return -ENOMEM;
>   }
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index 6f3cb7c921303..131d5e874231f 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -557,6 +557,6 @@ static inline unsigned int fract_exp_two(unsigned int x,
>   }
>   
>   void bch_bio_map(struct bio *bio, void *base);
> -int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask);
> +int bch_bio_alloc_pages(struct bio *bio, void *data, size_t size, gfp_t gfp);
>   
>   #endif /* _BCACHE_UTIL_H */
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index c7560f66dca88..a153e3a1b3b87 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -297,9 +297,7 @@ static void dirty_init(struct keybuf_key *w)
>   	if (!io->dc->writeback_percent)
>   		bio_set_prio(bio, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
>   
> -	bio->bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
>   	bio->bi_private		= w;
> -	bch_bio_map(bio, NULL);
>   }
>   
>   static void dirty_io_destructor(struct closure *cl)
> @@ -395,6 +393,8 @@ static void write_dirty(struct closure *cl)
>   	 */
>   	if (KEY_DIRTY(&w->key)) {
>   		dirty_init(w);
> +		io->bio.bi_iter.bi_size	= KEY_SIZE(&w->key) << 9;
> +		bch_bio_map(&io->bio, NULL);
>   		bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
>   		io->bio.bi_iter.bi_sector = KEY_START(&w->key);
>   		bio_set_dev(&io->bio, io->dc->bdev);
> @@ -513,7 +513,8 @@ static void read_dirty(struct cached_dev *dc)
>   			bio_set_dev(&io->bio, dc->disk.c->cache->bdev);
>   			io->bio.bi_end_io	= read_dirty_endio;
>   
> -			if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
> +			if (bch_bio_alloc_pages(&io->bio, NULL,
> +					KEY_SIZE(&w->key) << 9, GFP_KERNEL))
>   				goto err_free;
>   
>   			trace_bcache_writeback(&w->key);
Coly Li Nov. 8, 2021, 8:16 a.m. UTC | #7
On 11/3/21 11:10 PM, Coly Li wrote:
> This reverts commit 2fd3e5efe791946be0957c8e1eed9560b541fe46.
>
> The above commit replaces page_address(bv->bv_page) by bvec_virt(bv) to
> avoid directly access to bv->bv_page, but in situation bv->bv_offset is
> not zero and page_address(bv->bv_page) is not equal to bvec_virt(bv). In
> such case a memory corruption may happen because memory in next page is
> tainted by following line in do_btree_node_write(),
> 	memcpy(bvec_virt(bv), addr, PAGE_SIZE);
>
> This patch reverts the mentioned commit to avoid the memory corruption.
>
> Fixes: 2fd3e5efe791 ("bcache: use bvec_virt")
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: stable@vger.kernel.org # 5.15
> ---
>   drivers/md/bcache/btree.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 93b67b8d31c3..88c573eeb598 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -378,7 +378,7 @@ static void do_btree_node_write(struct btree *b)
>   		struct bvec_iter_all iter_all;
>   
>   		bio_for_each_segment_all(bv, b->bio, iter_all) {
> -			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
> +			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
>   			addr += PAGE_SIZE;
>   		}
>   

Hi Jens,

Since now we don't have a better alternative patch yet, and this issue 
should be fixed ASAP. Could you please take it firstly. And I will work 
with Christoph for a better change (maybe large and not trivial) later 
for next merge window?

Thanks.

Coly Li
Jens Axboe Nov. 8, 2021, 1:23 p.m. UTC | #8
On Wed, 3 Nov 2021 23:10:41 +0800, Coly Li wrote:
> This reverts commit 2fd3e5efe791946be0957c8e1eed9560b541fe46.
> 
> The above commit replaces page_address(bv->bv_page) by bvec_virt(bv) to
> avoid directly access to bv->bv_page, but in situation bv->bv_offset is
> not zero and page_address(bv->bv_page) is not equal to bvec_virt(bv). In
> such case a memory corruption may happen because memory in next page is
> tainted by following line in do_btree_node_write(),
> 	memcpy(bvec_virt(bv), addr, PAGE_SIZE);
> 
> [...]

Applied, thanks!

[1/1] bcache: Revert "bcache: use bvec_virt"
      commit: 2878feaed543c35f9dbbe6d8ce36fb67ac803eef

Best regards,
Christoph Hellwig Nov. 9, 2021, 8:03 a.m. UTC | #9
On Mon, Nov 08, 2021 at 04:16:51PM +0800, Coly Li wrote:
> Since now we don't have a better alternative patch yet, and this issue 
> should be fixed ASAP. Could you please take it firstly. And I will work 
> with Christoph for a better change (maybe large and not trivial) later for 
> next merge window?

fine with me.
diff mbox series

Patch

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 93b67b8d31c3..88c573eeb598 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -378,7 +378,7 @@  static void do_btree_node_write(struct btree *b)
 		struct bvec_iter_all iter_all;
 
 		bio_for_each_segment_all(bv, b->bio, iter_all) {
-			memcpy(bvec_virt(bv), addr, PAGE_SIZE);
+			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
 			addr += PAGE_SIZE;
 		}