From patchwork Wed Mar  8 05:11:18 2017
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 9610391
Date: Wed, 8 Mar 2017 14:11:18 +0900
From: Minchan Kim <minchan@kernel.org>
To: Johannes Thumshirn
Cc: Hannes Reinecke, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan@redhat.com,
	Linux Block Layer Mailinglist <linux-block@vger.kernel.org>,
	Linux Kernel Mailinglist
Subject: Re: [PATCH] zram: set physical queue limits to avoid array out of
 bounds accesses
Message-ID: <20170308051118.GA11206@bbox>
References: <20170306102335.9180-1-jthumshirn@suse.de>
 <20170307052242.GA29458@bbox>
 <95c31a93-32cd-ad06-6cc0-e11b42ec2f68@suse.de>
 <20170307085545.GA538@bbox>
 <10a2335c-0ed0-43de-1cbd-625845301aef@suse.de>
In-Reply-To: <10a2335c-0ed0-43de-1cbd-625845301aef@suse.de>
User-Agent: Mutt/1.5.24 (2015-08-30)

Hi Johannes,

On Tue, Mar 07, 2017 at 10:51:45AM +0100, Johannes Thumshirn wrote:
> On 03/07/2017 09:55 AM, Minchan Kim wrote:
> > On Tue, Mar 07, 2017 at 08:48:06AM +0100, Hannes Reinecke wrote:
> >> On 03/07/2017 08:23 AM, Minchan Kim wrote:
> >>> Hi Hannes,
> >>>
> >>> On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke wrote:
> >>>> On 03/07/2017 06:22 AM, Minchan Kim wrote:
> >>>>> Hello Johannes,
> >>>>>
> >>>>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
> >>>>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec.
> >>>>>> When using the NVMe over Fabrics loopback target, which potentially
> >>>>>> sends a huge bulk of pages attached to the bio's bvec, this results
> >>>>>> in a kernel panic because of array out-of-bounds accesses in
> >>>>>> zram_decompress_page().
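
(For reference, the sector/page arithmetic behind SECTORS_PER_PAGE is
roughly the following; a minimal sketch assuming zram's usual
zram_drv.h definitions rather than a quote of the exact source:

    /* 512-byte sectors; with 4 KiB pages, SECTORS_PER_PAGE == 8 */
    #define SECTOR_SHIFT            9
    #define SECTORS_PER_PAGE_SHIFT  (PAGE_SHIFT - SECTOR_SHIFT)
    #define SECTORS_PER_PAGE        (1 << SECTORS_PER_PAGE_SHIFT)

A bvec with bv_len > PAGE_SIZE therefore spans more than one zram page
index, which is what overruns the per-page bookkeeping.)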
> >>>>>
> >>>>> First of all, thanks for the report and the fix-up!
> >>>>> Unfortunately, I'm not familiar with that interface of the block
> >>>>> layer.
> >>>>>
> >>>>> It seems this is material for stable, so I want to understand it
> >>>>> clearly. Could you say something more specific to educate me?
> >>>>>
> >>>>> In what scenario, when, and how is it a problem? It will help me
> >>>>> understand!
> >>>>>
> >>>
> >>> Thanks for the quick response!
> >>>
> >>>> The problem is that zram as it currently stands can only handle bios
> >>>> where each bvec contains a single page (or, to be precise, a chunk
> >>>> of data with a length of a page).
> >>>
> >>> Right.
> >>>
> >>>>
> >>>> This is not an automatic guarantee from the block layer (which is
> >>>> free to send us bios with arbitrary-sized bvecs), so we need to set
> >>>> the queue limits to ensure that.
> >>>
> >>> What does "bios with arbitrary-sized bvecs" mean?
> >>> In what kind of scenario is that used/useful?
> >>>
> >> Each bio contains a list of bvecs, each of which points to a specific
> >> memory area:
> >>
> >> struct bio_vec {
> >>         struct page     *bv_page;
> >>         unsigned int    bv_len;
> >>         unsigned int    bv_offset;
> >> };
> >>
> >> The trick now is that while 'bv_page' does point to a page, the memory
> >> area pointed to might in fact be contiguous (if several pages are
> >> adjacent). Hence we might be getting a bio_vec where bv_len is
> >> _larger_ than a page.
> >
> > Thanks for the detail, Hannes!
> >
> > If I understand it correctly, it seems to be related to bio_add_page
> > with a high-order page. Right?
> >
> > If so, I really wonder why I haven't seen such a problem, because
> > several places use it and I expected some of them might do IO with
> > contiguous pages, intentionally or by chance. Hmm,
> >
> > IIUC, it's not an nvme-specific problem but a general one which could
> > trigger on normal FSes if they used contiguous pages?
> >
>
> I'm not an FS expert, but a quick grep shows that none of the
> file-systems does the
>
>     for_each_sg()
>         while (bio_add_page())
>
> trick NVMe does.

Aah, I see.

> >>
> >> Hence the check for 'is_partial_io' in zram_drv.c (which just does a
> >> test 'if bv_len != PAGE_SIZE') is in fact wrong, as it would trigger
> >> for partial I/O (i.e. if the overall length of the bio_vec is
> >> _smaller_ than a page), but also for multipage bvecs (where the
> >> length of the bio_vec is _larger_ than a page).
> >
> > Right. I need to look into that. Thanks for pointing it out!
> >
> >>
> >> So rather than fixing the bio scanning loop in zram, it's easier to
> >> set the queue limits correctly so that 'is_partial_io' does the
> >> correct thing and the overall logic in zram doesn't need to be
> >> altered.
> >
> >
> > Doesn't that approach require new bio allocations through
> > blk_queue_split()? Maybe it wouldn't cause a severe regression in
> > zram-FS workloads, but it needs testing.
>
> Yes, but blk_queue_split() needs information on how big the bvecs can
> be, hence the patch.
>
> For details, have a look at blk_bio_segment_split() in
> block/blk-merge.c.
>
> It gets the max_sectors from blk_max_size_offset(), which is
> q->limits.max_sectors when q->limits.chunk_sectors isn't set, and then
> loops over the bio's bvecs to check when to split the bio, calling
> bio_split() when appropriate.

Yes, so it causes split bios, which means new bio allocations that were
not needed before.
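
Just to check my understanding, the queue-limits approach would be
something like the sketch below, right? (An illustration with the
generic block-layer helpers, not a quote of your patch.)

    /*
     * Cap what the block layer may hand to zram so that
     * blk_queue_split() cuts anything larger into page-sized
     * pieces before the driver's make_request function sees it.
     */
    blk_queue_max_hw_sectors(zram->disk->queue, SECTORS_PER_PAGE);
    blk_queue_max_segment_size(zram->disk->queue, PAGE_SIZE);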
> > Is there any way to trigger the problem without a real nvme device?
> > It would really help to test/measure zram.
>
> It isn't a /real/ device but the fabrics loopback target. If you want
> a fast, reproducible test-case, have a look at:
>
> https://github.com/ddiss/rapido/
>
> The cut_nvme_local.sh script sets up the correct VM for this test.
> Then a simple mkfs.xfs /dev/nvme0n1 will oops.

Thanks! I will look into that.

And could you test this patch? It avoids splitting the bio, so no new
bio allocations are needed, and it makes the zram code simpler.

From f778d7564d5cd772f25bb181329362c29548a257 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Wed, 8 Mar 2017 13:35:29 +0900
Subject: [PATCH] fix

Not-yet-Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bcb03bacdded..516c3bd97a28 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -147,8 +147,7 @@ static inline bool valid_io_request(struct zram *zram,
 
 static void update_position(u32 *index, int *offset, struct bio_vec *bvec)
 {
-	if (*offset + bvec->bv_len >= PAGE_SIZE)
-		(*index)++;
+	*index += (*offset + bvec->bv_len) / PAGE_SIZE;
 	*offset = (*offset + bvec->bv_len) % PAGE_SIZE;
 }
 
@@ -886,7 +885,7 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
 {
 	int offset;
 	u32 index;
-	struct bio_vec bvec;
+	struct bio_vec bvec, bv;
 	struct bvec_iter iter;
 
 	index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
@@ -900,34 +899,23 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
 	}
 
 	bio_for_each_segment(bvec, bio, iter) {
-		int max_transfer_size = PAGE_SIZE - offset;
+		int remained_size = bvec.bv_len;
+		int transfer_size;
 
-		if (bvec.bv_len > max_transfer_size) {
-			/*
-			 * zram_bvec_rw() can only make operation on a single
-			 * zram page. Split the bio vector.
-			 */
-			struct bio_vec bv;
-
-			bv.bv_page = bvec.bv_page;
-			bv.bv_len = max_transfer_size;
-			bv.bv_offset = bvec.bv_offset;
+		bv.bv_page = bvec.bv_page;
+		bv.bv_offset = bvec.bv_offset;
+		do {
+			transfer_size = min_t(int, PAGE_SIZE, remained_size);
+			bv.bv_len = transfer_size;
 
 			if (zram_bvec_rw(zram, &bv, index, offset,
-				op_is_write(bio_op(bio))) < 0)
-				goto out;
-
-			bv.bv_len = bvec.bv_len - max_transfer_size;
-			bv.bv_offset += max_transfer_size;
-			if (zram_bvec_rw(zram, &bv, index + 1, 0,
-				op_is_write(bio_op(bio))) < 0)
-				goto out;
-		} else
-			if (zram_bvec_rw(zram, &bvec, index, offset,
-					op_is_write(bio_op(bio))) < 0)
+					op_is_write(bio_op(bio))) < 0)
 				goto out;
 
-		update_position(&index, &offset, &bvec);
+			bv.bv_offset += transfer_size;
+			update_position(&index, &offset, &bv);
+			remained_size -= transfer_size;
+		} while (remained_size);
 	}
 
 	bio_endio(bio);
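
To sanity-check the reworked loop, assume PAGE_SIZE == 4096 and a
single 12288-byte bvec starting page-aligned (offset == 0). Then:

    /*
     * Iteration 1: transfer_size = min(4096, 12288) = 4096,
     *              index += (0 + 4096) / 4096 = 1, offset stays 0,
     *              remained_size = 8192
     * Iteration 2: another 4096-byte zram_bvec_rw(), index++,
     *              remained_size = 4096
     * Iteration 3: final 4096 bytes, remained_size reaches 0
     */

So a multipage bvec becomes three page-sized zram_bvec_rw() calls, with
no bio_split() and no extra bio allocation.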