| Message ID | 20190408104641.4905-2-hch@lst.de (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [1/5] block: rewrite blk_bvec_map_sg to avoid a nth_page call |
Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
On Mon, 2019-04-08 at 12:46 +0200, Christoph Hellwig wrote:
> The offset in scatterlists is allowed to be larger than the page size,
> so don't go to great length to avoid that case and simplify the
> arithmetics.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On Mon, Apr 08, 2019 at 12:46:37PM +0200, Christoph Hellwig wrote:
> The offset in scatterlists is allowed to be larger than the page size,
> so don't go to great length to avoid that case and simplify the
> arithmetics.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-merge.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8f96d683b577..52dbdd089fdf 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -467,26 +467,17 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
>  		struct scatterlist **sg)
>  {
>  	unsigned nbytes = bvec->bv_len;
> -	unsigned nsegs = 0, total = 0, offset = 0;
> +	unsigned nsegs = 0, total = 0;
>
>  	while (nbytes > 0) {
> -		unsigned seg_size;
> -		struct page *pg;
> -		unsigned idx;
> +		unsigned offset = bvec->bv_offset + total;
> +		unsigned len = min(get_max_segment_size(q, offset), nbytes);
>
>  		*sg = blk_next_sg(sg, sglist);
> +		sg_set_page(*sg, bvec->bv_page, len, offset);
>
> -		seg_size = get_max_segment_size(q, bvec->bv_offset + total);
> -		seg_size = min(nbytes, seg_size);
> -
> -		offset = (total + bvec->bv_offset) % PAGE_SIZE;
> -		idx = (total + bvec->bv_offset) / PAGE_SIZE;
> -		pg = bvec_nth_page(bvec->bv_page, idx);
> -
> -		sg_set_page(*sg, pg, seg_size, offset);
> -
> -		total += seg_size;
> -		nbytes -= seg_size;
> +		total += len;
> +		nbytes -= len;
>  		nsegs++;
>  	}

Nice cleanup & simplification:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming
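To see what the rewritten loop does with an offset that exceeds the page size, here is a small userspace model of its arithmetic. This is only an illustrative sketch: `SEG_BOUNDARY`, `get_max_seg()` and `map_bvec_segments()` are stand-ins invented here, not the kernel's real helpers, and the queue's segment limit is assumed to be a simple 64 KiB boundary.

```c
#include <assert.h>

/* Hypothetical stand-in for the queue's segment limit: assume a segment
 * may not cross a 64 KiB boundary, roughly what get_max_segment_size()
 * would report for a queue with a 64 KiB boundary mask. */
#define SEG_BOUNDARY 0x10000u

static unsigned get_max_seg(unsigned offset)
{
	return SEG_BOUNDARY - (offset & (SEG_BOUNDARY - 1));
}

/* Count the scatterlist segments the rewritten loop emits for a bvec of
 * @nbytes bytes starting at @bv_offset; the offset may exceed PAGE_SIZE
 * and is passed on without any % PAGE_SIZE normalization. */
unsigned map_bvec_segments(unsigned bv_offset, unsigned nbytes)
{
	unsigned nsegs = 0, total = 0;

	while (nbytes > 0) {
		unsigned offset = bv_offset + total;
		unsigned len = get_max_seg(offset);

		if (len > nbytes)
			len = nbytes;
		/* the kernel would do sg_set_page(sg, bv_page, len, offset) here */
		total += len;
		nbytes -= len;
		nsegs++;
	}
	return nsegs;
}
```

The point of the rewrite is visible in the loop body: the segment offset is simply `bv_offset + total`, with no `% PAGE_SIZE` normalization and no `nth_page()` lookup, which is exactly why consumers that assumed `sg->offset < PAGE_SIZE` started to break.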
On Mon, Apr 08, 2019 at 12:46:37PM +0200, Christoph Hellwig wrote:
> The offset in scatterlists is allowed to be larger than the page size,
> so don't go to great length to avoid that case and simplify the
> arithmetics.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> ---

This patch causes crashes with various boot tests. Most sparc tests crash,
as well as several arm tests. Bisect results in both cases point to this
patch.

On sparc:

Run /sbin/init as init process
kernel BUG at arch/sparc/lib/bitext.c:112!
              \|/ ____ \|/
              "@'/ ,. \`@"
              /_| \__/ |_\
                 \__U_/
lock_torture_wr(23): Kernel bad trap [#1]
CPU: 0 PID: 23 Comm: lock_torture_wr Not tainted 5.1.0-rc4-next-20190415 #1
PSR: 04401fc0 PC: f04a4e28 NPC: f04a4e2c Y: 00000000    Not tainted
PC: <bit_map_clear+0xe8/0xec>
%G: f000caec 00010003  f05db308 000000fd  00000000 000000fd  f514e000 f062be28
%O: 0000002a f05a41e0  00000070 00000001  f05dd564 f05dd504  f514f9a0 f04a4e20
RPC: <bit_map_clear+0xe0/0xec>
%L: 04001fc1 f00130c4  f00130c8 00000002  00000000 00000004  f514e000 00000547
%I: f500b990 0000003a  00000012 f513e570  00010000 f500b990  f514fa00 f001316c
Disabling lock debugging due to kernel taint
Caller[f001316c]: sbus_iommu_unmap_sg+0x34/0x60
Caller[f02d6170]: scsi_dma_unmap+0xac/0xd0
Caller[f02e2424]: esp_cmd_is_done+0x198/0x1a8
Caller[f02e4150]: scsi_esp_intr+0x191c/0x21e0
Caller[f0055438]: __handle_irq_event_percpu+0x8c/0x128
Caller[f00554e0]: handle_irq_event_percpu+0xc/0x4c
Caller[f0055550]: handle_irq_event+0x30/0x64
Caller[f0058b10]: handle_level_irq+0xb4/0x17c
Caller[f0054c38]: generic_handle_irq+0x30/0x40
Caller[f0009a4c]: handler_irq+0x3c/0x78
Caller[f0007b68]: patch_handler_irq+0x0/0x24
Caller[f004e054]: lock_torture_writer+0xc0/0x1fc
Caller[f003ca18]: kthread+0xd8/0x110
Caller[f00082f0]: ret_from_kernel_thread+0xc/0x38
Caller[00000000]:   (null)
Instruction DUMP:
 113c1690 7fed90ae 901221e0 <91d02005> 9de3bfa0 92102000 9406a01f 90100019 9532a005

Kernel panic - not syncing: Aiee, killing interrupt handler!

On arm:

[   13.836353] Run /sbin/init as init process
[   13.937406] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   13.937736] CPU: 0 PID: 1 Comm: init Not tainted 5.1.0-rc4-next-20190415 #1
[   13.937895] Hardware name: ARM-Versatile Express
[   13.938472] [<c0312a74>] (unwind_backtrace) from [<c030d350>] (show_stack+0x10/0x14)
[   13.938671] [<c030d350>] (show_stack) from [<c1097184>] (dump_stack+0xb4/0xc8)
[   13.938834] [<c1097184>] (dump_stack) from [<c03474c4>] (panic+0x110/0x328)
[   13.939008] [<c03474c4>] (panic) from [<c034c360>] (do_exit+0xbf8/0xc04)
[   13.939161] [<c034c360>] (do_exit) from [<c034d3c0>] (do_group_exit+0x3c/0xbc)
[   13.939323] [<c034d3c0>] (do_group_exit) from [<c0359460>] (get_signal+0x13c/0x9c4)
[   13.939491] [<c0359460>] (get_signal) from [<c030c760>] (do_work_pending+0x198/0x5f0)
[   13.939660] [<c030c760>] (do_work_pending) from [<c030106c>] (slow_work_pending+0xc/0x20)
[   13.939862] Exception stack(0xc703ffb0 to 0xc703fff8)
[   13.940102] ffa0:                                     b6f0e578 00000016 00000015 00000004
[   13.940372] ffc0: b6f0e060 b6f0e578 b6ede000 b6ede360 b6ef5dc0 b6ede95c b6ede2c0 becb6efc
[   13.940613] ffe0: b6f0e060 becb6ec0 b6edf628 b6ee9c00 60000010 ffffffff

Reverting the patch isn't possible since there are subsequent changes
affecting the code, and the image no longer builds after the revert.

Guenter

---
bisect results:

# bad: [f9221a7a1014d8a047b277a73289678646ddc110] Add linux-next specific files for 20190415
# good: [15ade5d2e7775667cf191cf2f94327a4889f8b9d] Linux 5.1-rc4
git bisect start 'HEAD' 'v5.1-rc4'
# good: [34ad076e8efd33658e4367df70a92058de7a870d] Merge remote-tracking branch 'crypto/master'
git bisect good 34ad076e8efd33658e4367df70a92058de7a870d
# bad: [5b51a36f96d773866e1a2165bacfcde58464eb1d] Merge remote-tracking branch 'devicetree/for-next'
git bisect bad 5b51a36f96d773866e1a2165bacfcde58464eb1d
# good: [457109829f4ee4107e8c7108237afba21fabbb5e] Merge branch 'drm-next-5.2' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect good 457109829f4ee4107e8c7108237afba21fabbb5e
# good: [c4af5502948132ca3ce48d047188f8fc2f484dd7] Merge remote-tracking branch 'sound-asoc/for-next'
git bisect good c4af5502948132ca3ce48d047188f8fc2f484dd7
# bad: [722312b700e5fe2e9b16547fc2cb0dbbfb3e345c] Merge remote-tracking branch 'battery/for-next'
git bisect bad 722312b700e5fe2e9b16547fc2cb0dbbfb3e345c
# bad: [da55bae775eb580cd7f1b74850e55cf8880e7984] Merge branch 'for-5.2/block' into for-next
git bisect bad da55bae775eb580cd7f1b74850e55cf8880e7984
# good: [1e815b33c5ccd3936b71292b5ffb84e97e1df9e0] block: sed-opal: fix typos and formatting
git bisect good 1e815b33c5ccd3936b71292b5ffb84e97e1df9e0
# good: [06bda9d56ba3c0ba5ecb41f27da93a417aa442d1] Merge branch 'for-5.2/block' into for-next
git bisect good 06bda9d56ba3c0ba5ecb41f27da93a417aa442d1
# bad: [14eacf12dbc75352fa746dfd9e24de3170ba5ff5] block: don't allow multiple bio_iov_iter_get_pages calls per bio
git bisect bad 14eacf12dbc75352fa746dfd9e24de3170ba5ff5
# good: [ae50640bebc48f1fc0092f16ea004c7c4d12c985] md: use correct type in super_1_sync
git bisect good ae50640bebc48f1fc0092f16ea004c7c4d12c985
# good: [efcd487c69b9d968552a6bf80e7839c4f28b419d] md: add __acquires/__releases annotations to handle_active_stripes
git bisect good efcd487c69b9d968552a6bf80e7839c4f28b419d
# bad: [8a96a0e408102fb7aa73d8aa0b5e2219cfd51e55] block: rewrite blk_bvec_map_sg to avoid a nth_page call
git bisect bad 8a96a0e408102fb7aa73d8aa0b5e2219cfd51e55
# good: [22391ac30ab9cc2ba610bf7ea2244840b83c8017] Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.2/block
git bisect good 22391ac30ab9cc2ba610bf7ea2244840b83c8017
# first bad commit: [8a96a0e408102fb7aa73d8aa0b5e2219cfd51e55] block: rewrite blk_bvec_map_sg to avoid a nth_page call
On Mon, Apr 15, 2019 at 12:44:35PM -0700, Guenter Roeck wrote:
> This patch causes crashes with various boot tests. Most sparc tests crash, as
> well as several arm tests. Bisect results in both cases point to this patch.

That just means we trigger an existing bug more easily now.  I'll see
if I can help with the issues.
On Mon, Apr 15, 2019 at 10:52:42PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 15, 2019 at 12:44:35PM -0700, Guenter Roeck wrote:
> > This patch causes crashes with various boot tests. Most sparc tests crash, as
> > well as several arm tests. Bisect results in both cases point to this patch.
>
> That just means we trigger an existing bug more easily now.  I'll see
> if I can help with the issues.

Code which previously worked reliably no longer does. I would be quite
hesitant to call this "trigger an existing bug more easily". "Regression"
seems to be a more appropriate term - even more so as it seems to cause
'init' crashes, at least on arm.

Guenter
On Mon, Apr 15, 2019 at 02:07:31PM -0700, Guenter Roeck wrote:
> On Mon, Apr 15, 2019 at 10:52:42PM +0200, Christoph Hellwig wrote:
> > On Mon, Apr 15, 2019 at 12:44:35PM -0700, Guenter Roeck wrote:
> > > This patch causes crashes with various boot tests. Most sparc tests crash, as
> > > well as several arm tests. Bisect results in both cases point to this patch.
> >
> > That just means we trigger an existing bug more easily now.  I'll see
> > if I can help with the issues.
>
> Code which previously worked reliably no longer does. I would be quite
> hesitant to call this "trigger an existing bug more easily". "Regression"
> seems to be a more appropriate term - even more so as it seems to cause
> 'init' crashes, at least on arm.

Well, we have these sgls in the wild already, it just is that they
are fairly rare.  For a related fix on a mainstream platform see
here for example:

https://lore.kernel.org/patchwork/patch/1050367/

Below is a rework of the sparc32 iommu code that should avoid your
reported problem.  Please send any other reports to me as well.

diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
index e8d5d73ca40d..93c2fc440cb0 100644
--- a/arch/sparc/mm/iommu.c
+++ b/arch/sparc/mm/iommu.c
@@ -175,16 +175,38 @@ static void iommu_flush_iotlb(iopte_t *iopte, unsigned int niopte)
 	}
 }
 
-static u32 iommu_get_one(struct device *dev, struct page *page, int npages)
+static u32 __sbus_iommu_map_page(struct device *dev, struct page *page, unsigned offset,
+		unsigned len, bool need_flush)
 {
 	struct iommu_struct *iommu = dev->archdata.iommu;
+	phys_addr_t paddr = page_to_phys(page) + offset, p;
+	unsigned long pfn = __phys_to_pfn(paddr);
+	unsigned long off = (unsigned long)paddr & ~PAGE_MASK;
+	unsigned long npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	int ioptex;
 	iopte_t *iopte, *iopte0;
 	unsigned int busa, busa0;
 	int i;
 
+	/* XXX So what is maxphys for us and how do drivers know it? */
+	if (!len || len > 256 * 1024)
+		return DMA_MAPPING_ERROR;
+
+	/*
+	 * We expect unmapped highmem pages to be not in the cache.
+	 * XXX Is this a good assumption?
+	 * XXX What if someone else unmaps it here and races us?
+	 */
+	if (need_flush && !PageHighMem(page)) {
+		for (p = paddr & PAGE_MASK; p < paddr + len; p += PAGE_SIZE) {
+			unsigned long vaddr = (unsigned long)phys_to_virt(p);
+
+			flush_page_for_dma(vaddr);
+		}
+	}
+
 	/* page color = pfn of page */
-	ioptex = bit_map_string_get(&iommu->usemap, npages, page_to_pfn(page));
+	ioptex = bit_map_string_get(&iommu->usemap, npages, pfn);
 	if (ioptex < 0)
 		panic("iommu out");
 	busa0 = iommu->start + (ioptex << PAGE_SHIFT);
@@ -193,11 +215,11 @@ static u32 iommu_get_one(struct device *dev, struct page *page, int npages)
 	busa = busa0;
 	iopte = iopte0;
 	for (i = 0; i < npages; i++) {
-		iopte_val(*iopte) = MKIOPTE(page_to_pfn(page), IOPERM);
+		iopte_val(*iopte) = MKIOPTE(pfn, IOPERM);
 		iommu_invalidate_page(iommu->regs, busa);
 		busa += PAGE_SIZE;
 		iopte++;
-		page++;
+		pfn++;
 	}
 
 	iommu_flush_iotlb(iopte0, npages);
@@ -205,99 +227,62 @@ static u32 iommu_get_one(struct device *dev, struct page *page, int npages)
 	return busa0;
 }
 
-static dma_addr_t __sbus_iommu_map_page(struct device *dev, struct page *page,
-		unsigned long offset, size_t len)
-{
-	void *vaddr = page_address(page) + offset;
-	unsigned long off = (unsigned long)vaddr & ~PAGE_MASK;
-	unsigned long npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-
-	/* XXX So what is maxphys for us and how do drivers know it? */
-	if (!len || len > 256 * 1024)
-		return DMA_MAPPING_ERROR;
-	return iommu_get_one(dev, virt_to_page(vaddr), npages) + off;
-}
-
 static dma_addr_t sbus_iommu_map_page_gflush(struct device *dev,
 		struct page *page, unsigned long offset, size_t len,
 		enum dma_data_direction dir, unsigned long attrs)
 {
 	flush_page_for_dma(0);
-	return __sbus_iommu_map_page(dev, page, offset, len);
+	return __sbus_iommu_map_page(dev, page, offset, len, false);
 }
 
 static dma_addr_t sbus_iommu_map_page_pflush(struct device *dev,
 		struct page *page, unsigned long offset, size_t len,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	void *vaddr = page_address(page) + offset;
-	unsigned long p = ((unsigned long)vaddr) & PAGE_MASK;
-
-	while (p < (unsigned long)vaddr + len) {
-		flush_page_for_dma(p);
-		p += PAGE_SIZE;
-	}
-
-	return __sbus_iommu_map_page(dev, page, offset, len);
+	return __sbus_iommu_map_page(dev, page, offset, len, true);
 }
 
-static int sbus_iommu_map_sg_gflush(struct device *dev, struct scatterlist *sgl,
-		int nents, enum dma_data_direction dir, unsigned long attrs)
+static int __sbus_iommu_map_sg(struct device *dev, struct scatterlist *sgl,
+		int nents, enum dma_data_direction dir, unsigned long attrs,
+		bool need_flush)
 {
 	struct scatterlist *sg;
-	int i, n;
-
-	flush_page_for_dma(0);
+	int i;
 
 	for_each_sg(sgl, sg, nents, i) {
-		n = (sg->length + sg->offset + PAGE_SIZE-1) >> PAGE_SHIFT;
-		sg->dma_address = iommu_get_one(dev, sg_page(sg), n) + sg->offset;
+		sg->dma_address = __sbus_iommu_map_page(dev, sg_page(sg),
+				sg->offset, sg->length, need_flush);
+		if (sg->dma_address == DMA_MAPPING_ERROR)
+			return 0;
 		sg->dma_length = sg->length;
 	}
 
 	return nents;
 }
 
-static int sbus_iommu_map_sg_pflush(struct device *dev, struct scatterlist *sgl,
+static int sbus_iommu_map_sg_gflush(struct device *dev, struct scatterlist *sgl,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-	unsigned long page, oldpage = 0;
-	struct scatterlist *sg;
-	int i, j, n;
-
-	for_each_sg(sgl, sg, nents, j) {
-		n = (sg->length + sg->offset + PAGE_SIZE-1) >> PAGE_SHIFT;
-
-		/*
-		 * We expect unmapped highmem pages to be not in the cache.
-		 * XXX Is this a good assumption?
-		 * XXX What if someone else unmaps it here and races us?
-		 */
-		if ((page = (unsigned long) page_address(sg_page(sg))) != 0) {
-			for (i = 0; i < n; i++) {
-				if (page != oldpage) {	/* Already flushed? */
-					flush_page_for_dma(page);
-					oldpage = page;
-				}
-				page += PAGE_SIZE;
-			}
-		}
-
-		sg->dma_address = iommu_get_one(dev, sg_page(sg), n) + sg->offset;
-		sg->dma_length = sg->length;
-	}
+	flush_page_for_dma(0);
+	return __sbus_iommu_map_sg(dev, sgl, nents, dir, attrs, false);
+}
 
-	return nents;
+static int sbus_iommu_map_sg_pflush(struct device *dev, struct scatterlist *sgl,
+	int nents, enum dma_data_direction dir, unsigned long attrs)
+{
+	return __sbus_iommu_map_sg(dev, sgl, nents, dir, attrs, true);
 }
 
-static void iommu_release_one(struct device *dev, u32 busa, int npages)
+static void __sbus_iommu_unmap_page(struct device *dev, dma_addr_t dma_addr,
+	size_t len)
 {
 	struct iommu_struct *iommu = dev->archdata.iommu;
-	int ioptex;
-	int i;
+	unsigned busa, npages, ioptex, i;
 
+	busa = dma_addr & PAGE_MASK;
 	BUG_ON(busa < iommu->start);
 	ioptex = (busa - iommu->start) >> PAGE_SHIFT;
+	npages = ((dma_addr & ~PAGE_MASK) + len + PAGE_SIZE-1) >> PAGE_SHIFT;
 	for (i = 0; i < npages; i++) {
 		iopte_val(iommu->page_table[ioptex + i]) = 0;
 		iommu_invalidate_page(iommu->regs, busa);
@@ -309,22 +294,17 @@ static void iommu_release_one(struct device *dev, u32 busa, int npages)
 static void sbus_iommu_unmap_page(struct device *dev, dma_addr_t dma_addr,
 		size_t len, enum dma_data_direction dir, unsigned long attrs)
 {
-	unsigned long off = dma_addr & ~PAGE_MASK;
-	int npages;
-
-	npages = (off + len + PAGE_SIZE-1) >> PAGE_SHIFT;
-	iommu_release_one(dev, dma_addr & PAGE_MASK, npages);
+	__sbus_iommu_unmap_page(dev, dma_addr, len);
 }
 
 static void sbus_iommu_unmap_sg(struct device *dev, struct scatterlist *sgl,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
 	struct scatterlist *sg;
-	int i, n;
+	int i;
 
 	for_each_sg(sgl, sg, nents, i) {
-		n = (sg->length + sg->offset + PAGE_SIZE-1) >> PAGE_SHIFT;
-		iommu_release_one(dev, sg->dma_address & PAGE_MASK, n);
+		__sbus_iommu_unmap_page(dev, sg->dma_address, sg->length);
 		sg->dma_address = 0x21212121;
 	}
 }
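The heart of the rework above is that the IOPTE count and the starting pfn are now derived from the physical address of the buffer, so an `offset` larger than `PAGE_SIZE` simply advances the starting page frame. A hypothetical userspace sketch of that arithmetic follows (4 KiB pages assumed for illustration; the helper names are made up here, not the kernel's):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ul << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Mirror of the npages computation in the reworked __sbus_iommu_map_page():
 * keep only the sub-page part of the physical start address, then round
 * the byte span up to whole pages. */
unsigned long iommu_npages(unsigned long paddr, unsigned long len)
{
	unsigned long off = paddr & ~PAGE_MASK;

	return (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* The starting page frame: paddr already includes the (possibly large)
 * offset, so pfn arithmetic replaces the old page-pointer increments. */
unsigned long iommu_start_pfn(unsigned long paddr)
{
	return paddr >> PAGE_SHIFT;
}
```

With the mmc numbers reported later in the thread (offset 18944, length 13824, on a zero-based physical address), the mapping starts at page frame 4 and spans 4 frames: the offset alone already skips the first four pages, which the old `page_address()`-based code could not express.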
On 4/15/19 11:33 PM, Christoph Hellwig wrote:
> On Mon, Apr 15, 2019 at 02:07:31PM -0700, Guenter Roeck wrote:
>> On Mon, Apr 15, 2019 at 10:52:42PM +0200, Christoph Hellwig wrote:
>>> On Mon, Apr 15, 2019 at 12:44:35PM -0700, Guenter Roeck wrote:
>>>> This patch causes crashes with various boot tests. Most sparc tests crash, as
>>>> well as several arm tests. Bisect results in both cases point to this patch.
>>>
>>> That just means we trigger an existing bug more easily now.  I'll see
>>> if I can help with the issues.
>>
>> Code which previously worked reliably no longer does. I would be quite
>> hesitant to call this "trigger an existing bug more easily". "Regression"
>> seems to be a more appropriate term - even more so as it seems to cause
>> 'init' crashes, at least on arm.
>
> Well, we have these sgls in the wild already, it just is that they
> are fairly rare.  For a related fix on a mainstream platform see
> here for example:
>
> https://lore.kernel.org/patchwork/patch/1050367/
>
> Below is a rework of the sparc32 iommu code that should avoid your
> reported problem.  Please send any other reports to me as well.
>

The code below on top of next-20190415 results in

esp ffd398e4: esp0: regs[(ptrval):(ptrval)] irq[2]
esp ffd398e4: esp0: is a FAS100A, 40 MHz (ccf=0), SCSI ID 7
scsi host0: esp
scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
scsi target0:0:0: Beginning Domain Validation
scsi target0:0:0: Domain Validation Initial Inquiry Failed
scsi target0:0:0: Ending Domain Validation
scsi host0: scsi scan: INQUIRY result too short (5), using 36
scsi 0:0:2:0: Direct-Access                                PQ: 0 ANSI: 0
scsi target0:0:2: Beginning Domain Validation
scsi target0:0:2: Domain Validation Initial Inquiry Failed
scsi target0:0:2: Ending Domain Validation
...
sd 0:0:2:0: Device offlined - not ready after error recovery
sd 0:0:2:0: rejecting I/O to offline device

and sometimes:

...
sd 0:0:0:4: [sde] Asking for cache data failed
sd 0:0:0:4: [sde] Assuming drive cache: write through
sd 0:0:0:4: rejecting I/O to offline device
sd 0:0:0:5: Device offlined - not ready after error recovery
Unable to handle kernel NULL pointer dereference
tsk->{mm,active_mm}->context = ffffffff
tsk->{mm,active_mm}->pgd = fc000000
              \|/ ____ \|/
              "@'/ ,. \`@"
              /_| \__/ |_\
                 \__U_/
kworker/u2:6(90): Oops [#1]
CPU: 0 PID: 90 Comm: kworker/u2:6 Not tainted 5.1.0-rc4-next-20190415-00001-g328cdb292761 #1
Workqueue: events_unbound async_run_entry_fn
PSR: 409010c5 PC: f02d2904 NPC: f02d2908 Y: 0000000c    Not tainted
PC: <__scsi_execute+0xc/0x18c>
%G: f5334608 00000000  00000000 00000002  00041200 00000008  f5386000 00000002
%O: f5334608 00000000  f5387c90 f50d3fb0  0000000b f5334a4c  f5387b98 f02d2a38
RPC: <__scsi_execute+0x140/0x18c>
%L: 00000000 00000000  f51b1800 00000001  00000002 00000000  f5386000 f05f9070
%I: 00000200 f5387ca0  00000002 00000000  00000000 00000000  f5387bf8 f02e92a4
Disabling lock debugging due to kernel taint
Caller[f02e92a4]: sd_revalidate_disk+0xe8/0x1f5c
Caller[f02eb418]: sd_probe+0x2b0/0x3f0
Caller[f02bbc98]: really_probe+0x1bc/0x2e8
Caller[f02bc10c]: __driver_attach_async_helper+0x48/0x58
Caller[f003f534]: async_run_entry_fn+0x38/0x124
Caller[f00373bc]: process_one_work+0x168/0x390
Caller[f0037728]: worker_thread+0x144/0x504
Caller[f003c90c]: kthread+0xd8/0x110
Caller[f00082f0]: ret_from_kernel_thread+0xc/0x38
Caller[00000000]:   (null)
Instruction DUMP:
 9de3bfa0 b41ea001 80a0001a <d0062004> 92603fff a4100018 e207a06c e007a074 94102008

Guenter

> diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
> index e8d5d73ca40d..93c2fc440cb0 100644
> [rest of quoted patch trimmed; quoted verbatim from the message above]
On Tue, Apr 16, 2019 at 08:33:56AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 15, 2019 at 02:07:31PM -0700, Guenter Roeck wrote:
> > On Mon, Apr 15, 2019 at 10:52:42PM +0200, Christoph Hellwig wrote:
> > > On Mon, Apr 15, 2019 at 12:44:35PM -0700, Guenter Roeck wrote:
> > > > This patch causes crashes with various boot tests. Most sparc tests crash, as
> > > > well as several arm tests. Bisect results in both cases point to this patch.
> > >
> > > That just means we trigger an existing bug more easily now.  I'll see
> > > if I can help with the issues.
> >
> > Code which previously worked reliably no longer does. I would be quite
> > hesitant to call this "trigger an existing bug more easily". "Regression"
> > seems to be a more appropriate term - even more so as it seems to cause
> > 'init' crashes, at least on arm.
>
> Well, we have these sgls in the wild already, it just is that they

That is beside the point. Your code changes an internal API to be more
stringent and less forgiving. This causes failures, presumably because
callers of that API took advantage (on purpose or not) of it.
When changing an API, you are responsible for both ends. You cannot claim
that the callers of that API are buggy. Taking advantage of a forgiving
API is not a bug. If you change an API, and that change causes a failure,
that is a regression, not a bug on the side of the caller.

On top of that, an API change causing roughly 4% of my boot tests to fail
is a serious regression. Those boot tests don't really do anything besides
trying to boot the system. If 4% of those tests fail, I don't even want to
know what else is going to fail when your patch (or patch series) hits
mainline. Your patch should be reverted until that is resolved.

If making the API more stringent / less forgiving indeed makes sense and
improves code quality and/or performance, the very least would be to change
the code to still accept what it used to accept before but generate a
traceback. That would let people fix the calling code without making
systems unusable. This is even more true with failures like the one I
observed on arm, where your patch causes init to crash without clear
indication of the root cause of that crash.

Guenter
On Tue, Apr 16, 2019 at 10:08:47AM -0700, Guenter Roeck wrote:
> That is beside the point. Your code changes an internal API to be more
> stringent and less forgiving. This causes failures, presumably because
> callers of that API took advantage (on purpose or not) of it.
> When changing an API, you are responsible for both ends. You cannot claim
> that the callers of that API are buggy. Taking advantage of a forgiving
> API is not a bug. If you change an API, and that change causes a failure,
> that is a regression, not a bug on the side of the caller.

As said I offered to fix these, even if this isn't my fault.  I'm also
still waiting for the other reports.
On Tue, Apr 16, 2019 at 07:10:11PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 16, 2019 at 10:08:47AM -0700, Guenter Roeck wrote:
> > That is beside the point. Your code changes an internal API to be more
> > stringent and less forgiving. This causes failures, presumably because
> > callers of that API took advantage (on purpose or not) of it.
> > When changing an API, you are responsible for both ends. You cannot claim
> > that the callers of that API are buggy. Taking advantage of a forgiving
> > API is not a bug. If you change an API, and that change causes a failure,
> > that is a regression, not a bug on the side of the caller.
>
> As said I offered to fix these, even if this isn't my fault.  I'm also

"even if this isn't my fault"

Here is where we disagree. You introduced the change; you are responsible
for its impact, on both ends.

> still waiting for the other reports.

I reported everything I know. To summarize, the following tests are
confirmed to fail due to this patch.

arm:vexpress-a9:multi_v7_defconfig:nolocktests:sd:mem128:vexpress-v2p-ca9:rootfs
arm:vexpress-a15:multi_v7_defconfig:nolocktests:sd:mem128:vexpress-v2p-ca15-tc1:rootfs
arm:vexpress-a15-a7:multi_v7_defconfig:nolocktests:sd:mem256:vexpress-v2p-ca15_a7:rootfs
sparc32:SPARCClassic:nosmp:scsi:hd
sparc32:SPARCbook:nosmp:scsi:cd
sparc32:SS-5:nosmp:scsi:hd
sparc32:SS-10:nosmp:scsi:cd
sparc32:SS-600MP:nosmp:scsi:hd
sparc32:Voyager:nosmp:noapc:scsi:hd
sparc32:SS-4:smp:scsi:hd
sparc32:SS-5:smp:scsi:cd
sparc32:SS-20:smp:scsi:hd
sparc32:SS-600MP:smp:scsi:hd
sparc32:Voyager:smp:noapc:scsi:hd

Detailed logs are available at https://kerneltests.org/builders, and the
test scripts are published at https://github.com/groeck/linux-build-test.

Guenter
Now that I've fixed the sparc32 iommu code in another thread: can you send me your rootfs and qemu arm command line for the failing one? I have a hard time parsing your buildbot output.
On 4/16/19 10:27 PM, Christoph Hellwig wrote:
> Now that I've fixed the sparc32 iommu code in another thread: can
> you send me your rootfs and qemu arm command line for the failing
> one? I have a hard time parsing your buildbot output.
>

Example command line:

qemu-system-arm -M vexpress-a9 -kernel arch/arm/boot/zImage -no-reboot \
    -snapshot -m 128 \
    -drive file=rootfs-armv5.ext2,format=raw,if=sd \
    --append 'panic=-1 root=/dev/mmcblk0 rootwait console=ttyAMA0,115200' \
    -dtb arch/arm/boot/dts/vexpress-v2p-ca9.dtb \
    -nographic -monitor null -serial stdio

Root file system is available from
https://github.com/groeck/linux-build-test/blob/master/rootfs/arm/rootfs-armv5.ext2.gz

Configuration is multi_v7_defconfig; I confirmed that this configuration
is sufficient to see the failure.

Guenter
On Wed, Apr 17, 2019 at 07:27:24AM +0200, Christoph Hellwig wrote:
> Now that I've fixed the sparc32 iommu code in another thread: can
> you send me your rootfs and qemu arm command line for the failing
> one? I have a hard time parsing your buildbot output.

FWIW: mmc_blk_data_prep() calls blk_rq_map_sg() with a large offset value.
The old code translated this into:

blk_bvec_map_sg(q=c77a0000 len=13824 offset=18944)
sg_set_page(sg=c6015000 p=c7efd180 l=13824 o=2560)

The new code leaves offset unchanged:

blk_bvec_map_sg(q=c76c0528 len=13824 offset=18944)
sg_set_page(sg=c6035000 p=c7f2af00 l=13824 o=18944)

Traceback:

[<c065e3d4>] (blk_rq_map_sg) from [<c0ca1444>] (mmc_blk_data_prep+0x1b0/0x2c8)
[<c0ca1444>] (mmc_blk_data_prep) from [<c0ca15ac>] (mmc_blk_rw_rq_prep+0x50/0x178)
[<c0ca15ac>] (mmc_blk_rw_rq_prep) from [<c0ca48bc>] (mmc_blk_mq_issue_rq+0x290/0x878)
[<c0ca48bc>] (mmc_blk_mq_issue_rq) from [<c0ca52e4>] (mmc_mq_queue_rq+0x128/0x234)
[<c0ca52e4>] (mmc_mq_queue_rq) from [<c066350c>] (blk_mq_dispatch_rq_list+0xc8/0x5e8)
[<c066350c>] (blk_mq_dispatch_rq_list) from [<c06681a8>] (blk_mq_do_dispatch_sched+0x60/0xfc)
[<c06681a8>] (blk_mq_do_dispatch_sched) from [<c06688b8>] (blk_mq_sched_dispatch_requests+0x134/0x1b0)
[<c06688b8>] (blk_mq_sched_dispatch_requests) from [<c0661f08>] (__blk_mq_run_hw_queue+0xa4/0x138)
[<c0661f08>] (__blk_mq_run_hw_queue) from [<c03622a0>] (process_one_work+0x218/0x510)
[<c03622a0>] (process_one_work) from [<c0363230>] (worker_thread+0x44/0x5bc)

This results in bad data transfers, which ultimately causes the crash.

Guenter
On Thu, Apr 18, 2019 at 5:59 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On Wed, Apr 17, 2019 at 07:27:24AM +0200, Christoph Hellwig wrote:
> > Now that I've fixed the sparc32 iommu code in another thread: can
> > you send me your rootfs and qemu arm command line for the failing
> > one? I have a hard time parsing your buildbot output.
>
> FWIW: mmc_blk_data_prep() calls blk_rq_map_sg() with a large offset value.
> The old code translated this into:
>
> blk_bvec_map_sg(q=c77a0000 len=13824 offset=18944)
> sg_set_page(sg=c6015000 p=c7efd180 l=13824 o=2560)
>
> The new code leaves offset unchanged:
>
> blk_bvec_map_sg(q=c76c0528 len=13824 offset=18944)
> sg_set_page(sg=c6035000 p=c7f2af00 l=13824 o=18944)
>
> Traceback:
>
> [<c065e3d4>] (blk_rq_map_sg) from [<c0ca1444>] (mmc_blk_data_prep+0x1b0/0x2c8)
> [<c0ca1444>] (mmc_blk_data_prep) from [<c0ca15ac>] (mmc_blk_rw_rq_prep+0x50/0x178)
> [<c0ca15ac>] (mmc_blk_rw_rq_prep) from [<c0ca48bc>] (mmc_blk_mq_issue_rq+0x290/0x878)
> [<c0ca48bc>] (mmc_blk_mq_issue_rq) from [<c0ca52e4>] (mmc_mq_queue_rq+0x128/0x234)
> [<c0ca52e4>] (mmc_mq_queue_rq) from [<c066350c>] (blk_mq_dispatch_rq_list+0xc8/0x5e8)
> [<c066350c>] (blk_mq_dispatch_rq_list) from [<c06681a8>] (blk_mq_do_dispatch_sched+0x60/0xfc)
> [<c06681a8>] (blk_mq_do_dispatch_sched) from [<c06688b8>] (blk_mq_sched_dispatch_requests+0x134/0x1b0)
> [<c06688b8>] (blk_mq_sched_dispatch_requests) from [<c0661f08>] (__blk_mq_run_hw_queue+0xa4/0x138)
> [<c0661f08>] (__blk_mq_run_hw_queue) from [<c03622a0>] (process_one_work+0x218/0x510)
> [<c03622a0>] (process_one_work) from [<c0363230>] (worker_thread+0x44/0x5bc)
>
> This results in bad data transfers, which ultimately causes the crash.

There are several bugs related to kmap(sg_page(sg)), such as:

sdhci_kmap_atomic()
tmio_mmc_kmap_atomic()
wbsd_map_sg()

Thanks,
Ming Lei
On Fri, Apr 19, 2019 at 10:27 AM Ming Lei <tom.leiming@gmail.com> wrote:
>
> On Thu, Apr 18, 2019 at 5:59 AM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > On Wed, Apr 17, 2019 at 07:27:24AM +0200, Christoph Hellwig wrote:
> > > Now that I've fixed the sparc32 iommu code in another thread: can
> > > you send me your rootfs and qemu arm command line for the failing
> > > one? I have a hard time parsing your buildbot output.
> >
> > FWIW: mmc_blk_data_prep() calls blk_rq_map_sg() with a large offset value.
> > The old code translated this into:
> >
> > blk_bvec_map_sg(q=c77a0000 len=13824 offset=18944)
> > sg_set_page(sg=c6015000 p=c7efd180 l=13824 o=2560)
> >
> > The new code leaves offset unchanged:
> >
> > blk_bvec_map_sg(q=c76c0528 len=13824 offset=18944)
> > sg_set_page(sg=c6035000 p=c7f2af00 l=13824 o=18944)
> >
> > Traceback:
> >
> > [<c065e3d4>] (blk_rq_map_sg) from [<c0ca1444>] (mmc_blk_data_prep+0x1b0/0x2c8)
> > [<c0ca1444>] (mmc_blk_data_prep) from [<c0ca15ac>] (mmc_blk_rw_rq_prep+0x50/0x178)
> > [<c0ca15ac>] (mmc_blk_rw_rq_prep) from [<c0ca48bc>] (mmc_blk_mq_issue_rq+0x290/0x878)
> > [<c0ca48bc>] (mmc_blk_mq_issue_rq) from [<c0ca52e4>] (mmc_mq_queue_rq+0x128/0x234)
> > [<c0ca52e4>] (mmc_mq_queue_rq) from [<c066350c>] (blk_mq_dispatch_rq_list+0xc8/0x5e8)
> > [<c066350c>] (blk_mq_dispatch_rq_list) from [<c06681a8>] (blk_mq_do_dispatch_sched+0x60/0xfc)
> > [<c06681a8>] (blk_mq_do_dispatch_sched) from [<c06688b8>] (blk_mq_sched_dispatch_requests+0x134/0x1b0)
> > [<c06688b8>] (blk_mq_sched_dispatch_requests) from [<c0661f08>] (__blk_mq_run_hw_queue+0xa4/0x138)
> > [<c0661f08>] (__blk_mq_run_hw_queue) from [<c03622a0>] (process_one_work+0x218/0x510)
> > [<c03622a0>] (process_one_work) from [<c0363230>] (worker_thread+0x44/0x5bc)
> >
> > This results in bad data transfers, which ultimately causes the crash.
>
> There are several bugs related to kmap(sg_page(sg)), such as:
>
> sdhci_kmap_atomic()
> tmio_mmc_kmap_atomic()
> wbsd_map_sg()

Cc'ing the mmc mailing list:

It looks like there are more such bad uses:

au1xmmc_send_pio()
au1xmmc_receive_pio()
mmc_spi_data_do()
sdricoh_request()

However, it seems tifm_sd.c already handles this issue; see
tifm_sd_transfer_data().

Thanks,
Ming Lei
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8f96d683b577..52dbdd089fdf 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -467,26 +467,17 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 		struct scatterlist **sg)
 {
 	unsigned nbytes = bvec->bv_len;
-	unsigned nsegs = 0, total = 0, offset = 0;
+	unsigned nsegs = 0, total = 0;
 
 	while (nbytes > 0) {
-		unsigned seg_size;
-		struct page *pg;
-		unsigned idx;
+		unsigned offset = bvec->bv_offset + total;
+		unsigned len = min(get_max_segment_size(q, offset), nbytes);
 
 		*sg = blk_next_sg(sg, sglist);
+		sg_set_page(*sg, bvec->bv_page, len, offset);
 
-		seg_size = get_max_segment_size(q, bvec->bv_offset + total);
-		seg_size = min(nbytes, seg_size);
-
-		offset = (total + bvec->bv_offset) % PAGE_SIZE;
-		idx = (total + bvec->bv_offset) / PAGE_SIZE;
-		pg = bvec_nth_page(bvec->bv_page, idx);
-
-		sg_set_page(*sg, pg, seg_size, offset);
-
-		total += seg_size;
-		nbytes -= seg_size;
+		total += len;
+		nbytes -= len;
 		nsegs++;
 	}
The offset in scatterlists is allowed to be larger than the page size,
so don't go to great lengths to avoid that case, and simplify the
arithmetic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)