One-liner patch for prd to avoid "kernel BUG at drivers/block/pmem.c:176!"
diff mbox

Message ID CALZRu8zgA+kHCM-QWtapWE7M0qjpsugTUDQfnU3TU80F+O-s5Q@mail.gmail.com
State New, archived
Headers show

Commit Message

Roger C. Pao (Enmotus) Feb. 11, 2015, 10:15 p.m. UTC
[rcpao@test27 Downloads]$ cat prd-20150210a-blk_queue_max_segment_size.patch
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 1bd9ab0..0b3c8d6 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -297,6 +299,7 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	blk_queue_max_segment_size(pmem->pmem_queue, PAGE_SIZE);
 
 	disk = alloc_disk(0);
 	if (unlikely(!disk)) {
[rcpao@test27 Downloads]$

I'm just the messenger.  Patch provided by
From: Hugh Daschbach <hugh.daschbach@enmotus.com>
Date: Tue, 10 Feb 2015 20:27:08 -0800

--

Feb 10 09:43:42 test27.engr.enmotus.com kernel: ------------[ cut here ]------------
Feb 10 09:43:42 test27.engr.enmotus.com kernel: kernel BUG at drivers/block/pmem.c:176!
Feb 10 09:43:43 test27.engr.enmotus.com kernel: invalid opcode: 0000 [#2] SMP
Feb 10 09:43:43 test27.engr.enmotus.com kernel: Modules linked in: en_tier(PO) en_block(O) des3_ede_x86_64 des_generic md4 nls_utf8 cifs rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache pmem bnep bluetooth rfkill fuse xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul iTCO_wdt joydev iTCO_vendor_support crc32_pclmul crc32c_intel ghash_clmulni_intel mei_me sb_edac pcspkr mei lpc_ich nfsd i2c_i801 ipmi_si tpm_tis edac_core ipmi_msghandler tpm mfd_core shpchp wmi ioatdma auth_rpcgss nfs_acl lockd grace
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  sunrpc mgag200 drm_kms_helper ttm isci drm igb libsas ptp pps_core dca scsi_transport_sas i2c_algo_bit [last unloaded: en_block]
Feb 10 09:43:43 test27.engr.enmotus.com kernel: CPU: 7 PID: 20863 Comm: evsp Tainted: P      D    O   3.18.0-rc1+ #1
Feb 10 09:43:43 test27.engr.enmotus.com kernel: Hardware name: Supermicro X9DRH-iF-NV/X9DRH-iF-NV, BIOS 3.1  06/27/2014
Feb 10 09:43:43 test27.engr.enmotus.com kernel: task: ffff880333920000 ti: ffff88032024c000 task.ti: ffff88032024c000
Feb 10 09:43:43 test27.engr.enmotus.com kernel: RIP: 0010:[<ffffffffa04105f0>]  [<ffffffffa04105f0>] pmem_make_request+0x160/0x1c0 [pmem]
Feb 10 09:43:43 test27.engr.enmotus.com kernel: RSP: 0018:ffff88032024fa68  EFLAGS: 00010206
Feb 10 09:43:43 test27.engr.enmotus.com kernel: RAX: ffff88006a891b88 RBX: 0000000000002000 RCX: 0000000000000000
Feb 10 09:43:43 test27.engr.enmotus.com kernel: RDX: ffff880457e0e400 RSI: ffffea000cd38000 RDI: 0000000000000001
Feb 10 09:43:43 test27.engr.enmotus.com kernel: RBP: ffff88032024fac8 R08: 0000000000000000 R09: 0000000000000400
Feb 10 09:43:43 test27.engr.enmotus.com kernel: R10: ffff88006a891b00 R11: 0000000000000000 R12: 0000000000002000
Feb 10 09:43:43 test27.engr.enmotus.com kernel: R13: 0000000000000000 R14: 00000000007fff99 R15: 0000000000000000
Feb 10 09:43:43 test27.engr.enmotus.com kernel: FS:  00007f07166b9800(0000) GS:ffff88047fdc0000(0000) knlGS:0000000000000000
Feb 10 09:43:43 test27.engr.enmotus.com kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 10 09:43:43 test27.engr.enmotus.com kernel: CR2: 00007fff5cfa9b10 CR3: 00000003338c1000 CR4: 00000000001407e0
Feb 10 09:43:43 test27.engr.enmotus.com kernel: Stack:
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  ffff88032024fa78 ffffffff8119ab05 000000012024fb08 ffff8804680cf980
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  0000000000000017 00000000e36a2c6d 000000102024fae8 ffff88006a891b00
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  ffff880333920000 ffff88006a891b00 0000000000002000 ffffea000cd38080
Feb 10 09:43:43 test27.engr.enmotus.com kernel: Call Trace:
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff8119ab05>] ? mempool_alloc_slab+0x15/0x20
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff8135b2d8>] generic_make_request+0xf8/0x150
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff8135b3a8>] submit_bio+0x78/0x190
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa06081b5>] en_block_do_io+0x155/0x330 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa06083cf>] en_block_write+0x1f/0x30 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa065072e>] en_write_sectors+0x5e/0xe0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa06509ca>] en_write_metadata+0x5a/0xd0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa064b8d7>] en_write_ddf_pdrecord+0x77/0xf0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa06400a0>] en_ddf_update_pdr+0xd0/0x190 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa0645e9c>] en_ddf_add_pdisk+0x1dc/0x2e0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa06318f0>] en_init_pdisk_meta+0x50/0x110 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa0632d6e>] en_process_init+0x15e/0x2c0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa063314f>] en_cmd_init+0xdf/0x2c0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa0633c75>] en_cmdline_parse+0x275/0x4c0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff8131b657>] ? inode_has_perm.isra.30+0x27/0x40
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa0611c23>] en_be_log_write+0x293/0x2c0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa0633f87>] en_cmd_user_process+0xc7/0x140 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffffa0608fe6>] en_user_write+0x76/0xa0 [en_tier]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff81214d07>] vfs_write+0xb7/0x1f0
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff81215825>] SyS_write+0x55/0xd0
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  [<ffffffff8175d2e9>] system_call_fastpath+0x12/0x17
Feb 10 09:43:43 test27.engr.enmotus.com kernel: Code: 8b 54 19 08 41 0f 46 c4 89 d6 44 29 ee 39 f0 0f 47 c6 41 01 c5 29 c3 41 29 c4 44 39 ea 75 cc 41 83 c7 01 45 31 ed eb c3 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 be fb ff ff ff e9 79 ff ff ff 80 3d 0b
Feb 10 09:43:43 test27.engr.enmotus.com kernel: RIP  [<ffffffffa04105f0>] pmem_make_request+0x160/0x1c0 [pmem]
Feb 10 09:43:43 test27.engr.enmotus.com kernel:  RSP <ffff88032024fa68>
Feb 10 09:43:43 test27.engr.enmotus.com kernel: ---[ end trace 73f040217b1381f1 ]---

Comments

Boaz Harrosh Feb. 12, 2015, 12:47 p.m. UTC | #1
On 02/12/2015 12:15 AM, Roger C. Pao (Enmotus) wrote:
> [rcpao@test27 Downloads]$ cat prd-20150210a-blk_queue_max_segment_size.patch 
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> index 1bd9ab0..0b3c8d6 100644
> --- a/drivers/block/pmem.c
> +++ b/drivers/block/pmem.c
> @@ -297,6 +299,7 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
>  	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
>  	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
>  	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
> +	blk_queue_max_segment_size(pmem->pmem_queue, PAGE_SIZE);
>  
>  	disk = alloc_disk(0);
>  	if (unlikely(!disk)) {
> [rcpao@test27 Downloads]$ 
> 
> I'm just the messenger.  Patch provided by 
> From: Hugh Daschbach <hugh.daschbach@enmotus.com>
> Date: Tue, 10 Feb 2015 20:27:08 -0800 
> 
> --
> 
> [stack trace snipped]
> 

Hi Roger, Hugh

Thanks, I will investigate this, and submit to my tree.

When submitting patches, the stack trace needs to go in the patch description, not after the code diff,
so it can be recorded in git and is available for review. I will fix this and submit.

That said, this is a very strange bug to me, because until today all I knew about
blk_queue_max_segment_size is what the doc comment of this function says:

/**
 * blk_queue_max_segment_size - set max segment size for blk_rq_map_sg
 * @q:  the request queue for the device
 * @max_size:  max size of segment in bytes
 *
 * Description:
 *    Enables a low level driver to set an upper limit on the size of a
 *    coalesced segment
 **/

What I knew is that it only affects the processing of blk_rq_map_sg(), in the
stage where we translate a bio to an sg_list, where the hardware part of the
sg_list is allowed to coalesce multiple pages if they are contiguous in pfn.
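The merging rule Boaz paraphrases can be modeled in a short userspace sketch. This is hypothetical illustration, not the kernel's blk_rq_map_sg() code: each "bvec" here is exactly one page, and pfn-contiguous bvecs are coalesced into a segment only while the segment stays within max_segment_size.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Each "bvec" in this model is one page: a page frame number plus a
 * length in bytes. */
struct bv { unsigned long pfn; unsigned long len; };

/* Merge pfn-contiguous single-page bvecs into segments, never letting a
 * segment grow past max_seg.  Returns the segment count and writes each
 * segment's size into seg_len[]. */
static size_t coalesce(const struct bv *v, size_t n,
                       unsigned long max_seg, unsigned long *seg_len)
{
    size_t nsegs = 0;

    for (size_t i = 0; i < n; i++) {
        int contiguous = nsegs > 0 && v[i].pfn == v[i - 1].pfn + 1;

        if (contiguous && seg_len[nsegs - 1] + v[i].len <= max_seg)
            seg_len[nsegs - 1] += v[i].len;   /* coalesce into last segment */
        else
            seg_len[nsegs++] = v[i].len;      /* start a new segment */
    }
    return nsegs;
}
```

With max_seg equal to PAGE_SIZE, four contiguous pages stay four separate segments; with max_seg of two pages they collapse into two 8 KB segments, which is exactly the kind of multi-page segment the driver then sees.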

But I had the notion that a bio always points to a single page. The biovec
has a page pointer, not an address. So how can it be that a page pointer
points to a page that is bigger than PAGE_SIZE?

If this patch actually fixes your problem I would like to learn where in the
bio stack it affects processing. So I will continue my investigation.

Meanwhile I guess it can never hurt, so I will submit it regardless.

Thanks
Boaz
Roger C. Pao (Enmotus) Feb. 13, 2015, 1:55 a.m. UTC | #2
On 02/12/2015 04:47 AM, Boaz Harrosh wrote:
 > Hi Roger, Hugh
 >
 > Thanks, I will investigate this, and submit to my tree.
 >
 > [...]
 >
 > If this patch actually fixes your problem I would like to learn where in the
 > bio stack it affects processing. So I will continue my investigation.
 >
 > Meanwhile I guess it can never hurt, so I will submit it regardless

On February 12, 2015 4:25:26 PM PST, Hugh Daschbach
<hugh.daschbach@enmotus.com> wrote:

    The biovec can present a size greater than PAGE_SIZE if an I/O buffer
contains physically contiguous pages.  This may be unusual for user space
pages, as the virtual to physical memory map gets fragmented.  But for
buffers allocated by the kernel with kmalloc, physical memory will be
contiguous.

    Even if a single I/O request does not contain two contiguous pages,
the block layer may merge two requests that have contiguous pages.  It
will then attempt to coalesce biovecs.  You probably won't see that if
you avoid the I/O scheduler by capturing requests at make_request.  But
it is still a good idea to declare your device's segment size limitation
with blk_queue_max_segment_size.  There are a couple of drivers in
drivers/block/ that do just that to limit segments to PAGE_SIZE.

    We caught this because the Enmotus tiering engine issues rather
large I/O requests to buffers that were allocated with kmalloc.  It is
fairly common for the tiering engine to allocate I/O buffers of 64KB or
greater.  If the underlying block device supports it, we will submit a
bio with a biovec mapping many contiguous pages.  The entire buffer will
possibly be mapped by a single biovec.  The tiering engine uses
max_segment_size to determine how to build its biovec list.

    Thanks for considering the change.

    Hugh
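The submitter-side behavior Hugh describes, building biovec entries no larger than the device's reported max_segment_size, can be sketched as a hypothetical userspace model (the function name is illustrative; this is not the Enmotus code):

```c
#include <assert.h>
#include <stddef.h>

/* split_buffer() models the submitter's side: one physically contiguous
 * buffer of buf_len bytes is cut into biovec-style entries of at most
 * max_seg bytes each.  Entry lengths go to out[]; the return value is
 * the entry count. */
static size_t split_buffer(size_t buf_len, size_t max_seg, size_t *out)
{
    size_t n = 0;

    for (size_t off = 0; off < buf_len; off += max_seg) {
        size_t remain = buf_len - off;
        out[n++] = remain < max_seg ? remain : max_seg;  /* last entry may be short */
    }
    return n;
}
```

A 64 KB kmalloc'd buffer against a device reporting a 64 KB max_segment_size becomes a single entry covering sixteen contiguous pages; against a PAGE_SIZE limit it becomes sixteen 4 KB entries, which is why declaring the limit in the driver matters.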
Dan Williams Feb. 13, 2015, 3:22 a.m. UTC | #3
> On February 12, 2015 4:25:26 PM PST, Hugh Daschbach
> <hugh.daschbach@enmotus.com> wrote:
>
>     The biovec can present a size greater than PAGE_SIZE if an I/O buffer
> contains physically contiguous pages.  This may be unusual for user space
> pages, as the virtual to physical memory map gets fragmented.  But for
> buffers allocated by the kernel with kmalloc, physical memory will be
> contiguous.
>
>     [...]
>

brd has the same bug, and pmem should be able to handle arbitrarily
sized contiguous ranges.
Ross Zwisler Feb. 13, 2015, 5:40 p.m. UTC | #4
On Thu, 2015-02-12 at 19:22 -0800, Dan Williams wrote:
> > On February 12, 2015 4:25:26 PM PST, Hugh Daschbach
> > <hugh.daschbach@enmotus.com> wrote:
> >
> >     [...]
> >
> 
> brd has the same bug, and pmem should be able to handle arbitrarily
> sized contiguous ranges.

Yep, agreed.  I think your patch is correct for BRD, since it really
does keep things arranged on a per-page basis, but I think the right fix
for PMEM is to allow it to handle arbitrarily sized bvecs.  Like Boaz, I
didn't realize that the struct page in the bio_vec could point to a size
larger than a page - I saw that BRD was making that assumption, and I
coded up the I/O path in PMEM to enforce the assumption.

As BRD is coded right now it'll just silently ignore all the data beyond
PAGE_SIZE.

I'll code this up when I get time, unless someone else gets to it
sooner.

- Ross
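The fix direction Ross describes, having pmem accept arbitrarily sized bvecs instead of enforcing a one-page assumption, amounts to walking the segment in chunks. A hypothetical userspace sketch (names illustrative; not the actual pmem patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Instead of assuming a segment is at most PAGE_SIZE (and tripping a
 * BUG_ON when it is not), walk the segment in page-sized chunks and
 * copy all of it. */
static void copy_segment(unsigned char *dst, size_t dst_off,
                         const unsigned char *seg, size_t seg_len)
{
    while (seg_len > 0) {
        size_t chunk = seg_len < PAGE_SIZE ? seg_len : PAGE_SIZE;

        memcpy(dst + dst_off, seg, chunk);  /* stand-in for the per-page copy */
        dst_off += chunk;
        seg += chunk;
        seg_len -= chunk;
    }
}
```

Under this scheme a 12 KB-plus bvec from a kmalloc'd buffer is copied in full, where the page-per-bvec assumption would either BUG (pmem) or silently drop everything past the first page (BRD, as noted above).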
Boaz Harrosh Feb. 15, 2015, 9:41 a.m. UTC | #5
On 02/13/2015 03:50 AM, Roger C. Pao (Enmotus) wrote:
> 
> 
<>
> <hugh.daschbach@enmotus.com> wrote:
> 
> [quoted message snipped]
>

Patch
diff mbox

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 1bd9ab0..0b3c8d6 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -297,6 +299,7 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	blk_queue_max_segment_size(pmem->pmem_queue, PAGE_SIZE);

 	disk = alloc_disk(0);