
bio linked list corruption.

Message ID 488f9edc-6a1c-2c68-0d33-d3aa32ece9a4@fb.com (mailing list archive)
State New, archived

Commit Message

Chris Mason Oct. 26, 2016, 8 p.m. UTC
On 10/26/2016 03:06 PM, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <davej@codemonkey.org.uk> wrote:
>>
>> The stacks show nearly all of them are stuck in sync_inodes_sb
> 
> That's just wb_wait_for_completion(), and it means that some IO isn't
> completing.
> 
> There's also a lot of processes waiting for inode_lock(), and a few
> waiting for mnt_want_write()
> 
> Ignoring those, we have
> 
>> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
>> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
>> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
>> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
>> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
>> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Don't know this one. There's a couple of them. Could there be some
> ABBA deadlock on the ordered roots waiting?

It's always possible, but we haven't changed anything here.

I've tried a long list of things to reproduce this on my test boxes,
including days of trinity runs and a kernel module to exercise vmalloc
and thread creation.

Today I turned off every CONFIG_DEBUG_* except for list debugging, and
ran dbench 2048:

[ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
[ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
[ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
[ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
[ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 2759.125077]  ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
[ 2759.125077]  ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
[ 2759.127444]  ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
[ 2759.127444] Call Trace:
[ 2759.127444]  [<ffffffff814fe4ff>] dump_stack+0x53/0x74
[ 2759.127444]  [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
[ 2759.127444]  [<ffffffff81064bbf>] __warn+0xff/0x120
[ 2759.127444]  [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
[ 2759.127444]  [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
[ 2759.127444]  [<ffffffff814df338>] blk_sq_make_request+0x388/0x580
[ 2759.127444]  [<ffffffff814d5b44>] generic_make_request+0x104/0x200
[ 2759.127444]  [<ffffffff814d5ca5>] submit_bio+0x65/0x130
[ 2759.127444]  [<ffffffff8152a946>] ? __percpu_counter_add+0x96/0xd0
[ 2759.127444]  [<ffffffff814260bc>] btrfs_map_bio+0x23c/0x310
[ 2759.127444]  [<ffffffff813f42b3>] btrfs_submit_bio_hook+0xd3/0x190
[ 2759.127444]  [<ffffffff814117ad>] submit_one_bio+0x6d/0xa0
[ 2759.127444]  [<ffffffff8141182e>] flush_epd_write_bio+0x4e/0x70
[ 2759.127444]  [<ffffffff81418d8d>] extent_writepages+0x5d/0x70
[ 2759.127444]  [<ffffffff813f84e0>] ? btrfs_releasepage+0x50/0x50
[ 2759.127444]  [<ffffffff81220ffe>] ? wbc_attach_and_unlock_inode+0x6e/0x170
[ 2759.127444]  [<ffffffff813f5047>] btrfs_writepages+0x27/0x30
[ 2759.127444]  [<ffffffff81178690>] do_writepages+0x20/0x30
[ 2759.127444]  [<ffffffff81167d85>] __filemap_fdatawrite_range+0xb5/0x100
[ 2759.127444]  [<ffffffff81168263>] filemap_fdatawrite_range+0x13/0x20
[ 2759.127444]  [<ffffffff81405a7b>] btrfs_fdatawrite_range+0x2b/0x70
[ 2759.127444]  [<ffffffff81405ba8>] btrfs_sync_file+0x88/0x490
[ 2759.127444]  [<ffffffff810751e2>] ? group_send_sig_info+0x42/0x80
[ 2759.127444]  [<ffffffff8107527d>] ? kill_pid_info+0x5d/0x90
[ 2759.127444]  [<ffffffff8107564a>] ? SYSC_kill+0xba/0x1d0
[ 2759.127444]  [<ffffffff811f2638>] ? __sb_end_write+0x58/0x80
[ 2759.127444]  [<ffffffff81225b9c>] vfs_fsync_range+0x4c/0xb0
[ 2759.127444]  [<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
[ 2759.127444]  [<ffffffff81225c1c>] vfs_fsync+0x1c/0x20
[ 2759.127444]  [<ffffffff81225c5d>] do_fsync+0x3d/0x70
[ 2759.127444]  [<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
[ 2759.127444]  [<ffffffff81225cc0>] SyS_fsync+0x10/0x20
[ 2759.127444]  [<ffffffff81002b65>] do_syscall_64+0x55/0xd0
[ 2759.127444]  [<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
[ 2759.127444]  [<ffffffff819ad986>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2759.150635] ---[ end trace 3b5b7e2ef61c3d02 ]---

I put a variant of your suggested patch in place, but my printk never
triggered.  Now that I've made it happen once, I'll make sure I can do it
over and over again.  This doesn't have the patches that Andy asked Davej to
try out yet, but I'll try them once I have a reliable reproducer.


Comments

Chris Mason Oct. 26, 2016, 9:52 p.m. UTC | #1
On 10/26/2016 04:00 PM, Chris Mason wrote:
> 
> 
> On 10/26/2016 03:06 PM, Linus Torvalds wrote:
>> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones <davej@codemonkey.org.uk> wrote:
>>>
>>> The stacks show nearly all of them are stuck in sync_inodes_sb
>>
>> That's just wb_wait_for_completion(), and it means that some IO isn't
>> completing.
>>
>> There's also a lot of processes waiting for inode_lock(), and a few
>> waiting for mnt_want_write()
>>
>> Ignoring those, we have
>>
>>> [<ffffffffa009554f>] btrfs_wait_ordered_roots+0x3f/0x200 [btrfs]
>>> [<ffffffffa00470d1>] btrfs_sync_fs+0x31/0xc0 [btrfs]
>>> [<ffffffff811fbd4e>] sync_filesystem+0x6e/0xa0
>>> [<ffffffff811fbebc>] SyS_syncfs+0x3c/0x70
>>> [<ffffffff8100255c>] do_syscall_64+0x5c/0x170
>>> [<ffffffff817908cb>] entry_SYSCALL64_slow_path+0x25/0x25
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Don't know this one. There's a couple of them. Could there be some
>> ABBA deadlock on the ordered roots waiting?
> 
> It's always possible, but we haven't changed anything here.
> 
> I've tried a long list of things to reproduce this on my test boxes,
> including days of trinity runs and a kernel module to exercise vmalloc
> and thread creation.
> 
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
> 

This one is special because CONFIG_VMAP_STACK is not set.  Btrfs triggers in < 10 minutes.
I've done 30 minutes each with XFS and Ext4 without luck.

This is all in a virtual machine that I can copy onto a bunch of hosts.  So I'll get some
parallel tests going tonight to narrow it down.

------------[ cut here ]------------
WARNING: CPU: 6 PID: 4481 at lib/list_debug.c:33 __list_add+0xbe/0xd0
list_add corruption. prev->next should be next (ffffe8ffffd80b08), but was ffff88012b65fb88. (prev=ffff880128c8d500).
Modules linked in: crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper i2c_piix4 cryptd i2c_core virtio_net serio_raw floppy button pcspkr sch_fq_codel autofs4 virtio_blk
CPU: 6 PID: 4481 Comm: dbench Not tainted 4.9.0-rc2-15419-g811d54d #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
 ffff880104eff868 ffffffff814fde0f ffffffff8151c46e ffff880104eff8c8
 ffff880104eff8c8 0000000000000000 ffff880104eff8b8 ffffffff810648cf
 ffff880128cab2c0 000000213fc57c68 ffff8801384e8928 ffff880128cab180
Call Trace:
 [<ffffffff814fde0f>] dump_stack+0x53/0x74
 [<ffffffff8151c46e>] ? __list_add+0xbe/0xd0
 [<ffffffff810648cf>] __warn+0xff/0x120
 [<ffffffff810649a9>] warn_slowpath_fmt+0x49/0x50
 [<ffffffff8151c46e>] __list_add+0xbe/0xd0
 [<ffffffff814dec38>] blk_sq_make_request+0x388/0x580
 [<ffffffff814d5444>] generic_make_request+0x104/0x200
 [<ffffffff814d55a5>] submit_bio+0x65/0x130
 [<ffffffff8152a256>] ? __percpu_counter_add+0x96/0xd0
 [<ffffffff81425c7c>] btrfs_map_bio+0x23c/0x310
 [<ffffffff813f3e73>] btrfs_submit_bio_hook+0xd3/0x190
 [<ffffffff8141136d>] submit_one_bio+0x6d/0xa0
 [<ffffffff814113ee>] flush_epd_write_bio+0x4e/0x70
 [<ffffffff8141894d>] extent_writepages+0x5d/0x70
 [<ffffffff813f80a0>] ? btrfs_releasepage+0x50/0x50
 [<ffffffff81220d0e>] ? wbc_attach_and_unlock_inode+0x6e/0x170
 [<ffffffff813f4c07>] btrfs_writepages+0x27/0x30
 [<ffffffff811783a0>] do_writepages+0x20/0x30
 [<ffffffff81167a95>] __filemap_fdatawrite_range+0xb5/0x100
 [<ffffffff81167f73>] filemap_fdatawrite_range+0x13/0x20
 [<ffffffff8140563b>] btrfs_fdatawrite_range+0x2b/0x70
 [<ffffffff81405768>] btrfs_sync_file+0x88/0x490
 [<ffffffff81074ef2>] ? group_send_sig_info+0x42/0x80
 [<ffffffff81074f8d>] ? kill_pid_info+0x5d/0x90
 [<ffffffff8107535a>] ? SYSC_kill+0xba/0x1d0
 [<ffffffff811f2348>] ? __sb_end_write+0x58/0x80
 [<ffffffff812258ac>] vfs_fsync_range+0x4c/0xb0
 [<ffffffff81002501>] ? syscall_trace_enter+0x201/0x2e0
 [<ffffffff8122592c>] vfs_fsync+0x1c/0x20
 [<ffffffff8122596d>] do_fsync+0x3d/0x70
 [<ffffffff810029cb>] ? syscall_slow_exit_work+0xfb/0x100
 [<ffffffff812259d0>] SyS_fsync+0x10/0x20
 [<ffffffff81002b65>] do_syscall_64+0x55/0xd0
 [<ffffffff810026d7>] ? prepare_exit_to_usermode+0x37/0x40
 [<ffffffff819ad246>] entry_SYSCALL64_slow_path+0x25/0x25
---[ end trace efe6b17c6dba2a6e ]---
Linus Torvalds Oct. 26, 2016, 10:07 p.m. UTC | #2
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <clm@fb.com> wrote:
>
> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
> ran dbench 2048:
>
> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
> [ 2759.125077]  ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
> [ 2759.125077]  ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
> [ 2759.127444]  ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
> [ 2759.127444] Call Trace:
> [ 2759.127444]  [<ffffffff814fe4ff>] dump_stack+0x53/0x74
> [ 2759.127444]  [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
> [ 2759.127444]  [<ffffffff81064bbf>] __warn+0xff/0x120
> [ 2759.127444]  [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
> [ 2759.127444]  [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
> [ 2759.127444]  [<ffffffff814df338>] blk_sq_make_request+0x388/0x580

Ok, that's definitely the same one that Dave started out seeing.

The fact that it is that reliable - two different machines, two very
different loads (dbench looks nothing like trinity) - really makes me
think that maybe the problem really is in the block plugging after
all.

It very much does not smell like random stack corruption. It's simply
not random enough.

And I just noticed something: I originally thought that this is the
"list_add_tail()" to the plug - which is the only "list_add()" variant
in that function.

But that never made sense, because the whole "but was" isn't a stack
address, and "next" in "list_add_tail()" is basically fixed, and would
have to be the stack.

But I now notice that there's actually another "list_add()" variant
there, and it's the one from __blk_mq_insert_request() that gets
inlined into blk_mq_insert_request(), which then gets inlined into
blk_mq_make_request().

And that actually makes some sense just looking at the offsets too:

     blk_sq_make_request+0x388/0x580

so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.

And there it makes perfect sense that the "next should be" value is
*not* on the stack.

Chris, if you built with debug info, you can try

    ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388

to get what line that blk_sq_make_request+0x388 address actually is. I
think it's the

                list_add_tail(&rq->queuelist, &ctx->rq_list);

in __blk_mq_insert_req_list() (when it's inlined from
blk_sq_make_request(), "at_head" will be false).

So it smells like "&ctx->rq_list" might be corrupt.
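
For reference, that helper looks roughly like this (a from-memory sketch,
so treat the exact signature and surroundings as approximate):

        static void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
                                             struct request *rq, bool at_head)
        {
                /* note: the ctx used here comes from the request, not the caller */
                struct blk_mq_ctx *ctx = rq->mq_ctx;

                if (at_head)
                        list_add(&rq->queuelist, &ctx->rq_list);
                else
                        /* the call inlined from blk_sq_make_request() takes this
                         * branch, since at_head is false there */
                        list_add_tail(&rq->queuelist, &ctx->rq_list);
        }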

And that actually seems much more likely than the "plug" list, because
while the plug is entirely thread-local (and thus shouldn't have any
races), the ctx->rq_list very much is not.

Jens?

For example, should we have a

    BUG_ON(ctx != rq->mq_ctx);

in blk_mq_merge_queue_io()? Because it locks ctx->lock, but then
__blk_mq_insert_request() will insert things onto the queues of
rq->mq_ctx.

blk_mq_insert_requests() has similar issues, but there has that BUG_ON().

The locking there really is *very* messy. All the lockers do

        spin_lock(&ctx->lock);
        ...
        spin_unlock(&ctx->lock);

but then __blk_mq_insert_request() and __blk_mq_insert_req_list don't
act on "ctx", but on "ctx = rq->mq_ctx", so if you ever get those
wrong, you're completely dead.
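
Concretely, the sort of check I mean would sit right next to the insert,
something like this (just a sketch of the idea, not a tested patch):

        spin_lock(&ctx->lock);
        /* catch a request being queued under the wrong ctx lock */
        BUG_ON(ctx != rq->mq_ctx);
        __blk_mq_insert_request(hctx, rq, false);  /* acts on rq->mq_ctx internally */
        spin_unlock(&ctx->lock);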

Now, I'm not seeing why they'd be wrong, and why they'd be associated
with the VMAP_STACK thing, but it could just be an unlucky timing
thing.

           Linus
Chris Mason Oct. 26, 2016, 10:54 p.m. UTC | #3
On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote:
>On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason <clm@fb.com> wrote:
>>
>> Today I turned off every CONFIG_DEBUG_* except for list debugging, and
>> ran dbench 2048:
>>
>> [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0
>> [ 2759.119652] list_add corruption. prev->next should be next (ffffe8ffffc80308), but was ffffc90000ccfb88. (prev=ffff880128522380).
>> [ 2759.121039] Modules linked in: crc32c_intel i2c_piix4 aesni_intel aes_x86_64 virtio_net glue_helper i2c_core lrw floppy gf128mul serio_raw pcspkr button ablk_helper cryptd sch_fq_codel autofs4 virtio_blk
>> [ 2759.124369] CPU: 2 PID: 31039 Comm: dbench Not tainted 4.9.0-rc1-15246-g4ce9206-dirty #317
>> [ 2759.125077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
>> [ 2759.125077]  ffffc9000f6fb868 ffffffff814fe4ff ffffffff8151cb5e ffffc9000f6fb8c8
>> [ 2759.125077]  ffffc9000f6fb8c8 0000000000000000 ffffc9000f6fb8b8 ffffffff81064bbf
>> [ 2759.127444]  ffff880128523680 0000002139968000 ffff880138b7a4a0 ffff880128523540
>> [ 2759.127444] Call Trace:
>> [ 2759.127444]  [<ffffffff814fe4ff>] dump_stack+0x53/0x74
>> [ 2759.127444]  [<ffffffff8151cb5e>] ? __list_add+0xbe/0xd0
>> [ 2759.127444]  [<ffffffff81064bbf>] __warn+0xff/0x120
>> [ 2759.127444]  [<ffffffff81064c99>] warn_slowpath_fmt+0x49/0x50
>> [ 2759.127444]  [<ffffffff8151cb5e>] __list_add+0xbe/0xd0
>> [ 2759.127444]  [<ffffffff814df338>] blk_sq_make_request+0x388/0x580
>
>Ok, that's definitely the same one that Dave started out seeing.
>
>The fact that it is that reliable - two different machines, two very
>different loads (dbench looks nothing like trinity) - really makes me
>think that maybe the problem really is in the block plugging after
>all.
>
>It very much does not smell like random stack corruption. It's simply
>not random enough.

Agreed.  I'd feel better if I could trigger this outside of btrfs, even 
though Dave Chinner hit something very similar on xfs.  I'll peel off 
another test machine for a long XFS run.

>
>And I just noticed something: I originally thought that this is the
>"list_add_tail()" to the plug - which is the only "list_add()" variant
>in that function.
>
>But that never made sense, because the whole "but was" isn't a stack
>address, and "next" in "list_add_tail()" is basically fixed, and would
>have to be the stack.
>
>But I now notice that there's actually another "list_add()" variant
>there, and it's the one from __blk_mq_insert_request() that gets
>inlined into blk_mq_insert_request(), which then gets inlined into
>blk_mq_make_request().
>
>And that actually makes some sense just looking at the offsets too:
>
>     blk_sq_make_request+0x388/0x580
>
>so it's somewhat at the end of blk_sq_make_request(). So it's not unlikely.
>
>And there it makes perfect sense that the "next should be" value is
>*not* on the stack.
>
>Chris, if you built with debug info, you can try
>
>    ./scripts/faddr2line /boot/vmlinux blk_sq_make_request+0x388
>
>to get what line that blk_sq_make_request+0x388 address actually is. I
>think it's the
>
>                list_add_tail(&rq->queuelist, &ctx->rq_list);
>
>in __blk_mq_insert_req_list() (when it's inlined from
>blk_sq_make_request(), "at_head" will be false).
>
>So it smells like "&ctx->rq_list" might be corrupt.

I'm running your current git here, so these line numbers should line up 
for you:

blk_sq_make_request+0x388/0x578:
__blk_mq_insert_request at block/blk-mq.c:1049
 (inlined by) blk_mq_merge_queue_io at block/blk-mq.c:1175
  (inlined by) blk_sq_make_request at block/blk-mq.c:1419

The fsync path in the WARN doesn't have any plugs that I can see, so it's
not surprising that we're not in the plugging path.  I'm here:

	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
		/*
		 * For a SYNC request, send it to the hardware immediately. For
		 * an ASYNC request, just ensure that we run it later on. The
		 * latter allows for merging opportunities and more efficient
		 * dispatching.
		 */

I'll try the debugging patch you sent in the other email.

-chris

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 623259f..de95e19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -165,7 +165,7 @@  void __weak arch_release_thread_stack(unsigned long *stack)
  * vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
  * flush.  Try to minimize the number of calls by caching stacks.
  */
-#define NR_CACHED_STACKS 2
+#define NR_CACHED_STACKS 256
 static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
 #endif
 
@@ -173,7 +173,9 @@  static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
        void *stack;
+       char *p;
        int i;
+       int j;
 
        local_irq_disable();
        for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -183,7 +185,15 @@  static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
                        continue;
                this_cpu_write(cached_stacks[i], NULL);
 
+               p = s->addr;
+               for (j = 0; j < THREAD_SIZE; j++) {
+                       if (p[j] != 'c') {
+                               printk_ratelimited(KERN_CRIT "bad poison %c byte %d\n", p[j], j);
+                               break;
+                       }
+               }
                tsk->stack_vm_area = s;
+
                local_irq_enable();
                return s->addr;
        }
@@ -219,6 +229,7 @@  static inline void free_thread_stack(struct task_struct *tsk)
                int i;
 
                local_irq_save(flags);
+               memset(tsk->stack_vm_area->addr, 'c', THREAD_SIZE);
                for (i = 0; i < NR_CACHED_STACKS; i++) {
                        if (this_cpu_read(cached_stacks[i]))
                                continue;