[1/2] xfs: fix up xfs_swap_extent_forks inline extent handling
diff mbox

Message ID 893a260d-f511-a557-a248-d96a4de4dc8b@sandeen.net
State Accepted
Headers show

Commit Message

Eric Sandeen Nov. 5, 2016, 2:24 a.m. UTC
There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.

Thanks to dchinner for looking at this with me and spotting the
root cause.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---


--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Brian Foster Nov. 7, 2016, 2:57 p.m. UTC | #1
On Fri, Nov 04, 2016 at 09:24:16PM -0500, Eric Sandeen wrote:
> There have been several reports over the years of NULL pointer
> dereferences in xfs_trans_log_inode during xfs_fsr processes,
> when the process is doing an fput and tearing down extents
> on the temporary inode, something like:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
>     [exception RIP: xfs_trans_log_inode+0x10]
>  #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
> #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
> #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
> #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
> #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
> #14 [ffff8800a57bbe00] evict at ffffffff811e1b67
> #15 [ffff8800a57bbe28] iput at ffffffff811e23a5
> #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
> #17 [ffff8800a57bbe88] dput at ffffffff811dd06c
> #18 [ffff8800a57bbea8] __fput at ffffffff811c823b
> #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
> #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
> #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
> #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
> 
> As it turns out, this is because the i_itemp pointer, along
> with the d_ops pointer, has been overwritten with zeros
> when we tear down the extents during truncate.  When the in-core
> inode fork on the temporary inode used by xfs_fsr was originally
> set up during the extent swap, we mistakenly looked at di_nextents
> to determine whether all extents fit inline, but this misses extents
> generated by speculative preallocation; we should be using if_bytes
> instead.
> 

Neat bug. :P The fix looks fine, but technically di_nextents doesn't
include any delalloc extents, right? From taking a quick skim through
the test case, it looks like the problematic situation is when the file
flush doesn't end up converting post-eof delalloc (created via
speculative prealloc) due to free space fragmentation.

So in other words, in most cases a file flush will convert all delalloc,
including post-eof preallocation, by virtue of being part of an extent
that is partly within file eof. In the fragmented free space situation,
however, the file flush is not necessarily guaranteed to convert all
delalloc blocks past eof, thus di_nextents is not consistent with the
in-core inode state, and badness ensues...

If I'm following that correctly it would be nice to just clarify that a
bit in the commit log. Otherwise looks good to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> This mistake corrupts the in-memory inode, and code in
> xfs_iext_remove_inline eventually gets bad inputs, causing
> it to memmove and memset incorrect ranges; this became apparent
> because the two values in ifp->if_u2.if_inline_ext[1] contained
> what should have been in d_ops and i_itemp; they were memmoved due
> to incorrect array indexing and then the original locations
> were zeroed with memset, again due to an array overrun.
> 
> Fix this by properly using i_df.if_bytes to determine the number
> of extents, not di_nextents.
> 
> Thanks to dchinner for looking at this with me and spotting the
> root cause.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 552465e..47074e0 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1792,6 +1792,7 @@
>  	struct xfs_ifork	tempifp, *ifp, *tifp;
>  	int			aforkblks = 0;
>  	int			taforkblks = 0;
> +	xfs_extnum_t		nextents;
>  	__uint64_t		tmp;
>  	int			error;
>  
> @@ -1881,7 +1882,8 @@
>  		 * pointer.  Otherwise it's already NULL or
>  		 * pointing to the extent.
>  		 */
> -		if (ip->i_d.di_nextents <= XFS_INLINE_EXTS) {
> +		nextents = ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
> +		if (nextents <= XFS_INLINE_EXTS) {
>  			ifp->if_u1.if_extents =
>  				ifp->if_u2.if_inline_ext;
>  		}
> @@ -1900,7 +1902,8 @@
>  		 * pointer.  Otherwise it's already NULL or
>  		 * pointing to the extent.
>  		 */
> -		if (tip->i_d.di_nextents <= XFS_INLINE_EXTS) {
> +		nextents = tip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
> +		if (nextents <= XFS_INLINE_EXTS) {
>  			tifp->if_u1.if_extents =
>  				tifp->if_u2.if_inline_ext;
>  		}
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Nov. 7, 2016, 4:38 p.m. UTC | #2
On 11/7/16 8:57 AM, Brian Foster wrote:
> On Fri, Nov 04, 2016 at 09:24:16PM -0500, Eric Sandeen wrote:
>> There have been several reports over the years of NULL pointer
>> dereferences in xfs_trans_log_inode during xfs_fsr processes,
>> when the process is doing an fput and tearing down extents
>> on the temporary inode, something like:
>>
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>> PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
>>     [exception RIP: xfs_trans_log_inode+0x10]
>>  #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
>> #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
>> #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
>> #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
>> #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
>> #14 [ffff8800a57bbe00] evict at ffffffff811e1b67
>> #15 [ffff8800a57bbe28] iput at ffffffff811e23a5
>> #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
>> #17 [ffff8800a57bbe88] dput at ffffffff811dd06c
>> #18 [ffff8800a57bbea8] __fput at ffffffff811c823b
>> #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
>> #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
>> #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
>> #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
>>
>> As it turns out, this is because the i_itemp pointer, along
>> with the d_ops pointer, has been overwritten with zeros
>> when we tear down the extents during truncate.  When the in-core
>> inode fork on the temporary inode used by xfs_fsr was originally
>> set up during the extent swap, we mistakenly looked at di_nextents
>> to determine whether all extents fit inline, but this misses extents
>> generated by speculative preallocation; we should be using if_bytes
>> instead.
>>
> 
> Neat bug. :P The fix looks fine, but technically di_nextents doesn't
> include any delalloc extents, right? From taking a quick skim through
> the test case, it looks like the problematic situation is when the file
> flush doesn't end up converting post-eof delalloc (created via
> speculative prealloc) due to free space fragmentation.
> 
> So in other words, in most cases a file flush will convert all delalloc,
> including post-eof preallocation, by virtue of being part of an extent
> that is partly within file eof. In the fragmented free space situation,
> however, the file flush is not necessarily guaranteed to convert all
> delalloc blocks past eof, thus di_nextents is not consistent with the
> in-core inode state, and badness ensues...

To be honest I hadn't thought it through that far.  Could modify the
testcase to free up more contiguous space prior to the last flush, perhaps,
and see what happens.

-Eric

> If I'm following that correctly it would be nice to just clarify that a
> bit in the commit log. Otherwise looks good to me:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
>> This mistake corrupts the in-memory inode, and code in
>> xfs_iext_remove_inline eventually gets bad inputs, causing
>> it to memmove and memset incorrect ranges; this became apparent
>> because the two values in ifp->if_u2.if_inline_ext[1] contained
>> what should have been in d_ops and i_itemp; they were memmoved due
>> to incorrect array indexing and then the original locations
>> were zeroed with memset, again due to an array overrun.
>>
>> Fix this by properly using i_df.if_bytes to determine the number
>> of extents, not di_nextents.
>>
>> Thanks to dchinner for looking at this with me and spotting the
>> root cause.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> ---
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 552465e..47074e0 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1792,6 +1792,7 @@
>>  	struct xfs_ifork	tempifp, *ifp, *tifp;
>>  	int			aforkblks = 0;
>>  	int			taforkblks = 0;
>> +	xfs_extnum_t		nextents;
>>  	__uint64_t		tmp;
>>  	int			error;
>>  
>> @@ -1881,7 +1882,8 @@
>>  		 * pointer.  Otherwise it's already NULL or
>>  		 * pointing to the extent.
>>  		 */
>> -		if (ip->i_d.di_nextents <= XFS_INLINE_EXTS) {
>> +		nextents = ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
>> +		if (nextents <= XFS_INLINE_EXTS) {
>>  			ifp->if_u1.if_extents =
>>  				ifp->if_u2.if_inline_ext;
>>  		}
>> @@ -1900,7 +1902,8 @@
>>  		 * pointer.  Otherwise it's already NULL or
>>  		 * pointing to the extent.
>>  		 */
>> -		if (tip->i_d.di_nextents <= XFS_INLINE_EXTS) {
>> +		nextents = tip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
>> +		if (nextents <= XFS_INLINE_EXTS) {
>>  			tifp->if_u1.if_extents =
>>  				tifp->if_u2.if_inline_ext;
>>  		}
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 552465e..47074e0 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1792,6 +1792,7 @@ 
 	struct xfs_ifork	tempifp, *ifp, *tifp;
 	int			aforkblks = 0;
 	int			taforkblks = 0;
+	xfs_extnum_t		nextents;
 	__uint64_t		tmp;
 	int			error;
 
@@ -1881,7 +1882,8 @@ 
 		 * pointer.  Otherwise it's already NULL or
 		 * pointing to the extent.
 		 */
-		if (ip->i_d.di_nextents <= XFS_INLINE_EXTS) {
+		nextents = ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
+		if (nextents <= XFS_INLINE_EXTS) {
 			ifp->if_u1.if_extents =
 				ifp->if_u2.if_inline_ext;
 		}
@@ -1900,7 +1902,8 @@ 
 		 * pointer.  Otherwise it's already NULL or
 		 * pointing to the extent.
 		 */
-		if (tip->i_d.di_nextents <= XFS_INLINE_EXTS) {
+		nextents = tip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
+		if (nextents <= XFS_INLINE_EXTS) {
 			tifp->if_u1.if_extents =
 				tifp->if_u2.if_inline_ext;
 		}