diff mbox

btrfs: fix invalid memory access with journal_info

Message ID 1525862125-15228-1-git-send-email-robbieko@synology.com (mailing list archive)
State New, archived
Headers show

Commit Message

robbieko May 9, 2018, 10:35 a.m. UTC
From: Robbie Ko <robbieko@synology.com>

When send process requires memory allocation, shrinker may be triggered
due to insufficient memory.
Then evict_inode gets called when inode is freed, and this function
may need to start transaction.
However, the journal_info is already points to BTRFS_SEND_TRANS_STUB, it
passed the if condition,
and the following use yields illegal memory access.

  if (current->journal_info) {
      WARN_ON(type & TRANS_EXTWRITERS);
      h = current->journal_info;
      refcount_inc(&h->use_count);
      WARN_ON(refcount_read(&h->use_count) > 2);
      h->orig_rsv = h->block_rsv;
      h->block_rsv = NULL;
      goto got_it;
  }

Direct IO has a similar problem, journal_info will store btrfs_dio_data,
which will lead to illegal memory access.

We fixed the problem by save the journal_info and restore afterwards.

CallTrace looks like this:
 BUG: unable to handle kernel NULL pointer dereference at 0000000000000021
 IP: [<ffffffffa086f2d4>] start_transaction+0x64/0x450 [btrfs]
 PGD 8fea4b067 PUD a33bea067 PMD 0
 Oops: 0000 [#1] SMP
 CPU: 3 PID: 12681 Comm: btrfs Tainted: P C O 3.10.102 #15266
 RIP: 0010:[<ffffffffa086f2d4>] start_transaction+0x64/0x450 [btrfs]
 Call Trace:
 [<ffffffffa087a838>] ? btrfs_evict_inode+0x3d8/0x580 [btrfs]
 [<ffffffff81115932>] ? evict+0xa2/0x1a0
 [<ffffffff81112888>] ? shrink_dentry_list+0x308/0x3d0
 [<ffffffff811137f3>] ? prune_dcache_sb+0x133/0x160
 [<ffffffff810fa51f>] ? prune_super+0xcf/0x1a0
 [<ffffffff810bf6bf>] ? shrink_slab+0x11f/0x1d0
 [<ffffffff810c19f2>] ? do_try_to_free_pages+0x452/0x560
 [<ffffffff810bf054>] ? throttle_direct_reclaim+0x74/0x240
 [<ffffffff810c1bae>] ? try_to_free_pages+0xae/0xc0
 [<ffffffff810ba16b>] ? __alloc_pages_nodemask+0x53b/0x9f0
 [<ffffffff810bc89c>] ? __do_page_cache_readahead+0xec/0x270
 [<ffffffff810bcb2b>] ? ondemand_readahead+0xbb/0x220
 [<ffffffffa08d7c43>] ? fill_read_buf+0x2b3/0x3a0 [btrfs]
 [<ffffffffa08dbf5e>] ? send_extent_data+0x10e/0x300 [btrfs]
 [<ffffffffa08dc34b>] ? process_extent+0x1fb/0x1310 [btrfs]
 [<ffffffffa08d8300>] ? iterate_dir_item.isra.28+0x1b0/0x250 [btrfs]
 [<ffffffffa08dd500>] ? send_set_xattr+0xa0/0xa0 [btrfs]
 [<ffffffffa08de565>] ? changed_cb+0xd5/0xc40 [btrfs]
 [<ffffffffa08df1c2>] ? full_send_tree+0xf2/0x1a0 [btrfs]
 [<ffffffffa08e022b>] ? btrfs_ioctl_send+0xfbb/0x1040 [btrfs]
 [<ffffffffa08a9864>] ? btrfs_ioctl+0x1084/0x32a0 [btrfs]

Signed-off-by: Robbie Ko <robbieko@synology.com>
---
 fs/btrfs/inode.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Omar Sandoval May 10, 2018, 4:53 a.m. UTC | #1
On Wed, May 09, 2018 at 06:35:25PM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> When send process requires memory allocation, shrinker may be triggered
> due to insufficient memory.
> Then evict_inode gets called when inode is freed, and this function
> may need to start transaction.
> However, the journal_info is already points to BTRFS_SEND_TRANS_STUB, it
> passed the if condition,
> and the following use yields illegal memory access.
> 
>   if (current->journal_info) {
>       WARN_ON(type & TRANS_EXTWRITERS);
>       h = current->journal_info;
>       refcount_inc(&h->use_count);
>       WARN_ON(refcount_read(&h->use_count) > 2);
>       h->orig_rsv = h->block_rsv;
>       h->block_rsv = NULL;
>       goto got_it;
>   }

start_transaction() has

    ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);

Are you saying that's wrong? Are there other cases where the shrinker
can end up starting a transaction?

> Direct IO has a similar problem, journal_info will store btrfs_dio_data,
> which will lead to illegal memory access.

I have patches getting rid of this for direct I/O here:
https://github.com/osandov/linux/tree/btrfs-journal-info-abuse

I originally did that for btrfs swapfile support, but if it actually
fixes an existing bug it should be easy to get merged.

> We fixed the problem by save the journal_info and restore afterwards.
> 
> CallTrace looks like this:
>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000021
>  IP: [<ffffffffa086f2d4>] start_transaction+0x64/0x450 [btrfs]
>  PGD 8fea4b067 PUD a33bea067 PMD 0
>  Oops: 0000 [#1] SMP
>  CPU: 3 PID: 12681 Comm: btrfs Tainted: P C O 3.10.102 #15266
>  RIP: 0010:[<ffffffffa086f2d4>] start_transaction+0x64/0x450 [btrfs]
>  Call Trace:
>  [<ffffffffa087a838>] ? btrfs_evict_inode+0x3d8/0x580 [btrfs]
>  [<ffffffff81115932>] ? evict+0xa2/0x1a0
>  [<ffffffff81112888>] ? shrink_dentry_list+0x308/0x3d0
>  [<ffffffff811137f3>] ? prune_dcache_sb+0x133/0x160
>  [<ffffffff810fa51f>] ? prune_super+0xcf/0x1a0
>  [<ffffffff810bf6bf>] ? shrink_slab+0x11f/0x1d0
>  [<ffffffff810c19f2>] ? do_try_to_free_pages+0x452/0x560
>  [<ffffffff810bf054>] ? throttle_direct_reclaim+0x74/0x240
>  [<ffffffff810c1bae>] ? try_to_free_pages+0xae/0xc0
>  [<ffffffff810ba16b>] ? __alloc_pages_nodemask+0x53b/0x9f0
>  [<ffffffff810bc89c>] ? __do_page_cache_readahead+0xec/0x270
>  [<ffffffff810bcb2b>] ? ondemand_readahead+0xbb/0x220
>  [<ffffffffa08d7c43>] ? fill_read_buf+0x2b3/0x3a0 [btrfs]
>  [<ffffffffa08dbf5e>] ? send_extent_data+0x10e/0x300 [btrfs]
>  [<ffffffffa08dc34b>] ? process_extent+0x1fb/0x1310 [btrfs]
>  [<ffffffffa08d8300>] ? iterate_dir_item.isra.28+0x1b0/0x250 [btrfs]
>  [<ffffffffa08dd500>] ? send_set_xattr+0xa0/0xa0 [btrfs]
>  [<ffffffffa08de565>] ? changed_cb+0xd5/0xc40 [btrfs]
>  [<ffffffffa08df1c2>] ? full_send_tree+0xf2/0x1a0 [btrfs]
>  [<ffffffffa08e022b>] ? btrfs_ioctl_send+0xfbb/0x1040 [btrfs]
>  [<ffffffffa08a9864>] ? btrfs_ioctl+0x1084/0x32a0 [btrfs]
> 
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> ---
>  fs/btrfs/inode.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index f534701..77aec8d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5295,6 +5295,7 @@ void btrfs_evict_inode(struct inode *inode)
>  	int steal_from_global = 0;
>  	u64 min_size;
>  	int ret;
> +	void *journal_info = NULL;

This initialization isn't necessary.

>  	trace_btrfs_inode_evict(inode);
>  
> @@ -5303,6 +5304,16 @@ void btrfs_evict_inode(struct inode *inode)
>  		return;
>  	}
>  
> +	/*
> +	 * Send or Direct IO may store information in journal_info.
> +	 * However, this function may use start_transaction and
> +	 * start_transaction will use journal_info.
> +	 * To avoid accessing invalid memory, we can save the journal_info
> +	 * and restore it later.
> +	 */
> +	journal_info = current->journal_info;
> +	current->journal_info = NULL;
> +
>  	min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
>  
>  	evict_inode_truncate_pages(inode);
> @@ -5462,6 +5473,7 @@ void btrfs_evict_inode(struct inode *inode)
>  no_delete:
>  	btrfs_remove_delayed_node(BTRFS_I(inode));
>  	clear_inode(inode);
> +	current->journal_info = journal_info;
>  }
>  
>  /*
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
robbieko May 10, 2018, 5:55 a.m. UTC | #2
Omar Sandoval 於 2018-05-10 12:53 寫到:
> On Wed, May 09, 2018 at 06:35:25PM +0800, robbieko wrote:
>> From: Robbie Ko <robbieko@synology.com>
>> 
>> When send process requires memory allocation, shrinker may be 
>> triggered
>> due to insufficient memory.
>> Then evict_inode gets called when inode is freed, and this function
>> may need to start transaction.
>> However, the journal_info is already points to BTRFS_SEND_TRANS_STUB, 
>> it
>> passed the if condition,
>> and the following use yields illegal memory access.
>> 
>>   if (current->journal_info) {
>>       WARN_ON(type & TRANS_EXTWRITERS);
>>       h = current->journal_info;
>>       refcount_inc(&h->use_count);
>>       WARN_ON(refcount_read(&h->use_count) > 2);
>>       h->orig_rsv = h->block_rsv;
>>       h->block_rsv = NULL;
>>       goto got_it;
>>   }
> 
> start_transaction() has
> 
>     ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);

I didn't turn on the configuration CONFIG_BTRFS_ASSERT, so this ASSERT 
has no effect

4506 #ifdef CONFIG_BTRFS_ASSERT
4507
4508 static inline void assfail(char *expr, char *file, int line)
4509 {
4510     pr_err("BTRFS: assertion failed: %s, file: %s, line: %d",
4511            expr, file, line);
4512     BUG();
4513 }
4514
4515 #define ASSERT(expr)    \
4516     (likely(expr) ? (void)0 : assfail(#expr, __FILE__, __LINE__))
4517 #else
4518 #define ASSERT(expr)    ((void)0)
4519 #endif


> 
> Are you saying that's wrong? Are there other cases where the shrinker
> can end up starting a transaction?

I'm not sure if there are other cases where shrinker might trigger 
start_transaction.
But I confirm that btrfs_evict_inode triggers strat_transaction



> 
>> Direct IO has a similar problem, journal_info will store 
>> btrfs_dio_data,
>> which will lead to illegal memory access.
> 
> I have patches getting rid of this for direct I/O here:
> https://github.com/osandov/linux/tree/btrfs-journal-info-abuse
> 
> I originally did that for btrfs swapfile support, but if it actually
> fixes an existing bug it should be easy to get merged.
> 
>> We fixed the problem by save the journal_info and restore afterwards.
>> 
>> CallTrace looks like this:
>>  BUG: unable to handle kernel NULL pointer dereference at 
>> 0000000000000021
>>  IP: [<ffffffffa086f2d4>] start_transaction+0x64/0x450 [btrfs]
>>  PGD 8fea4b067 PUD a33bea067 PMD 0
>>  Oops: 0000 [#1] SMP
>>  CPU: 3 PID: 12681 Comm: btrfs Tainted: P C O 3.10.102 #15266
>>  RIP: 0010:[<ffffffffa086f2d4>] start_transaction+0x64/0x450 [btrfs]
>>  Call Trace:
>>  [<ffffffffa087a838>] ? btrfs_evict_inode+0x3d8/0x580 [btrfs]
>>  [<ffffffff81115932>] ? evict+0xa2/0x1a0
>>  [<ffffffff81112888>] ? shrink_dentry_list+0x308/0x3d0
>>  [<ffffffff811137f3>] ? prune_dcache_sb+0x133/0x160
>>  [<ffffffff810fa51f>] ? prune_super+0xcf/0x1a0
>>  [<ffffffff810bf6bf>] ? shrink_slab+0x11f/0x1d0
>>  [<ffffffff810c19f2>] ? do_try_to_free_pages+0x452/0x560
>>  [<ffffffff810bf054>] ? throttle_direct_reclaim+0x74/0x240
>>  [<ffffffff810c1bae>] ? try_to_free_pages+0xae/0xc0
>>  [<ffffffff810ba16b>] ? __alloc_pages_nodemask+0x53b/0x9f0
>>  [<ffffffff810bc89c>] ? __do_page_cache_readahead+0xec/0x270
>>  [<ffffffff810bcb2b>] ? ondemand_readahead+0xbb/0x220
>>  [<ffffffffa08d7c43>] ? fill_read_buf+0x2b3/0x3a0 [btrfs]
>>  [<ffffffffa08dbf5e>] ? send_extent_data+0x10e/0x300 [btrfs]
>>  [<ffffffffa08dc34b>] ? process_extent+0x1fb/0x1310 [btrfs]
>>  [<ffffffffa08d8300>] ? iterate_dir_item.isra.28+0x1b0/0x250 [btrfs]
>>  [<ffffffffa08dd500>] ? send_set_xattr+0xa0/0xa0 [btrfs]
>>  [<ffffffffa08de565>] ? changed_cb+0xd5/0xc40 [btrfs]
>>  [<ffffffffa08df1c2>] ? full_send_tree+0xf2/0x1a0 [btrfs]
>>  [<ffffffffa08e022b>] ? btrfs_ioctl_send+0xfbb/0x1040 [btrfs]
>>  [<ffffffffa08a9864>] ? btrfs_ioctl+0x1084/0x32a0 [btrfs]
>> 
>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>> ---
>>  fs/btrfs/inode.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>> 
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index f534701..77aec8d 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -5295,6 +5295,7 @@ void btrfs_evict_inode(struct inode *inode)
>>  	int steal_from_global = 0;
>>  	u64 min_size;
>>  	int ret;
>> +	void *journal_info = NULL;
> 
> This initialization isn't necessary.
> 
>>  	trace_btrfs_inode_evict(inode);
>> 
>> @@ -5303,6 +5304,16 @@ void btrfs_evict_inode(struct inode *inode)
>>  		return;
>>  	}
>> 
>> +	/*
>> +	 * Send or Direct IO may store information in journal_info.
>> +	 * However, this function may use start_transaction and
>> +	 * start_transaction will use journal_info.
>> +	 * To avoid accessing invalid memory, we can save the journal_info
>> +	 * and restore it later.
>> +	 */
>> +	journal_info = current->journal_info;
>> +	current->journal_info = NULL;
>> +
>>  	min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
>> 
>>  	evict_inode_truncate_pages(inode);
>> @@ -5462,6 +5473,7 @@ void btrfs_evict_inode(struct inode *inode)
>>  no_delete:
>>  	btrfs_remove_delayed_node(BTRFS_I(inode));
>>  	clear_inode(inode);
>> +	current->journal_info = journal_info;
>>  }
>> 
>>  /*
>> --
>> 1.9.1
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 
>> in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba May 10, 2018, 12:56 p.m. UTC | #3
On Thu, May 10, 2018 at 01:55:45PM +0800, robbieko wrote:
> Omar Sandoval 於 2018-05-10 12:53 寫到:
> > On Wed, May 09, 2018 at 06:35:25PM +0800, robbieko wrote:
> >> From: Robbie Ko <robbieko@synology.com>
> >> 
> >> When send process requires memory allocation, shrinker may be 
> >> triggered
> >> due to insufficient memory.
> >> Then evict_inode gets called when inode is freed, and this function
> >> may need to start transaction.
> >> However, the journal_info is already points to BTRFS_SEND_TRANS_STUB, 
> >> it
> >> passed the if condition,
> >> and the following use yields illegal memory access.
> >> 
> >>   if (current->journal_info) {
> >>       WARN_ON(type & TRANS_EXTWRITERS);
> >>       h = current->journal_info;
> >>       refcount_inc(&h->use_count);
> >>       WARN_ON(refcount_read(&h->use_count) > 2);
> >>       h->orig_rsv = h->block_rsv;
> >>       h->block_rsv = NULL;
> >>       goto got_it;
> >>   }
> > 
> > start_transaction() has
> > 
> >     ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
> 
> I didn't turn on the configuration CONFIG_BTRFS_ASSERT, so this ASSERT 
> has no effect

Asserts are usually turned on, at least during development so we can
catch all sorts of problems early. In this case it caught problem that
needs to be fixed regardless of the assert being on or off.

> 
> 4506 #ifdef CONFIG_BTRFS_ASSERT
> 4507
> 4508 static inline void assfail(char *expr, char *file, int line)
> 4509 {
> 4510     pr_err("BTRFS: assertion failed: %s, file: %s, line: %d",
> 4511            expr, file, line);
> 4512     BUG();
> 4513 }
> 4514
> 4515 #define ASSERT(expr)    \
> 4516     (likely(expr) ? (void)0 : assfail(#expr, __FILE__, __LINE__))
> 4517 #else
> 4518 #define ASSERT(expr)    ((void)0)
> 4519 #endif
> 
> 
> > 
> > Are you saying that's wrong? Are there other cases where the shrinker
> > can end up starting a transaction?
> 
> I'm not sure if there are other cases where shrinker might trigger 
> start_transaction.
> But I confirm that btrfs_evict_inode triggers strat_transaction

The shrinker and allocator can call to filesystems to let them write
data and free some memory, but this is typically guarded by the GFP_NOFS
so it does not recurse to the same filesystem (and deadlock).

Playing games with the journal_info can fix that but I'd rather not
spread the pattern, Omar's removal of that from dio looks cleaner and
the right way to go.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba May 10, 2018, 1:01 p.m. UTC | #4
On Wed, May 09, 2018 at 09:53:03PM -0700, Omar Sandoval wrote:
> On Wed, May 09, 2018 at 06:35:25PM +0800, robbieko wrote:
> > From: Robbie Ko <robbieko@synology.com>
> > 
> > When send process requires memory allocation, shrinker may be triggered
> > due to insufficient memory.
> > Then evict_inode gets called when inode is freed, and this function
> > may need to start transaction.
> > However, the journal_info is already points to BTRFS_SEND_TRANS_STUB, it
> > passed the if condition,
> > and the following use yields illegal memory access.
> > 
> >   if (current->journal_info) {
> >       WARN_ON(type & TRANS_EXTWRITERS);
> >       h = current->journal_info;
> >       refcount_inc(&h->use_count);
> >       WARN_ON(refcount_read(&h->use_count) > 2);
> >       h->orig_rsv = h->block_rsv;
> >       h->block_rsv = NULL;
> >       goto got_it;
> >   }
> 
> start_transaction() has
> 
>     ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
> 
> Are you saying that's wrong? Are there other cases where the shrinker
> can end up starting a transaction?
> 
> > Direct IO has a similar problem, journal_info will store btrfs_dio_data,
> > which will lead to illegal memory access.
> 
> I have patches getting rid of this for direct I/O here:
> https://github.com/osandov/linux/tree/btrfs-journal-info-abuse
> 
> I originally did that for btrfs swapfile support, but if it actually
> fixes an existing bug it should be easy to get merged.

Yes please.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f534701..77aec8d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5295,6 +5295,7 @@  void btrfs_evict_inode(struct inode *inode)
 	int steal_from_global = 0;
 	u64 min_size;
 	int ret;
+	void *journal_info = NULL;
 
 	trace_btrfs_inode_evict(inode);
 
@@ -5303,6 +5304,16 @@  void btrfs_evict_inode(struct inode *inode)
 		return;
 	}
 
+	/*
+	 * Send or Direct IO may store information in journal_info.
+	 * However, this function may use start_transaction and
+	 * start_transaction will use journal_info.
+	 * To avoid accessing invalid memory, we can save the journal_info
+	 * and restore it later.
+	 */
+	journal_info = current->journal_info;
+	current->journal_info = NULL;
+
 	min_size = btrfs_calc_trunc_metadata_size(fs_info, 1);
 
 	evict_inode_truncate_pages(inode);
@@ -5462,6 +5473,7 @@  void btrfs_evict_inode(struct inode *inode)
 no_delete:
 	btrfs_remove_delayed_node(BTRFS_I(inode));
 	clear_inode(inode);
+	current->journal_info = journal_info;
 }
 
 /*