diff mbox series

[v2] hugetlbfs: fix memory leak for resv_map

Message ID 20190306061007.61645-1-yuyufen@huawei.com (mailing list archive)
State New, archived
Headers show
Series [v2] hugetlbfs: fix memory leak for resv_map | expand

Commit Message

Yufen Yu March 6, 2019, 6:10 a.m. UTC
When .mknod create a block device file in hugetlbfs, it will
allocate an inode, and kmalloc a 'struct resv_map' in resv_map_alloc().
For now, inode->i_mapping->private_data is used to point the resv_map.
However, when open the device, bd_acquire() will set i_mapping as
bd_inode->imapping, result in resv_map memory leak.

We fix it by waiting until a call to hugetlb_reserve_pages() to allocate
the inode specific resv_map. We could then remove the resv_map allocation
at inode creation time.

Programs to reproduce:
	mount -t hugetlbfs nodev hugetlbfs
	mknod hugetlbfs/dev b 0 0
	exec 30<> hugetlbfs/dev
	umount hugetlbfs/

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
---
 fs/hugetlbfs/inode.c | 10 ++--------
 mm/hugetlb.c         |  6 ++++++
 2 files changed, 8 insertions(+), 8 deletions(-)

Comments

Mike Kravetz March 6, 2019, 11:52 p.m. UTC | #1
On 3/5/19 10:10 PM, Yufen Yu wrote:
> When .mknod create a block device file in hugetlbfs, it will
> allocate an inode, and kmalloc a 'struct resv_map' in resv_map_alloc().
> For now, inode->i_mapping->private_data is used to point the resv_map.
> However, when open the device, bd_acquire() will set i_mapping as
> bd_inode->imapping, result in resv_map memory leak.
> 
> We fix it by waiting until a call to hugetlb_reserve_pages() to allocate
> the inode specific resv_map. We could then remove the resv_map allocation
> at inode creation time.
> 
> Programs to reproduce:
> 	mount -t hugetlbfs nodev hugetlbfs
> 	mknod hugetlbfs/dev b 0 0
> 	exec 30<> hugetlbfs/dev
> 	umount hugetlbfs/
> 
> Signed-off-by: Yufen Yu <yuyufen@huawei.com>

Thank you.  That is the approach I had in mind.

Unfortunately, this patch causes several regressions in the libhugetlbfs
test suite.  I have not debugged to determine exact cause.  

I was unsure about one thing with this approach.  We set
inode->i_mapping->private_data while holding the inode lock, so there
should be no problem there.  However, we access inode_resv_map() in the
page fault path without the inode lock.  The page fault path should get
NULL or a resv_map.  I just wonder if there may be some races where the
fault path may still be seeing NULL.

I can do more debug, but it will take a couple days as I am busy with
other things right now.
Mike Kravetz March 7, 2019, 11:50 p.m. UTC | #2
Adding others on Cc to see if they have comments or opinions.

On 3/6/19 3:52 PM, Mike Kravetz wrote:
> On 3/5/19 10:10 PM, Yufen Yu wrote:
>> When .mknod create a block device file in hugetlbfs, it will
>> allocate an inode, and kmalloc a 'struct resv_map' in resv_map_alloc().
>> For now, inode->i_mapping->private_data is used to point the resv_map.
>> However, when open the device, bd_acquire() will set i_mapping as
>> bd_inode->imapping, result in resv_map memory leak.
>>
>> We fix it by waiting until a call to hugetlb_reserve_pages() to allocate
>> the inode specific resv_map. We could then remove the resv_map allocation
>> at inode creation time.
>>
>> Programs to reproduce:
>> 	mount -t hugetlbfs nodev hugetlbfs
>> 	mknod hugetlbfs/dev b 0 0
>> 	exec 30<> hugetlbfs/dev
>> 	umount hugetlbfs/
>>
>> Signed-off-by: Yufen Yu <yuyufen@huawei.com>
> 
> Thank you.  That is the approach I had in mind.
> 
> Unfortunately, this patch causes several regressions in the libhugetlbfs
> test suite.  I have not debugged to determine exact cause.  
> 
> I was unsure about one thing with this approach.  We set
> inode->i_mapping->private_data while holding the inode lock, so there
> should be no problem there.  However, we access inode_resv_map() in the
> page fault path without the inode lock.  The page fault path should get
> NULL or a resv_map.  I just wonder if there may be some races where the
> fault path may still be seeing NULL.
> 
> I can do more debug, but it will take a couple days as I am busy with
> other things right now.

My apologies.  Calling resv_map_alloc() only from hugetlb_reserve_pages()
is not going to work.  The reason why is that the reserv_map is used to track
page allocations even if there are not reservations.  So, if reservations
are not created other huge page accounting is impacted.

Sorry for suggesting that approach.

As mentioned, I do not like your original approach as it adds an extra word
to every hugetlbfs inode.  How about something like the following which
only adds the resv_map to inodes which can have associated page allocations?
I have only done limited regression testing with this.

From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Thu, 7 Mar 2019 15:37:31 -0800
Subject: [PATCH] hugetlbfs: only allocate reserve map for inodes that can
 allocate pages

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a7fa037b876b..a3a3d256fb0e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -741,11 +741,17 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					umode_t mode, dev_t dev)
 {
 	struct inode *inode;
-	struct resv_map *resv_map;
+	struct resv_map *resv_map = NULL;
 
-	resv_map = resv_map_alloc();
-	if (!resv_map)
-		return NULL;
+	/*
+	 * Reserve maps are only needed for inodes that can have associated
+	 * page allocations.
+	 */
+	if (S_ISREG(mode) || S_ISLNK(mode)) {
+		resv_map = resv_map_alloc();
+		if (!resv_map)
+			return NULL;
+	}
 
 	inode = new_inode(sb);
 	if (inode) {
@@ -780,8 +786,10 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 			break;
 		}
 		lockdep_annotate_inode_mutex_key(inode);
-	} else
-		kref_put(&resv_map->refs, resv_map_release);
+	} else {
+		if (resv_map)
+			kref_put(&resv_map->refs, resv_map_release);
+	}
 
 	return inode;
 }
diff mbox series

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a7fa037b876b..eb87cfb80eb9 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -741,11 +741,6 @@  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					umode_t mode, dev_t dev)
 {
 	struct inode *inode;
-	struct resv_map *resv_map;
-
-	resv_map = resv_map_alloc();
-	if (!resv_map)
-		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
@@ -757,7 +752,7 @@  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 				&hugetlbfs_i_mmap_rwsem_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
-		inode->i_mapping->private_data = resv_map;
+		inode->i_mapping->private_data = NULL;
 		info->seals = F_SEAL_SEAL;
 		switch (mode & S_IFMT) {
 		default:
@@ -780,8 +775,7 @@  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 			break;
 		}
 		lockdep_annotate_inode_mutex_key(inode);
-	} else
-		kref_put(&resv_map->refs, resv_map_release);
+	}
 
 	return inode;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8dfdffc34a99..088fd1239d49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4462,6 +4462,12 @@  int hugetlb_reserve_pages(struct inode *inode,
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
 		resv_map = inode_resv_map(inode);
+		if (!resv_map) {
+			resv_map = resv_map_alloc();
+			if (!resv_map)
+				return -ENOMEM;
+			inode->i_mapping->private_data = resv_map;
+		}
 
 		chg = region_chg(resv_map, from, to);