diff mbox series

[v5,6/6] mm,thp: avoid writes to file with THP in pagecache

Message ID 20190620205348.3980213-7-songliubraving@fb.com (mailing list archive)
State New, archived
Series Enable THP for text section of non-shmem files

Commit Message

Song Liu June 20, 2019, 8:53 p.m. UTC
In the previous patch, an application could put part of its text section
in THP via madvise(). These THPs are protected from writes while the
application is still running (ETXTBSY). However, after the application
exits, the file is available for writes.
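
As a rough user-space illustration of the madvise() step mentioned above, the sketch below advises only the huge-page-aligned part of an address range. It assumes the hint is MADV_HUGEPAGE and that a huge page is 2 MiB (x86-64 with 4 KiB base pages); the helper names are hypothetical and are not part of this series.

```c
#define _DEFAULT_SOURCE /* asks glibc to expose madvise() and MADV_HUGEPAGE */
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumed PMD-sized huge page: 2 MiB (x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/* Round addr up to the next huge-page boundary. */
static uintptr_t hpage_align_up(uintptr_t addr)
{
	return (addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Round addr down to the previous huge-page boundary. */
static uintptr_t hpage_align_down(uintptr_t addr)
{
	return addr & ~(HPAGE_SIZE - 1);
}

/*
 * Hypothetical helper: hint that the huge-page-aligned part of
 * [start, end) should be backed by THP. Returns the number of bytes
 * advised (0 if no aligned range fits), or -1 if madvise() fails or
 * MADV_HUGEPAGE is not available on this platform.
 */
static long advise_text_hugepage(uintptr_t start, uintptr_t end)
{
	uintptr_t lo, hi;

	assert(start <= end);
	lo = hpage_align_up(start);
	hi = hpage_align_down(end);
	if (lo >= hi)
		return 0;
#ifdef MADV_HUGEPAGE
	if (madvise((void *)lo, hi - lo, MADV_HUGEPAGE) != 0)
		return -1;
	return (long)(hi - lo);
#else
	return -1;
#endif
}
```

A caller would pass the bounds of its own text mapping. How those bounds are found (linker symbols, /proc/self/maps) is left out here, and the hint only marks the range as a candidate for collapse.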

This patch avoids writes to file THP by dropping the page cache for the
file when the file is opened for write. A new counter, nr_thps, is added
to struct address_space. In do_last(), if the file is open for write and
nr_thps is non-zero, we drop the page cache for the whole file.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 fs/inode.c         |  3 +++
 fs/namei.c         | 22 +++++++++++++++++++++-
 include/linux/fs.h | 31 +++++++++++++++++++++++++++++++
 mm/filemap.c       |  1 +
 mm/khugepaged.c    |  4 +++-
 5 files changed, 59 insertions(+), 2 deletions(-)
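
The mechanism described in the commit message can be modeled in user space roughly as follows. This is only a sketch: struct model_mapping and the model_* functions are hypothetical stand-ins for struct address_space, collapse_file(), truncate_pagecache() and release_file_thp(), and the real code decrements nr_thps one THP at a time as pages leave the page cache rather than resetting it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space; not the kernel type. */
struct model_mapping {
	atomic_int nr_thps; /* THPs currently in this file's page cache */
	size_t nrpages;     /* cached pages, counted in base pages */
	int writers;        /* > 0 models "the file is open for write" */
};

static void model_init(struct model_mapping *m)
{
	atomic_init(&m->nr_thps, 0);
	m->nrpages = 0;
	m->writers = 0;
}

/* Models khugepaged collapsing cached base pages into one THP. */
static void model_collapse(struct model_mapping *m)
{
	assert(m->nrpages > 0); /* there must be cached pages to collapse */
	atomic_fetch_add(&m->nr_thps, 1);
}

/* Models truncate_pagecache(inode, 0): the whole file leaves the cache. */
static void model_drop_pagecache(struct model_mapping *m)
{
	m->nrpages = 0;
	atomic_store(&m->nr_thps, 0); /* simplification, see lead-in */
}

/*
 * Models the check added to the open path: returns true if the page
 * cache was dropped because the file is open for write and still
 * holds at least one THP.
 */
static bool model_release_file_thp(struct model_mapping *m)
{
	if (m->writers > 0 && atomic_load(&m->nr_thps) > 0) {
		model_drop_pagecache(m);
		return true;
	}
	return false;
}
```

In this model a file that is only read keeps its THPs, and the first open for write pays the cost of dropping the whole cache. That matches the trade-off the commit message describes: write-back of THP is avoided rather than handled.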

Comments

Rik van Riel June 21, 2019, 12:52 a.m. UTC | #1
On Thu, 2019-06-20 at 13:53 -0700, Song Liu wrote:
> In previous patch, an application could put part of its text section
> in
> THP via madvise(). These THPs will be protected from writes when the
> application is still running (TXTBSY). However, after the application
> exits, the file is available for writes.
> 
> This patch avoids writes to file THP by dropping page cache for the
> file
> when the file is open for write. A new counter nr_thps is added to
> struct
> address_space. In do_last(), if the file is open for write and
> nr_thps
> is non-zero, we drop page cache for the whole file.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>

Acked-by: Rik van Riel <riel@surriel.com>


The comment for release_file_thp() is a little implementation
specific, which is normally a bad thing (for code we expect
to stick around for years), but probably the right thing for
code that is just a step in the direction we want to go.

Thank you for reworking these patches so quickly.
Kirill A. Shutemov June 21, 2019, 1:07 p.m. UTC | #2
On Thu, Jun 20, 2019 at 01:53:48PM -0700, Song Liu wrote:
> In previous patch, an application could put part of its text section in
> THP via madvise(). These THPs will be protected from writes when the
> application is still running (TXTBSY). However, after the application
> exits, the file is available for writes.
> 
> This patch avoids writes to file THP by dropping page cache for the file
> when the file is open for write. A new counter nr_thps is added to struct
> address_space. In do_last(), if the file is open for write and nr_thps
> is non-zero, we drop page cache for the whole file.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  fs/inode.c         |  3 +++
>  fs/namei.c         | 22 +++++++++++++++++++++-
>  include/linux/fs.h | 31 +++++++++++++++++++++++++++++++
>  mm/filemap.c       |  1 +
>  mm/khugepaged.c    |  4 +++-
>  5 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index df6542ec3b88..518113a4e219 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -181,6 +181,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  	mapping->flags = 0;
>  	mapping->wb_err = 0;
>  	atomic_set(&mapping->i_mmap_writable, 0);
> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> +	atomic_set(&mapping->nr_thps, 0);
> +#endif
>  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>  	mapping->private_data = NULL;
>  	mapping->writeback_index = 0;
> diff --git a/fs/namei.c b/fs/namei.c
> index 20831c2fbb34..de64f24b58e9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3249,6 +3249,22 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>  	return error;
>  }
>  
> +/*
> + * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
> + * it still has THP in page cache, drop the whole file from pagecache
> + * before processing writes. This helps us avoid handling write back of
> + * THP for now.
> + */
> +static inline void release_file_thp(struct file *file)
> +{
> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> +	struct inode *inode = file_inode(file);
> +
> +	if (inode_is_open_for_write(inode) && filemap_nr_thps(inode->i_mapping))
> +		truncate_pagecache(inode, 0);
> +#endif
> +}
> +
>  /*
>   * Handle the last step of open()
>   */
> @@ -3418,7 +3434,11 @@ static int do_last(struct nameidata *nd,
>  		goto out;
>  opened:
>  	error = ima_file_check(file, op->acc_mode);
> -	if (!error && will_truncate)
> +	if (error)
> +		goto out;
> +
> +	release_file_thp(file);

What protects against the file being refilled with THP in parallel?
Song Liu June 21, 2019, 1:10 p.m. UTC | #3
> On Jun 21, 2019, at 6:07 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Thu, Jun 20, 2019 at 01:53:48PM -0700, Song Liu wrote:
>> In previous patch, an application could put part of its text section in
>> THP via madvise(). These THPs will be protected from writes when the
>> application is still running (TXTBSY). However, after the application
>> exits, the file is available for writes.
>> 
>> This patch avoids writes to file THP by dropping page cache for the file
>> when the file is open for write. A new counter nr_thps is added to struct
>> address_space. In do_last(), if the file is open for write and nr_thps
>> is non-zero, we drop page cache for the whole file.
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> fs/inode.c         |  3 +++
>> fs/namei.c         | 22 +++++++++++++++++++++-
>> include/linux/fs.h | 31 +++++++++++++++++++++++++++++++
>> mm/filemap.c       |  1 +
>> mm/khugepaged.c    |  4 +++-
>> 5 files changed, 59 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/inode.c b/fs/inode.c
>> index df6542ec3b88..518113a4e219 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -181,6 +181,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>> 	mapping->flags = 0;
>> 	mapping->wb_err = 0;
>> 	atomic_set(&mapping->i_mmap_writable, 0);
>> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
>> +	atomic_set(&mapping->nr_thps, 0);
>> +#endif
>> 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>> 	mapping->private_data = NULL;
>> 	mapping->writeback_index = 0;
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 20831c2fbb34..de64f24b58e9 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3249,6 +3249,22 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>> 	return error;
>> }
>> 
>> +/*
>> + * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
>> + * it still has THP in page cache, drop the whole file from pagecache
>> + * before processing writes. This helps us avoid handling write back of
>> + * THP for now.
>> + */
>> +static inline void release_file_thp(struct file *file)
>> +{
>> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
>> +	struct inode *inode = file_inode(file);
>> +
>> +	if (inode_is_open_for_write(inode) && filemap_nr_thps(inode->i_mapping))
>> +		truncate_pagecache(inode, 0);
>> +#endif
>> +}
>> +
>> /*
>>  * Handle the last step of open()
>>  */
>> @@ -3418,7 +3434,11 @@ static int do_last(struct nameidata *nd,
>> 		goto out;
>> opened:
>> 	error = ima_file_check(file, op->acc_mode);
>> -	if (!error && will_truncate)
>> +	if (error)
>> +		goto out;
>> +
>> +	release_file_thp(file);
> 
> What protects against the file being refilled with THP in parallel?

khugepaged only processes VMAs with VM_DENYWRITE. So once the
file is open for write (i_writecount > 0), khugepaged will not
collapse the pages.

Thanks,
Song
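
The exclusion that the reply above relies on can be sketched as a single signed counter. The model below is loosely based on the kernel's get_write_access() and deny_write_access() helpers; the model_* names and struct model_inode are hypothetical, the code is single-threaded, and it illustrates the argument rather than the actual implementation.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of an inode's write counter:
 *   > 0  number of writers (file open for write)
 *   < 0  number of deny-write users (e.g. a VM_DENYWRITE text mapping)
 */
struct model_inode {
	int writecount;
};

/* Open for write: refused with -ETXTBSY while writes are denied. */
static int model_get_write_access(struct model_inode *inode)
{
	if (inode->writecount < 0)
		return -ETXTBSY;
	inode->writecount++;
	return 0;
}

/* Drop one writer taken by model_get_write_access(). */
static void model_put_write_access(struct model_inode *inode)
{
	assert(inode->writecount > 0);
	inode->writecount--;
}

/* Deny writes: refused with -ETXTBSY while a writer exists. */
static int model_deny_write_access(struct model_inode *inode)
{
	if (inode->writecount > 0)
		return -ETXTBSY;
	inode->writecount--;
	return 0;
}

/* khugepaged-style gate: collapse only for deny-write mappings. */
static int model_may_collapse(const struct model_inode *inode)
{
	return inode->writecount < 0;
}
```

Because the counter cannot be positive and negative at the same time, a writer and a collapsible mapping never coexist in this model, so the cache dropped at open time is not refilled with THP while the writer holds the file. It says nothing about THPs created outside khugepaged, which is the point raised in the follow-up.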
Kirill A. Shutemov June 21, 2019, 1:39 p.m. UTC | #4
On Fri, Jun 21, 2019 at 01:10:54PM +0000, Song Liu wrote:
> 
> 
> > On Jun 21, 2019, at 6:07 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > On Thu, Jun 20, 2019 at 01:53:48PM -0700, Song Liu wrote:
> >> In previous patch, an application could put part of its text section in
> >> THP via madvise(). These THPs will be protected from writes when the
> >> application is still running (TXTBSY). However, after the application
> >> exits, the file is available for writes.
> >> 
> >> This patch avoids writes to file THP by dropping page cache for the file
> >> when the file is open for write. A new counter nr_thps is added to struct
> >> address_space. In do_last(), if the file is open for write and nr_thps
> >> is non-zero, we drop page cache for the whole file.
> >> 
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >> fs/inode.c         |  3 +++
> >> fs/namei.c         | 22 +++++++++++++++++++++-
> >> include/linux/fs.h | 31 +++++++++++++++++++++++++++++++
> >> mm/filemap.c       |  1 +
> >> mm/khugepaged.c    |  4 +++-
> >> 5 files changed, 59 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index df6542ec3b88..518113a4e219 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -181,6 +181,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> >> 	mapping->flags = 0;
> >> 	mapping->wb_err = 0;
> >> 	atomic_set(&mapping->i_mmap_writable, 0);
> >> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> >> +	atomic_set(&mapping->nr_thps, 0);
> >> +#endif
> >> 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >> 	mapping->private_data = NULL;
> >> 	mapping->writeback_index = 0;
> >> diff --git a/fs/namei.c b/fs/namei.c
> >> index 20831c2fbb34..de64f24b58e9 100644
> >> --- a/fs/namei.c
> >> +++ b/fs/namei.c
> >> @@ -3249,6 +3249,22 @@ static int lookup_open(struct nameidata *nd, struct path *path,
> >> 	return error;
> >> }
> >> 
> >> +/*
> >> + * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
> >> + * it still has THP in page cache, drop the whole file from pagecache
> >> + * before processing writes. This helps us avoid handling write back of
> >> + * THP for now.
> >> + */
> >> +static inline void release_file_thp(struct file *file)
> >> +{
> >> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> >> +	struct inode *inode = file_inode(file);
> >> +
> >> +	if (inode_is_open_for_write(inode) && filemap_nr_thps(inode->i_mapping))
> >> +		truncate_pagecache(inode, 0);
> >> +#endif
> >> +}
> >> +
> >> /*
> >>  * Handle the last step of open()
> >>  */
> >> @@ -3418,7 +3434,11 @@ static int do_last(struct nameidata *nd,
> >> 		goto out;
> >> opened:
> >> 	error = ima_file_check(file, op->acc_mode);
> >> -	if (!error && will_truncate)
> >> +	if (error)
> >> +		goto out;
> >> +
> >> +	release_file_thp(file);
> > 
> > What protects against the file being refilled with THP in parallel?
> 
> khugepaged only processes VMAs with VM_DENYWRITE. So once the
> file is open for write (i_writecount > 0), khugepaged will not
> collapse the pages.

I have not looked at the patch very closely. Do you create THP only in
khugepaged? Not in the fault path?
Song Liu June 21, 2019, 1:43 p.m. UTC | #5
> On Jun 21, 2019, at 6:39 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Fri, Jun 21, 2019 at 01:10:54PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jun 21, 2019, at 6:07 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>> 
>>> On Thu, Jun 20, 2019 at 01:53:48PM -0700, Song Liu wrote:
>>>> In previous patch, an application could put part of its text section in
>>>> THP via madvise(). These THPs will be protected from writes when the
>>>> application is still running (TXTBSY). However, after the application
>>>> exits, the file is available for writes.
>>>> 
>>>> This patch avoids writes to file THP by dropping page cache for the file
>>>> when the file is open for write. A new counter nr_thps is added to struct
>>>> address_space. In do_last(), if the file is open for write and nr_thps
>>>> is non-zero, we drop page cache for the whole file.
>>>> 
>>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>>> ---
>>>> fs/inode.c         |  3 +++
>>>> fs/namei.c         | 22 +++++++++++++++++++++-
>>>> include/linux/fs.h | 31 +++++++++++++++++++++++++++++++
>>>> mm/filemap.c       |  1 +
>>>> mm/khugepaged.c    |  4 +++-
>>>> 5 files changed, 59 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/fs/inode.c b/fs/inode.c
>>>> index df6542ec3b88..518113a4e219 100644
>>>> --- a/fs/inode.c
>>>> +++ b/fs/inode.c
>>>> @@ -181,6 +181,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>>>> 	mapping->flags = 0;
>>>> 	mapping->wb_err = 0;
>>>> 	atomic_set(&mapping->i_mmap_writable, 0);
>>>> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
>>>> +	atomic_set(&mapping->nr_thps, 0);
>>>> +#endif
>>>> 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>>>> 	mapping->private_data = NULL;
>>>> 	mapping->writeback_index = 0;
>>>> diff --git a/fs/namei.c b/fs/namei.c
>>>> index 20831c2fbb34..de64f24b58e9 100644
>>>> --- a/fs/namei.c
>>>> +++ b/fs/namei.c
>>>> @@ -3249,6 +3249,22 @@ static int lookup_open(struct nameidata *nd, struct path *path,
>>>> 	return error;
>>>> }
>>>> 
>>>> +/*
>>>> + * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
>>>> + * it still has THP in page cache, drop the whole file from pagecache
>>>> + * before processing writes. This helps us avoid handling write back of
>>>> + * THP for now.
>>>> + */
>>>> +static inline void release_file_thp(struct file *file)
>>>> +{
>>>> +#ifdef CONFIG_READ_ONLY_THP_FOR_FS
>>>> +	struct inode *inode = file_inode(file);
>>>> +
>>>> +	if (inode_is_open_for_write(inode) && filemap_nr_thps(inode->i_mapping))
>>>> +		truncate_pagecache(inode, 0);
>>>> +#endif
>>>> +}
>>>> +
>>>> /*
>>>> * Handle the last step of open()
>>>> */
>>>> @@ -3418,7 +3434,11 @@ static int do_last(struct nameidata *nd,
>>>> 		goto out;
>>>> opened:
>>>> 	error = ima_file_check(file, op->acc_mode);
>>>> -	if (!error && will_truncate)
>>>> +	if (error)
>>>> +		goto out;
>>>> +
>>>> +	release_file_thp(file);
>>> 
>>> What protects against the file being refilled with THP in parallel?
>> 
>> khugepaged only processes VMAs with VM_DENYWRITE. So once the
>> file is open for write (i_writecount > 0), khugepaged will not
>> collapse the pages.
> 
> I have not looked at the patch very closely. Do you create THP only in
> khugepaged? Not in the fault path?

Right. My set only creates THP in khugepaged. William Kucharski is
working on the fault path. 

Thanks,
Song

Patch

diff --git a/fs/inode.c b/fs/inode.c
index df6542ec3b88..518113a4e219 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -181,6 +181,9 @@  int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->flags = 0;
 	mapping->wb_err = 0;
 	atomic_set(&mapping->i_mmap_writable, 0);
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	atomic_set(&mapping->nr_thps, 0);
+#endif
 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
 	mapping->private_data = NULL;
 	mapping->writeback_index = 0;
diff --git a/fs/namei.c b/fs/namei.c
index 20831c2fbb34..de64f24b58e9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3249,6 +3249,22 @@  static int lookup_open(struct nameidata *nd, struct path *path,
 	return error;
 }
 
+/*
+ * The file is open for write, so it is not mmapped with VM_DENYWRITE. If
+ * it still has THP in page cache, drop the whole file from pagecache
+ * before processing writes. This helps us avoid handling write back of
+ * THP for now.
+ */
+static inline void release_file_thp(struct file *file)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	struct inode *inode = file_inode(file);
+
+	if (inode_is_open_for_write(inode) && filemap_nr_thps(inode->i_mapping))
+		truncate_pagecache(inode, 0);
+#endif
+}
+
 /*
  * Handle the last step of open()
  */
@@ -3418,7 +3434,11 @@  static int do_last(struct nameidata *nd,
 		goto out;
 opened:
 	error = ima_file_check(file, op->acc_mode);
-	if (!error && will_truncate)
+	if (error)
+		goto out;
+
+	release_file_thp(file);
+	if (will_truncate)
 		error = handle_truncate(file);
 out:
 	if (unlikely(error > 0)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..3edf4ee42eee 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -444,6 +444,10 @@  struct address_space {
 	struct xarray		i_pages;
 	gfp_t			gfp_mask;
 	atomic_t		i_mmap_writable;
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	/* number of thp, only for non-shmem files */
+	atomic_t		nr_thps;
+#endif
 	struct rb_root_cached	i_mmap;
 	struct rw_semaphore	i_mmap_rwsem;
 	unsigned long		nrpages;
@@ -2790,6 +2794,33 @@  static inline errseq_t filemap_sample_wb_err(struct address_space *mapping)
 	return errseq_sample(&mapping->wb_err);
 }
 
+static inline int filemap_nr_thps(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	return atomic_read(&mapping->nr_thps);
+#else
+	return 0;
+#endif
+}
+
+static inline void filemap_nr_thps_inc(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	atomic_inc(&mapping->nr_thps);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
+static inline void filemap_nr_thps_dec(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	atomic_dec(&mapping->nr_thps);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
diff --git a/mm/filemap.c b/mm/filemap.c
index e79ceccdc6df..a8e86c136381 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -205,6 +205,7 @@  static void unaccount_page_cache_page(struct address_space *mapping,
 			__dec_node_page_state(page, NR_SHMEM_THPS);
 	} else if (PageTransHuge(page)) {
 		__dec_node_page_state(page, NR_FILE_THPS);
+		filemap_nr_thps_dec(mapping);
 	}
 
 	/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fbcff5a1d65a..17ebe9da56ce 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1500,8 +1500,10 @@  static void collapse_file(struct vm_area_struct *vma,
 
 	if (is_shmem)
 		__inc_node_page_state(new_page, NR_SHMEM_THPS);
-	else
+	else {
 		__inc_node_page_state(new_page, NR_FILE_THPS);
+		filemap_nr_thps_inc(mapping);
+	}
 
 	if (nr_none) {
 		struct zone *zone = page_zone(new_page);