diff mbox series

[2/6] block: add dio_w_*() wrappers for pin, unpin user pages

Message ID 20220827083607.2345453-3-jhubbard@nvidia.com (mailing list archive)
State New, archived
Headers show
Series convert most filesystems to pin_user_pages_fast() | expand

Commit Message

John Hubbard Aug. 27, 2022, 8:36 a.m. UTC
Background: The Direct IO part of the block infrastructure is being
changed to use pin_user_page*() and unpin_user_page*() calls, in place
of a mix of get_user_pages_fast(), get_page(), and put_page(). These
have to be changed over all at the same time, for block, bio, and all
filesystems. However, most filesystems can be changed via iomap and core
filesystem routines, so let's get that in place, and then continue on
with converting the remaining filesystems (9P, CIFS) and anything else
that feeds pages into bio that ultimately get released via
bio_release_pages().

Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
dio_w_*() wrapper functions. The dio_w_ prefix was chosen for
uniqueness, so as to ease a subsequent kernel-wide rename via
search-and-replace. Together, these allow the developer to choose
between these sets of routines, for Direct IO code paths:

a) pin_user_pages_fast()
    pin_user_page()
    unpin_user_page()

b) get_user_pages_fast()
    get_page()
    put_page()

CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and may
be deleted once the conversion is complete. In the meantime, developers
can enable this in order to try out each filesystem.

Please remember that these /proc/vmstat items (below) should normally
contain the same values as each other, except during the middle of
pin/unpin operations. As such, they can be helpful when monitoring test
runs:

    nr_foll_pin_acquired
    nr_foll_pin_released

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/Kconfig        | 24 ++++++++++++++++++++++++
 include/linux/bvec.h | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

Comments

Andrew Morton Aug. 27, 2022, 10:27 p.m. UTC | #1
On Sat, 27 Aug 2022 01:36:03 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> Background: The Direct IO part of the block infrastructure is being
> changed to use pin_user_page*() and unpin_user_page*() calls, in place
> of a mix of get_user_pages_fast(), get_page(), and put_page(). These
> have to be changed over all at the same time, for block, bio, and all
> filesystems. However, most filesystems can be changed via iomap and core
> filesystem routines, so let's get that in place, and then continue on
> with converting the remaining filesystems (9P, CIFS) and anything else
> that feeds pages into bio that ultimately get released via
> bio_release_pages().
> 
> Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
> dio_w_*() wrapper functions. The dio_w_ prefix was chosen for
> uniqueness, so as to ease a subsequent kernel-wide rename via
> search-and-replace. Together, these allow the developer to choose
> between these sets of routines, for Direct IO code paths:
> 
> a) pin_user_pages_fast()
>     pin_user_page()
>     unpin_user_page()
> 
> b) get_user_pages_fast()
>     get_page()
>     put_page()
> 
> CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and may
> be deleted once the conversion is complete. In the meantime, developers
> can enable this in order to try out each filesystem.
> 
> Please remember that these /proc/vmstat items (below) should normally
> contain the same values as each other, except during the middle of
> pin/unpin operations. As such, they can be helpful when monitoring test
> runs:
> 
>     nr_foll_pin_acquired
>     nr_foll_pin_released
> 
> ...
>
> +static inline void dio_w_unpin_user_pages(struct page **pages,
> +					  unsigned long npages)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; i++)
> +		put_page(pages[i]);
> +}

release_pages()?  Might be faster if many of the pages are page_count()==1.

(release_pages() was almost as simple as the above when I added it a
million years ago.  But then progress happened).
John Hubbard Aug. 27, 2022, 11:59 p.m. UTC | #2
On 8/27/22 15:27, Andrew Morton wrote:
>> +static inline void dio_w_unpin_user_pages(struct page **pages,
>> +					  unsigned long npages)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < npages; i++)
>> +		put_page(pages[i]);
>> +}
> 
> release_pages()?  Might be faster if many of the pages are page_count()==1.

Sure. I was being perhaps too cautious about changing too many things
at once, earlier when I wrote this. 

> 
> (release_pages() was almost as simple as the above when I added it a
> million years ago.  But then progress happened).
> 

Actually, I'm tempted update the release_pages() API as well, because it
uses an int for npages, while other things (in gup.c, anyway) are moving
over to unsigned long.

Anyway, I'll change my patch locally for now, to this:

static inline void dio_w_unpin_user_pages(struct page **pages,
					  unsigned long npages)
{
	/* Careful, release_pages() uses a smaller integer type for npages: */
	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
		return;

	release_pages(pages, (int)npages);
}

...in hopes that I can somehow find a way to address Al Viro's other
comments, which have the potential to doom the whole series, heh.


thanks,
Andrew Morton Aug. 28, 2022, 12:12 a.m. UTC | #3
On Sat, 27 Aug 2022 16:59:32 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> Anyway, I'll change my patch locally for now, to this:
> 
> static inline void dio_w_unpin_user_pages(struct page **pages,
> 					  unsigned long npages)
> {
> 	/* Careful, release_pages() uses a smaller integer type for npages: */
> 	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
> 		return;
> 
> 	release_pages(pages, (int)npages);
> }

Well, it might be slower.  release_pages() has a ton of fluff.

As mentioned, the above might be faster if the pages tend
to have page_count()==1??
John Hubbard Aug. 28, 2022, 12:31 a.m. UTC | #4
On 8/27/22 17:12, Andrew Morton wrote:
> On Sat, 27 Aug 2022 16:59:32 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
> 
>> Anyway, I'll change my patch locally for now, to this:
>>
>> static inline void dio_w_unpin_user_pages(struct page **pages,
>> 					  unsigned long npages)
>> {
>> 	/* Careful, release_pages() uses a smaller integer type for npages: */
>> 	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
>> 		return;
>>
>> 	release_pages(pages, (int)npages);
>> }
> 
> Well, it might be slower.  release_pages() has a ton of fluff.
> 
> As mentioned, the above might be faster if the pages tend
> to have page_count()==1??
> 

I don't think we can know the answer to that. This code is called
in all kinds of situations, and it seems to me that whatever
tradeoffs are best for release_pages(), are probably also reasonable
for this code.

Even with all the fluff in release_pages(), it at least batches
the pages, as opposed to the simple put_page() in a loop. Most of 
the callers do have more than one page to release in non-error cases,
so release_pages() does seem better.


thanks,
John Hubbard Aug. 28, 2022, 1:07 a.m. UTC | #5
On 8/27/22 17:31, John Hubbard wrote:
> On 8/27/22 17:12, Andrew Morton wrote:
>> On Sat, 27 Aug 2022 16:59:32 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
>>
>>> Anyway, I'll change my patch locally for now, to this:
>>>
>>> static inline void dio_w_unpin_user_pages(struct page **pages,
>>> 					  unsigned long npages)
>>> {
>>> 	/* Careful, release_pages() uses a smaller integer type for npages: */
>>> 	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
>>> 		return;
>>>
>>> 	release_pages(pages, (int)npages);
>>> }
>>
>> Well, it might be slower.  release_pages() has a ton of fluff.
>>
>> As mentioned, the above might be faster if the pages tend
>> to have page_count()==1??
>>
> 
> I don't think we can know the answer to that. This code is called

To clarify: I mean, I don't think we can know whether or not these pages 
usually have page_count()==1.


thanks,
diff mbox series

Patch

diff --git a/block/Kconfig b/block/Kconfig
index 444c5ab3b67e..d4fdd606d138 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -48,6 +48,30 @@  config BLK_DEV_BSG_COMMON
 config BLK_ICQ
 	bool
 
+config BLK_USE_PIN_USER_PAGES_FOR_DIO
+	bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
+	default n
+
+	help
+	  For Direct IO code, retain the pages via calls to
+	  pin_user_pages_fast(), instead of via get_user_pages_fast().
+	  Likewise, use pin_user_page() instead of get_page(). And then
+	  release such pages via unpin_user_page(), instead of
+	  put_page().
+
+	  This is a temporary setting, which will be deleted once the
+	  conversion is completed, reviewed, and tested. In the meantime,
+	  developers can enable this in order to try out each filesystem.
+	  For that, it's best to monitor these /proc/vmstat items:
+
+		nr_foll_pin_acquired
+		nr_foll_pin_released
+
+	  ...to ensure that they remain equal, when "at rest".
+
+	  Say yes here ONLY if are actively developing or testing the
+	  block layer or filesystems with pin_user_pages_fast().
+
 config BLK_DEV_BSGLIB
 	bool "Block layer SG support v4 helper lib"
 	select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..33fc119da9fc 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,4 +241,44 @@  static inline void *bvec_virt(struct bio_vec *bvec)
 	return page_address(bvec->bv_page) + bvec->bv_offset;
 }
 
+#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
+#define dio_w_pin_user_pages_fast(s, n, p, f)	pin_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			pin_user_page(p)
+#define dio_w_iov_iter_pin_pages(i, p, m, n, s) iov_iter_pin_pages(i, p, m, n, s)
+#define dio_w_iov_iter_pin_pages_alloc(i, p, m, s) iov_iter_pin_pages_alloc(i, p, m, s)
+#define dio_w_unpin_user_page(p)		unpin_user_page(p)
+#define dio_w_unpin_user_pages(p, n)		unpin_user_pages(p, n)
+#define dio_w_unpin_user_pages_dirty_lock(p, n, d) unpin_user_pages_dirty_lock(p, n, d)
+
+#else
+#define dio_w_pin_user_pages_fast(s, n, p, f)	get_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			get_page(p)
+#define dio_w_iov_iter_pin_pages(i, p, m, n, s) iov_iter_get_pages2(i, p, m, n, s)
+#define dio_w_iov_iter_pin_pages_alloc(i, p, m, s) iov_iter_get_pages_alloc2(i, p, m, s)
+#define dio_w_unpin_user_page(p)		put_page(p)
+
+static inline void dio_w_unpin_user_pages(struct page **pages,
+					  unsigned long npages)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; i++)
+		put_page(pages[i]);
+}
+
+static inline void dio_w_unpin_user_pages_dirty_lock(struct page **pages,
+						     unsigned long npages,
+						     bool make_dirty)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		if (make_dirty)
+			set_page_dirty_lock(pages[i]);
+		put_page(pages[i]);
+	}
+}
+
+#endif
+
 #endif /* __LINUX_BVEC_H */