[PATCHv4,00/25] THP-enabled tmpfs/shmem

Message ID 20160324091727.GA26796@node.shutemov.name
State New

Commit Message

Kirill A. Shutemov March 24, 2016, 9:17 a.m. UTC
On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> The small files thing formed my first impression.  My second
> impression was similar, when I tried mmap(NULL, size_of_RAM,
> PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> cycled around the arena touching all the pages (which of
> course has to push a little into swap): that soon OOMed.
> 
> But there I think you probably just have some minor bug to be fixed:
> I spent a little while trying to debug it, but then decided I'd
> better get back to writing to you.  I didn't really understand what
> I was seeing, but when I hacked some stats into shrink_page_list(),
> converting !is_page_cache_freeable(page) to page_cache_references(page)
> to return the difference instead of the bool, a large proportion of
> huge tmpfs pages seemed to have count 1 too high to be freeable at
> that point (and one huge tmpfs page had a count of 3477).

I'll reply to your other points later, but first I wanted to address this
obvious bug.

I cannot really explain page_count() == 3477, but otherwise:

The root cause is that try_to_unmap() doesn't handle PMD-mapped huge
pages, so we hit 'case SWAP_AGAIN' all the time.

The patch below effectively rewrites 17/25: now we split the huge page
before trying to unmap it.

split_huge_page() has its own check similar to is_page_cache_freeable(),
so we wouldn't split pages we cannot free later on.

And split_huge_page() for file pages unmaps the page, so we don't need to
go through try_to_unmap() after that.

The patch looks rather simple, but I haven't done a full validation cycle
for it. Regressions are unlikely, but possible.

At some point we would need to teach try_to_unmap() to handle huge pages.
It would be required for filesystems with backing storage. But I don't see
a need for it to get huge tmpfs/shmem working.

Comments

Hugh Dickins March 24, 2016, 7:08 p.m. UTC | #1
On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > The small files thing formed my first impression.  My second
> > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > cycled around the arena touching all the pages (which of
> > course has to push a little into swap): that soon OOMed.
> > 
> > But there I think you probably just have some minor bug to be fixed:
> > I spent a little while trying to debug it, but then decided I'd
> > better get back to writing to you.  I didn't really understand what
> > I was seeing, but when I hacked some stats into shrink_page_list(),
> > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > to return the difference instead of the bool, a large proportion of
> > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > that point (and one huge tmpfs page had a count of 3477).
> 
> I'll reply to your other points later, but first I wanted to address this
> obvious bug.

Thanks.  That works better, but is not yet right: memory isn't freed
as it should be, so when I exit then try to run a second time, the
mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
MemFree is low.  No rush to fix, I've other stuff to do.
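
Roughly what the failing second run amounts to, as an illustration only
(the sysinfo() sizing and error reporting below are a sketch, not my
actual test harness):

/* Try a size-of-RAM shared anonymous mapping, as in the test described
 * above: with memory not freed after the first run, the heuristic
 * overcommit check (overcommit_memory 0) sees too little free memory
 * and mmap() fails here with ENOMEM. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;
	size_t size;
	void *p;

	if (sysinfo(&si) != 0) {
		perror("sysinfo");
		return 1;
	}
	size = (size_t)si.totalram * si.mem_unit;	/* ~ size of RAM */

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	if (p == MAP_FAILED) {
		fprintf(stderr, "mmap(%zu): %s\n", size, strerror(errno));
		return 1;
	}
	munmap(p, size);
	return 0;
}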

I don't get as far as that on the laptop, since the first run is OOM
killed while swapping; but I can't vouch for the OOM-kill-correctness
of the base tree I'm using, and this laptop has a history of OOMing
rather too easily if all's not right.

Hugh
Kirill A. Shutemov March 25, 2016, 3:04 p.m. UTC | #2
On Thu, Mar 24, 2016 at 12:08:55PM -0700, Hugh Dickins wrote:
> On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> > On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > > The small files thing formed my first impression.  My second
> > > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > > cycled around the arena touching all the pages (which of
> > > course has to push a little into swap): that soon OOMed.
> > > 
> > > But there I think you probably just have some minor bug to be fixed:
> > > I spent a little while trying to debug it, but then decided I'd
> > > better get back to writing to you.  I didn't really understand what
> > > I was seeing, but when I hacked some stats into shrink_page_list(),
> > > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > > to return the difference instead of the bool, a large proportion of
> > > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > > that point (and one huge tmpfs page had a count of 3477).
> > 
> > I'll reply to your other points later, but first I wanted to address this
> > obvious bug.
> 
> Thanks.  That works better, but is not yet right: memory isn't freed
> as it should be, so when I exit then try to run a second time, the
> mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
> MemFree is low.  No rush to fix, I've other stuff to do.
> 
> I don't get as far as that on the laptop, since the first run is OOM
> killed while swapping; but I can't vouch for the OOM-kill-correctness
> of the base tree I'm using, and this laptop has a history of OOMing
> rather too easily if all's not right.

Hm. I don't see the issue.

I tried to reproduce it in my VM with the following script:

#!/bin/sh -efu

swapon -a

ram="$(grep MemTotal /proc/meminfo | sed 's,[^0-9\]\+,,; s, kB,k,')"

usemem -w -f /dev/zero "$ram"

swapoff -a
swapon -a

usemem -w -f /dev/zero "$ram"

cat /proc/meminfo
grep thp /proc/vmstat

-----

usemem is a tool from this archive:

http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar

It works fine even if I double the size of the mapping.

Do you have a reproducer?
Hugh Dickins March 26, 2016, midnight UTC | #3
On Fri, 25 Mar 2016, Kirill A. Shutemov wrote:
> On Thu, Mar 24, 2016 at 12:08:55PM -0700, Hugh Dickins wrote:
> > On Thu, 24 Mar 2016, Kirill A. Shutemov wrote:
> > > On Wed, Mar 23, 2016 at 01:09:05PM -0700, Hugh Dickins wrote:
> > > > The small files thing formed my first impression.  My second
> > > > impression was similar, when I tried mmap(NULL, size_of_RAM,
> > > > PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0) and
> > > > cycled around the arena touching all the pages (which of
> > > > course has to push a little into swap): that soon OOMed.
> > > > 
> > > > But there I think you probably just have some minor bug to be fixed:
> > > > I spent a little while trying to debug it, but then decided I'd
> > > > better get back to writing to you.  I didn't really understand what
> > > > I was seeing, but when I hacked some stats into shrink_page_list(),
> > > > converting !is_page_cache_freeable(page) to page_cache_references(page)
> > > > to return the difference instead of the bool, a large proportion of
> > > > huge tmpfs pages seemed to have count 1 too high to be freeable at
> > > > that point (and one huge tmpfs page had a count of 3477).
> > > 
> > > I'll reply to your other points later, but first I wanted to address this
> > > obvious bug.
> > 
> > Thanks.  That works better, but is not yet right: memory isn't freed
> > as it should be, so when I exit then try to run a second time, the
> > mmap() just gets ENOMEM (with /proc/sys/vm/overcommit_memory 0):
> > MemFree is low.  No rush to fix, I've other stuff to do.
> > 
> > I don't get as far as that on the laptop, since the first run is OOM
> > killed while swapping; but I can't vouch for the OOM-kill-correctness
> > of the base tree I'm using, and this laptop has a history of OOMing
> > rather too easily if all's not right.
> 
> Hm. I don't see the issue.
> 
> I tried to reproduce it in my VM with the following script:
> 
> #!/bin/sh -efu
> 
> swapon -a
> 
> ram="$(grep MemTotal /proc/meminfo | sed 's,[^0-9\]\+,,; s, kB,k,')"
> 
> usemem -w -f /dev/zero "$ram"
> 
> swapoff -a
> swapon -a
> 
> usemem -w -f /dev/zero "$ram"
> 
> cat /proc/meminfo
> grep thp /proc/vmstat
> 
> -----
> 
> usemem is a tool from this archive:
> 
> http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
> 
> It works fine even if I double the size of the mapping.
> 
> Do you have a reproducer?

Yes, my reproducer is simpler (just cycling twice around the arena,
touching each page in order); and I too did not see it running your
script using usemem above.  It looks as if that invocation isn't doing
enough work with swap: if I add a "-r 2" to those usemem lines, then
I get "usemem: mmap failed: Cannot allocate memory" on the second.

I also added a "sleep 2" before the second call to usemem: I'm not sure
of the current state of vmstat, but historically it's slow to gather
back from each cpu to global, and I think it used to leave some cpu
counts stranded indefinitely once upon a time.  In my own testing,
I have a /proc/sys/vm/stat_refresh to touch before checking meminfo
or vmstat - and I think the vm_enough_memory() check in mmap() may
need that same care, since it refers to NR_FREE_PAGES etc.

8GB is my ramsize, if that matters.

Hugh

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9fa9e15594e9..86008f8f1f9b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -473,14 +473,12 @@  void drop_slab(void)
 
 static inline int is_page_cache_freeable(struct page *page)
 {
-	int radix_tree_pins = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
-
 	/*
 	 * A freeable page cache page is referenced only by the caller
 	 * that isolated the page, the page cache radix tree and
 	 * optional buffer heads at page->private.
 	 */
-	return page_count(page) - page_has_private(page) == 1 + radix_tree_pins;
+	return page_count(page) - page_has_private(page) == 2;
 }
 
 static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -550,6 +548,8 @@  static pageout_t pageout(struct page *page, struct address_space *mapping,
 	 * swap_backing_dev_info is bust: it doesn't reflect the
 	 * congestion state of the swapdevs.  Easy to fix, if needed.
 	 */
+	if (!is_page_cache_freeable(page))
+		return PAGE_KEEP;
 	if (!mapping) {
 		/*
 		 * Some data journaling orphaned pages can have
@@ -1055,8 +1055,14 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/* Adding to swap updated mapping */
 			mapping = page_mapping(page);
+		} else if (unlikely(PageTransHuge(page))) {
+			/* Split file THP */
+			if (split_huge_page_to_list(page, page_list))
+				goto keep_locked;
 		}
 
+		VM_BUG_ON_PAGE(PageTransHuge(page), page);
+
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
@@ -1112,15 +1118,6 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
-
-			if (!is_page_cache_freeable(page))
-				goto keep_locked;
-
-			if (unlikely(PageTransHuge(page))) {
-				if (split_huge_page_to_list(page, page_list))
-					goto keep_locked;
-			}
-
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;