
Re: [PATCH v6 03/11] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

Message ID 20171110010907.qfkqhrbtdkt5y3hy@smitten (mailing list archive)
State New, archived

Commit Message

Tycho Andersen Nov. 10, 2017, 1:09 a.m. UTC
Hi Dave,

On Wed, Sep 20, 2017 at 05:27:02PM -0700, Dave Hansen wrote:
> On 09/20/2017 05:09 PM, Tycho Andersen wrote:
> >> I think the only thing that will really help here is if you batch the
> >> allocations.  For instance, you could make sure that the per-cpu-pageset
> >> lists always contain either all kernel or all user data.  Then remap the
> >> entire list at once and do a single flush after the entire list is consumed.
> > Just so I understand, the idea would be that we only flush when the
> > type of allocation alternates, so:
> > 
> > kmalloc(..., GFP_KERNEL);
> > kmalloc(..., GFP_KERNEL);
> > /* remap+flush here */
> > kmalloc(..., GFP_HIGHUSER);
> > /* remap+flush here */
> > kmalloc(..., GFP_KERNEL);
> 
> Not really.  We keep a free list per migrate type, and a per_cpu_pages
> (pcp) list per migratetype:
> 
> > struct per_cpu_pages {
> >         int count;              /* number of pages in the list */
> >         int high;               /* high watermark, emptying needed */
> >         int batch;              /* chunk size for buddy add/remove */
> > 
> >         /* Lists of pages, one per migrate type stored on the pcp-lists */
> >         struct list_head lists[MIGRATE_PCPTYPES];
> > };
> 
> The migratetype is derived from the GFP flags in
> gfpflags_to_migratetype().  In general, GFP_HIGHUSER and GFP_KERNEL map
> to different migratetypes, so they come from different free lists.
> 
> In your case above, the GFP_HIGHUSER allocations come through the
> MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the
> MIGRATE_UNMOVABLE one.  Since we add a bunch of pages to those lists at
> once, you could do all the mapping/unmapping/flushing on a bunch of
> pages at once.

So I've been playing around with an implementation of this (the patch is in the Patch section below), but it faults while mounting the root filesystem; here's the tail of the oops:

[    1.911293] CR2: ffff880139b75012 CR3: 000000011a965000 CR4: 00000000000006e0
[    1.912050] Call Trace:
[    1.912356]  ? register_shrinker+0x80/0x90
[    1.912826]  mount_bdev+0x177/0x1b0
[    1.913234]  ? ext4_calculate_overhead+0x4a0/0x4a0
[    1.913744]  ext4_mount+0x10/0x20
[    1.914115]  mount_fs+0x2d/0x140
[    1.914490]  ? __alloc_percpu+0x10/0x20
[    1.914903]  vfs_kern_mount.part.20+0x58/0x110
[    1.915394]  do_mount+0x1cc/0xca0
[    1.915758]  ? _copy_from_user+0x6b/0xa0
[    1.916198]  ? memdup_user+0x3d/0x70
[    1.916576]  SyS_mount+0x93/0xe0
[    1.916915]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[    1.917401] RIP: 0033:0x7f8169264b5a
[    1.917777] RSP: 002b:00007fff6ce82bc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[    1.918576] RAX: ffffffffffffffda RBX: 0000000000fb2030 RCX: 00007f8169264b5a
[    1.919313] RDX: 00007fff6ce84e61 RSI: 00007fff6ce84e70 RDI: 00007fff6ce84e66
[    1.920042] RBP: 0000000000008000 R08: 0000000000000000 R09: 0000000000000000
[    1.920771] R10: 0000000000008001 R11: 0000000000000202 R12: 0000000000000000
[    1.921512] R13: 0000000000000000 R14: 00007fff6ce82c70 R15: 0000000000445c20
[    1.922254] Code: 83 ee 01 48 c7 c7 70 e6 97 81 e8 1d 0c e2 ff 48 89 de 48 c7 c7 a4 48 96 81 e8 0e 0c e2 ff 8b 85 5c ff ff ff 41 39 44 24 40 75 0e <f6> 43 12 04 41 0f 44 c7 89 85 5c ff ff ff 48 c7 c7 ad 48 96 81 
[    1.924489] RIP: ext4_fill_super+0x2f3b/0x33c0 RSP: ffffc9001a33bce0
[    1.925334] CR2: ffff880139b75012
[    1.942161] ---[ end trace fe884f328a0a7338 ]---

This is the code:

if ((grp == sbi->s_groups_count) &&
    !(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))

in fs/ext4/super.c:ext4_check_descriptors() that's ultimately failing. It looks
like this allocation comes from sb_bread_unmovable(), which, despite the name,
seems to allocate the memory with:

MOVABLE
IO
NOFAIL
HARDWALL
DIRECT_RECLAIM
KSWAPD_RECLAIM

which I guess is from the additional flags in grow_dev_page() somewhere down
the stack. Anyway... it seems this is a kernel allocation that's using
MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.

Do you have any ideas?

> Or, you could hook your code into the places where the migratetype of
> memory is changed (set_pageblock_migratetype(), plus where we fall
> back).  Those changes are much more rare than page allocation.

I guess this has the same issue, that sometimes the kernel allocates MOVABLE
stuff that it wants to use.

Thanks,

Tycho

Comments

Dave Hansen Nov. 13, 2017, 10:20 p.m. UTC | #1
On 11/09/2017 05:09 PM, Tycho Andersen wrote:
> which I guess is from the additional flags in grow_dev_page() somewhere down
> the stack. Anyway... it seems this is a kernel allocation that's using
> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
> 
> Do you have any ideas?

It still has to do a kmap() or kmap_atomic() to be able to access it.  I
thought you hooked into that.  Why isn't that path getting hit for these?
Dave Hansen Nov. 13, 2017, 10:46 p.m. UTC | #2
On 11/13/2017 02:20 PM, Dave Hansen wrote:
> On 11/09/2017 05:09 PM, Tycho Andersen wrote:
>> which I guess is from the additional flags in grow_dev_page() somewhere down
>> the stack. Anyway... it seems this is a kernel allocation that's using
>> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
>> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
>>
>> Do you have any ideas?
> 
> It still has to do a kmap() or kmap_atomic() to be able to access it.  I
> thought you hooked into that.  Why isn't that path getting hit for these?

Oh, this looks to be accessing data mapped by a buffer_head.  It
(rudely) accesses data via:

void set_bh_page(struct buffer_head *bh,
...
	bh->b_data = page_address(page) + offset;
Tycho Andersen Nov. 15, 2017, 12:33 a.m. UTC | #3
Hi Dave,

On Mon, Nov 13, 2017 at 02:46:25PM -0800, Dave Hansen wrote:
> On 11/13/2017 02:20 PM, Dave Hansen wrote:
> > On 11/09/2017 05:09 PM, Tycho Andersen wrote:
> >> which I guess is from the additional flags in grow_dev_page() somewhere down
> >> the stack. Anyway... it seems this is a kernel allocation that's using
> >> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
> >> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
> >>
> >> Do you have any ideas?
> > 
> > It still has to do a kmap() or kmap_atomic() to be able to access it.  I
> > thought you hooked into that.  Why isn't that path getting hit for these?
> 
> Oh, this looks to be accessing data mapped by a buffer_head.  It
> (rudely) accesses data via:
> 
> void set_bh_page(struct buffer_head *bh,
> ...
> 	bh->b_data = page_address(page) + offset;

Ah, yes. I guess there will be many bugs like this :). Anyway, I'll
try to cook up a patch.

Thanks!

Tycho
Dave Hansen Nov. 15, 2017, 12:37 a.m. UTC | #4
On 11/14/2017 04:33 PM, Tycho Andersen wrote:
>>
>> void set_bh_page(struct buffer_head *bh,
>> ...
>> 	bh->b_data = page_address(page) + offset;
> Ah, yes. I guess there will be many bugs like this :). Anyway, I'll
> try to cook up a patch.

It won't catch all the bugs, but it might be handy to have a debugging
mode that records the location of the last user of page_address() and
friends.  That way, when we trip over an unmapped page, we have an
easier time finding the offender.
Tycho Andersen Nov. 15, 2017, 12:42 a.m. UTC | #5
On Tue, Nov 14, 2017 at 04:37:34PM -0800, Dave Hansen wrote:
> On 11/14/2017 04:33 PM, Tycho Andersen wrote:
> >>
> >> void set_bh_page(struct buffer_head *bh,
> >> ...
> >> 	bh->b_data = page_address(page) + offset;
> > Ah, yes. I guess there will be many bugs like this :). Anyway, I'll
> > try to cook up a patch.
> 
> It won't catch all the bugs, but it might be handy to have a debugging
> mode that records the location of the last user of page_address() and
> friends.  That way, when we trip over an unmapped page, we have an
> easier time finding the offender.

Ok, what I've been doing now is saving the stack frame of the code
that allocated the page, which also seems useful. I'll see about
adding a DEBUG_XPFO config option for the next series with both of
these things, though.

Cheers,

Tycho
Matthew Wilcox Nov. 15, 2017, 3:44 a.m. UTC | #6
On Mon, Nov 13, 2017 at 02:46:25PM -0800, Dave Hansen wrote:
> On 11/13/2017 02:20 PM, Dave Hansen wrote:
> > On 11/09/2017 05:09 PM, Tycho Andersen wrote:
> >> which I guess is from the additional flags in grow_dev_page() somewhere down
> >> the stack. Anyway... it seems this is a kernel allocation that's using
> >> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
> >> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
> >>
> >> Do you have any ideas?
> > 
> > It still has to do a kmap() or kmap_atomic() to be able to access it.  I
> > thought you hooked into that.  Why isn't that path getting hit for these?
> 
> Oh, this looks to be accessing data mapped by a buffer_head.  It
> (rudely) accesses data via:
> 
> void set_bh_page(struct buffer_head *bh,
> ...
> 	bh->b_data = page_address(page) + offset;

We don't need to kmap in order to access MOVABLE allocations.  kmap is
only needed for HIGHMEM allocations.  So there's nothing wrong with ext4
or set_bh_page().
Dave Hansen Nov. 15, 2017, 7 a.m. UTC | #7
On 11/14/2017 07:44 PM, Matthew Wilcox wrote:
> On Mon, Nov 13, 2017 at 02:46:25PM -0800, Dave Hansen wrote:
>> On 11/13/2017 02:20 PM, Dave Hansen wrote:
>>> On 11/09/2017 05:09 PM, Tycho Andersen wrote:
>>>> which I guess is from the additional flags in grow_dev_page() somewhere down
>>>> the stack. Anyway... it seems this is a kernel allocation that's using
>>>> MIGRATE_MOVABLE, so perhaps we need some more fine tuned heuristic than just
>>>> all MOVABLE allocations are un-mapped via xpfo, and all the others are mapped.
>>>>
>>>> Do you have any ideas?
>>>
>>> It still has to do a kmap() or kmap_atomic() to be able to access it.  I
>>> thought you hooked into that.  Why isn't that path getting hit for these?
>>
>> Oh, this looks to be accessing data mapped by a buffer_head.  It
>> (rudely) accesses data via:
>>
>> void set_bh_page(struct buffer_head *bh,
>> ...
>> 	bh->b_data = page_address(page) + offset;
> 
> We don't need to kmap in order to access MOVABLE allocations.  kmap is
> only needed for HIGHMEM allocations.  So there's nothing wrong with ext4
> or set_bh_page().

Yeah, it's definitely not _buggy_.

Although, I do wonder what we should do about these for XPFO.  Should we
just stick a kmap() in there and comment it?  What we really need is a
mechanism to say "use this as a kernel page" and "stop using this as a
kernel page".  kmap() does that... kinda.  It's not a perfect fit, but
it's pretty close.
Matthew Wilcox Nov. 15, 2017, 2:58 p.m. UTC | #8
On Tue, Nov 14, 2017 at 11:00:20PM -0800, Dave Hansen wrote:
> On 11/14/2017 07:44 PM, Matthew Wilcox wrote:
> > We don't need to kmap in order to access MOVABLE allocations.  kmap is
> > only needed for HIGHMEM allocations.  So there's nothing wrong with ext4
> > or set_bh_page().
> 
> Yeah, it's definitely not _buggy_.
> 
> Although, I do wonder what we should do about these for XPFO.  Should we
> just stick a kmap() in there and comment it?  What we really need is a
> mechanism to say "use this as a kernel page" and "stop using this as a
> kernel page".  kmap() does that... kinda.  It's not a perfect fit, but
> it's pretty close.

It'd be kind of funny if getting XPFO working better means improving
how well Linux runs on 32-bit machines with HIGHMEM.  I think there's
always going to be interest in those -- ARM developed 36 bit physmem
before biting the bullet and going to arm64.  Maybe OpenRISC will do
that next ;-)
Tycho Andersen Nov. 15, 2017, 4:20 p.m. UTC | #9
On Wed, Nov 15, 2017 at 06:58:35AM -0800, Matthew Wilcox wrote:
> On Tue, Nov 14, 2017 at 11:00:20PM -0800, Dave Hansen wrote:
> > On 11/14/2017 07:44 PM, Matthew Wilcox wrote:
> > > We don't need to kmap in order to access MOVABLE allocations.  kmap is
> > > only needed for HIGHMEM allocations.  So there's nothing wrong with ext4
> > > or set_bh_page().
> > 
> > Yeah, it's definitely not _buggy_.
> > 
> > Although, I do wonder what we should do about these for XPFO.  Should we
> > just stick a kmap() in there and comment it?  What we really need is a
> > mechanism to say "use this as a kernel page" and "stop using this as a
> > kernel page".  kmap() does that... kinda.  It's not a perfect fit, but
> > it's pretty close.
> 
> It'd be kind of funny if getting XPFO working better means improving
> how well Linux runs on 32-bit machines with HIGHMEM.  I think there's
> always going to be interest in those -- ARM developed 36 bit physmem
> before biting the bullet and going to arm64.  Maybe OpenRISC will do
> that next ;-)

Oh, sorry, I didn't realize that this wasn't a bug. In any case, this
seems like sort of an uphill battle -- lots of places are going to do
stuff like this since it's legal, adding code to work around it just
for XPFO seems like a lot of burden on the kernel. (Of course, I'm
open to convincing :)

How common are these MOVABLE allocations that the kernel does? What if
we did some hybrid approach, where we re-map the lists based on
MOVABLE/UNMOVABLE, but then check the actual GFP flags on allocation
to see if they match what we set when populating the free list, and
re-map accordingly if they don't.

Or is there some other way?

Cheers,

Tycho
Matthew Wilcox Nov. 15, 2017, 9:34 p.m. UTC | #10
On Wed, Nov 15, 2017 at 08:20:57AM -0800, Tycho Andersen wrote:
> On Wed, Nov 15, 2017 at 06:58:35AM -0800, Matthew Wilcox wrote:
> > On Tue, Nov 14, 2017 at 11:00:20PM -0800, Dave Hansen wrote:
> > > On 11/14/2017 07:44 PM, Matthew Wilcox wrote:
> > > > We don't need to kmap in order to access MOVABLE allocations.  kmap is
> > > > only needed for HIGHMEM allocations.  So there's nothing wrong with ext4
> > > > or set_bh_page().
> > > 
> > > Yeah, it's definitely not _buggy_.
> > > 
> > > Although, I do wonder what we should do about these for XPFO.  Should we
> > > just stick a kmap() in there and comment it?  What we really need is a
> > > mechanism to say "use this as a kernel page" and "stop using this as a
> > > kernel page".  kmap() does that... kinda.  It's not a perfect fit, but
> > > it's pretty close.
> > 
> > It'd be kind of funny if getting XPFO working better means improving
> > how well Linux runs on 32-bit machines with HIGHMEM.  I think there's
> > always going to be interest in those -- ARM developed 36 bit physmem
> > before biting the bullet and going to arm64.  Maybe OpenRISC will do
> > that next ;-)
> 
> Oh, sorry, I didn't realize that this wasn't a bug. In any case, this
> seems like sort of an uphill battle -- lots of places are going to do
> stuff like this since it's legal, adding code to work around it just
> for XPFO seems like a lot of burden on the kernel. (Of course, I'm
> open to convincing :)
> 
> How common are these MOVABLE allocations that the kernel does? What if
> we did some hybrid approach, where we re-map the lists based on
> MOVABLE/UNMOVABLE, but then check the actual GFP flags on allocation
> to see if they match what we set when populating the free list, and
> re-map accordingly if they don't.

The assumption is that HIGHMEM allocations aren't mapped (on 32-bit
systems) and so we always use kmap/kmap_atomic to access them.
The emphasis has been on moving the largest consumers of memory over to
HIGHMEM; we were trying to manage 64GB of memory in 1GB of kernel address
space, so the page cache was the first and obvious biggest consumer to
get booted out of the permanent mapping.  I know page tables were also
pushed into HIGHMEM.  So large chunks of the kernel use kmap() because they
don't know whether they'll be operating on highmem or lowmem, and it'll
do the right thing for either kind of memory.

I didn't think MOVABLE allocations were particularly common.  But I haven't
been paying close attention to MM development.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d9c1b486e1f..47b46ff1148a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2348,6 +2348,7 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
+		xpfo_pcp_refill(page, migratetype, order);
 	}
 
 	/*
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 080235a2f129..b381d83c6e78 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -260,3 +265,85 @@  void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 			kunmap_atomic(mapping[i]);
 }
 EXPORT_SYMBOL(xpfo_temp_unmap);
+
+void xpfo_pcp_refill(struct page *page, enum migratetype migratetype, int order)
+{
+	int i;
+	bool flush_tlb = false;
+
+	if (!static_branch_unlikely(&xpfo_initialized))
+		return;
+
+	for (i = 0; i < 1 << order; i++) {
+		struct xpfo *xpfo;
+
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo)
+			continue;
+
+		if (unlikely(!xpfo->initialized)) {
+			spin_lock_init(&xpfo->maplock);
+			atomic_set(&xpfo->mapcount, 0);
+			xpfo->initialized = true;
+		}
+
+		xpfo->trace.max_entries = 20;
+		xpfo->trace.skip = 1;
+		xpfo->trace.entries = xpfo->entries;
+		xpfo->trace.nr_entries = 0;
+		xpfo->trace2.max_entries = 20;
+		xpfo->trace2.skip = 1;
+		xpfo->trace2.entries = xpfo->entries2;
+		xpfo->trace2.nr_entries = 0;
+
+		xpfo->migratetype = migratetype;
+
+		save_stack_trace(&xpfo->trace);
+
+		if (migratetype == MIGRATE_MOVABLE) {
+			/* GFP_HIGHUSER */
+			set_kpte(page_address(page + i), page + i, __pgprot(0));
+			if (!test_and_set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+				flush_tlb = true;
+			set_bit(XPFO_PAGE_USER, &xpfo->flags);
+		} else {
+			/*
+			 * GFP_KERNEL and everything else; for now we just
+			 * leave it mapped
+			 */
+			set_kpte(page_address(page + i), page + i, PAGE_KERNEL);
+			if (test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+				flush_tlb = true;
+			clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+		}
+	}
+
+	if (flush_tlb)
+		xpfo_flush_kernel_tlb(page, order);
+}
+

But I'm getting some faults; the oops starts like this (its tail is quoted in the message above):

[    1.897311] BUG: unable to handle kernel paging request at ffff880139b75012
[    1.898244] IP: ext4_fill_super+0x2f3b/0x33c0
[    1.898827] PGD 1ea6067 
[    1.898828] P4D 1ea6067 
[    1.899170] PUD 1ea9067 
[    1.899508] PMD 119478063 
[    1.899850] PTE 139b75000
[    1.900211] 
[    1.900760] Oops: 0000 [#1] SMP
[    1.901160] Modules linked in:
[    1.901565] CPU: 3 PID: 990 Comm: exe Not tainted 4.13.0+ #85
[    1.902348] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[    1.903420] task: ffff88011ae7cb00 task.stack: ffffc9001a338000
[    1.904108] RIP: 0010:ext4_fill_super+0x2f3b/0x33c0
[    1.904649] RSP: 0018:ffffc9001a33bce0 EFLAGS: 00010246
[    1.905240] RAX: 00000000000000f0 RBX: ffff880139b75000 RCX: ffffffff81c456b8
[    1.906047] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000246
[    1.906874] RBP: ffffc9001a33bda8 R08: 0000000000000000 R09: 0000000000000183
[    1.908053] R10: ffff88011a9e0800 R11: ffffffff818493e0 R12: ffff88011a9e0800
[    1.908920] R13: ffff88011a9e6800 R14: 000000000077fefa R15: 0000000000000000
[    1.909775] FS:  00007f8169747700(0000) GS:ffff880139d80000(0000) knlGS:0000000000000000
[    1.910667] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033