Message ID | nycvar.YFH.7.76.1901051817390.16954@cbobk.fhfr.pm (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm/mincore: allow for making sys_mincore() privileged | expand |
On 5.1.2019 18:27, Jiri Kosina wrote: > From: Jiri Kosina <jkosina@suse.cz> > > There are possibilities [1] how mincore() could be used as a converyor of > a sidechannel information about pagecache metadata. > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > start returning -EPERM in case it's invoked by a process lacking > CAP_SYS_ADMIN. Haven't checked the details yet, but wouldn't it be safe if anonymous private mincore() kept working, and restrictions were applied only to page cache? > The default behavior stays "mincore() can be used by anybody" in order to > be conservative with respect to userspace behavior. What if we lied instead of returned -EPERM, to not break userspace so obviously? I guess false positive would be the safer lie? > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ > > Signed-off-by: Jiri Kosina <jkosina@suse.cz> > --- > Documentation/sysctl/vm.txt | 9 +++++++++ > kernel/sysctl.c | 8 ++++++++ > mm/mincore.c | 5 +++++ > 3 files changed, 22 insertions(+) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 187ce4f599a2..afb8635e925e 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm: > - min_free_kbytes > - min_slab_ratio > - min_unmapped_ratio > +- mincore_privileged > - mmap_min_addr > - mmap_rnd_bits > - mmap_rnd_compat_bits > @@ -485,6 +486,14 @@ files and similar are considered. > The default is 1 percent. > > ============================================================== > +mincore_privileged: > + > +mincore() could be potentially used to mount a side-channel attack against > +pagecache metadata. This sysctl provides system administrators means to > +make it available only to processess that own CAP_SYS_ADMIN capability. > + > +The default is 0, which means mincore() can be used without restrictions. > +============================================================== > > mmap_min_addr > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 1825f712e73b..f03cb07c8dd4 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max; > #ifndef CONFIG_MMU > extern int sysctl_nr_trim_pages; > #endif > +extern int sysctl_mincore_privileged; > > /* Constants used for minimum and maximum */ > #ifdef CONFIG_LOCKUP_DETECTOR > @@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = { > .extra2 = (void *)&mmap_rnd_compat_bits_max, > }, > #endif > + { > + .procname = "mincore_privileged", > + .data = &sysctl_mincore_privileged, > + .maxlen = sizeof(sysctl_mincore_privileged), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, > { } > }; > > diff --git a/mm/mincore.c b/mm/mincore.c > index 218099b5ed31..77d4928cdfaa 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -21,6 +21,8 @@ > #include <linux/uaccess.h> > #include <asm/pgtable.h> > > +int sysctl_mincore_privileged; > + > static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, > unsigned long end, struct mm_walk *walk) > { > @@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, > unsigned long pages; > unsigned char *tmp; > > + if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN)) > + return -EPERM; > + > /* Check the start address: needs to be page-aligned.. */ > if (start & ~PAGE_MASK) > return -EINVAL; >
On Sat, 5 Jan 2019, Vlastimil Babka wrote: > > There are possibilities [1] how mincore() could be used as a converyor of > > a sidechannel information about pagecache metadata. > > > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > > start returning -EPERM in case it's invoked by a process lacking > > CAP_SYS_ADMIN. > > Haven't checked the details yet, but wouldn't it be safe if anonymous private > mincore() kept working, and restrictions were applied only to page cache? I was considering that, but then I decided not to do so, as that'd make the interface even more confusing and semantics non-obvious in the 'privileged' case. > > The default behavior stays "mincore() can be used by anybody" in order to > > be conservative with respect to userspace behavior. > > What if we lied instead of returned -EPERM, to not break userspace so > obviously? I guess false positive would be the safer lie? So your proposal basically would be if (privileged && !CAP_SYS_ADMIN) if (pagecache) return false; else return do_mincore() right ? I think userspace would hate us for that semantics, but on the other hand I can sort of understand the 'mincore() is racy anyway, so what' argument, if that's what you are suggesting. But then, I have no idea what userspace is using mincore() for. https://codesearch.debian.net/search?q=mincore might provide some insight I guess (thanks Matthew).
On 5.1.2019 20:24, Jiri Kosina wrote: > On Sat, 5 Jan 2019, Vlastimil Babka wrote: > >>> There are possibilities [1] how mincore() could be used as a converyor of >>> a sidechannel information about pagecache metadata. >>> >>> Provide vm.mincore_privileged sysctl, which makes it possible to mincore() >>> start returning -EPERM in case it's invoked by a process lacking >>> CAP_SYS_ADMIN. >> >> Haven't checked the details yet, but wouldn't it be safe if anonymous private >> mincore() kept working, and restrictions were applied only to page cache? > > I was considering that, but then I decided not to do so, as that'd make > the interface even more confusing and semantics non-obvious in the > 'privileged' case. > >>> The default behavior stays "mincore() can be used by anybody" in order to >>> be conservative with respect to userspace behavior. >> >> What if we lied instead of returned -EPERM, to not break userspace so >> obviously? I guess false positive would be the safer lie? > > So your proposal basically would be > > if (privileged && !CAP_SYS_ADMIN) > if (pagecache) > return false; I was thinking about "return true" here, assuming that userspace generally wants to ensure itself there won't be page faults when it starts doing something critical, and if it sees a "false" it will try to do some kind of prefaulting, possibly in a loop. There might be somebody trying to make sure something is out of pagecache (it wants to see "false"), but can't think of anything except benchmarks? > else > return do_mincore() > > right ? > > I think userspace would hate us for that semantics, but on the other hand > I can sort of understand the 'mincore() is racy anyway, so what' argument, > if that's what you are suggesting. > > But then, I have no idea what userspace is using mincore() for. > https://codesearch.debian.net/search?q=mincore might provide some insight > I guess (thanks Matthew). >
Hi Jiri,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=c6x
All errors (new ones prefixed by >>):
>> kernel/sysctl.o:(.fardata+0x6a0): undefined reference to `sysctl_mincore_privileged'
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Sat, Jan 5, 2019 at 9:27 AM Jiri Kosina <jikos@kernel.org> wrote: > > From: Jiri Kosina <jkosina@suse.cz> > > There are possibilities [1] how mincore() could be used as a converyor of > a sidechannel information about pagecache metadata. Can we please just limit it to vma's that are either anonymous, or map a file that the user actually owns? Then the capability check could be for "override the file owner check" instead, which makes tons of sense. No new sysctl's for something like this, please. Linus
Hi Jiri,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707
config: microblaze-nommu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=microblaze
All errors (new ones prefixed by >>):
>> kernel/sysctl.o:(.data+0x67c): undefined reference to `sysctl_mincore_privileged'
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Sat, 5 Jan 2019, Linus Torvalds wrote: > > There are possibilities [1] how mincore() could be used as a converyor of > > a sidechannel information about pagecache metadata. > > Can we please just limit it to vma's that are either anonymous, or map > a file that the user actually owns? > > Then the capability check could be for "override the file owner check" > instead, which makes tons of sense. Makes sense. I am still not completely sure what to return in such cases though; we can either blatantly lie and always pretend that the pages are resident (to avoid calling process entering some prefaulting mode), or return -ENOMEM for mappings of files that don't belong to the user (in case it's not CAP_SYS_ADMIN one).
On Sat, Jan 5, 2019 at 11:46 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Can we please just limit it to vma's that are either anonymous, or map > a file that the user actually owns? .. or slightly simpler: a file that the user opened for writing. IOW, some (TOTALLY UNTESTED!) patch like this? Linus mm/mincore.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..61e38895fb02 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || capable(CAP_SYS_ADMIN); +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +196,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err;
[ Crossed emails ] On Sat, Jan 5, 2019 at 12:12 PM Jiri Kosina <jikos@kernel.org> wrote: > > I am still not completely sure what to return in such cases though; we can > either blatantly lie and always pretend that the pages are resident That's what my untested patch did. Or maybe just claim they are all not present? And again, that patch was entirely untested, so it may be garbage and have some fundamental problem. I also don't know exactly what rule might make most sense, but "you can write to the file" certainly to me implies that you also could know what parts of it are in-core. Who actually _uses_ mincore()? That's probably the best guide to what we should do. Maybe they open the file read-only even if they are the owner, and we really should look at file ownership instead. I tried to make that "can_do_mincore()" function easy to understand and easy to just modify to some sane state. Again, my patch is meant as a "perhaps something like this?" rather than some "this is _exactly_ how it must be done". Take the patch as a quick suggestion, not some final answer. Linus
On Sat, 5 Jan 2019, Linus Torvalds wrote: > > I am still not completely sure what to return in such cases though; we can > > either blatantly lie and always pretend that the pages are resident > > That's what my untested patch did. Or maybe just claim they are all > not present? Thinking about it a little bit more, I believe Vlastimil has a good point with 'non present' potentially causing more bogus activity in userspace in response (in an effort to actually make them present, and failing indefinitely). IOW, I think it's a reasonable expectation that the common scenario is "check if it's present, and if not, try to fault it in" instead of "check if it's present, and if it is, try to evict it". > And again, that patch was entirely untested, so it may be garbage and > have some fundamental problem. I will be travelling for next ~24 hours, but I have just asked our QA guys to run it through some basic battery of testing (which will probably happen on monday anyway). > I also don't know exactly what rule might make most sense, but "you can > write to the file" certainly to me implies that you also could know what > parts of it are in-core. I think it's reasonable; I can't really imagine any sidechannel to a global state be possibly mounted on valid R/W mappings. I'd guess that probably the most interesting here are the code segments of shared libraries, allowing to trace victim's execution. > Who actually _uses_ mincore()? That's probably the best guide to what > we should do. Maybe they open the file read-only even if they are the > owner, and we really should look at file ownership instead. Yeah, well https://codesearch.debian.net/search?q=mincore is a bit too much mess to get some idea quickly I am afraid.
On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote: > > > Who actually _uses_ mincore()? That's probably the best guide to what > > we should do. Maybe they open the file read-only even if they are the > > owner, and we really should look at file ownership instead. > > Yeah, well > > https://codesearch.debian.net/search?q=mincore > > is a bit too much mess to get some idea quickly I am afraid. Yeah, heh. And the first hit is 'fincore', which probably nobody cares about anyway, but it does fd = open (name, O_RDONLY) .. mmap(window, len, PROT_NONE, MAP_PRIVATE, .. so if we want to keep that working, we'd really need to actually check file ownership rather than just looking at f_mode. But I don't know if anybody *uses* and cares about fincore, and it's particularly questionable for non-root users. And the Android go runtime code seems to oddly use mincore to figure out page size: // try using mincore to detect the physical page size. // mincore should return EINVAL when address is not a multiple of system page size. which is all kinds of odd, but whatever.. Why mincore, rather than something sane and obvious like mmap? Don't ask me... Anyway, the Debian code search just results in mostly non-present stuff. It's sad that google code search is no more. It was great for exactly these kinds of questions. The mono runtime seems to have some mono_pages_not_faulted() function, but I don't know if people use it for file mappings, and I couldn't find any interesting users of it. I didn't find anything that seems to really care, but I gave up after a few pages of really boring stuff. Linus
On Sat, Jan 5, 2019 at 6:27 PM Jiri Kosina <jikos@kernel.org> wrote: > There are possibilities [1] how mincore() could be used as a converyor of > a sidechannel information about pagecache metadata. > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > start returning -EPERM in case it's invoked by a process lacking > CAP_SYS_ADMIN. > > The default behavior stays "mincore() can be used by anybody" in order to > be conservative with respect to userspace behavior. > > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read() handler) is less problematic because it only returns data about the state of page tables, and doesn't query the address_space? In other words, it permits monitoring evictions, but non-intrusively detecting that something has been loaded into memory by another process is harder?
On Sat, Jan 5, 2019 at 2:55 PM Jann Horn <jannh@google.com> wrote: > > Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read() > handler) is less problematic because it only returns data about the > state of page tables, and doesn't query the address_space? In other > words, it permits monitoring evictions, but non-intrusively detecting > that something has been loaded into memory by another process is > harder? Yes. I think it was brought up during the original report, but to use the pagemap for this, you'd basically need to first populate all the pages, and then wait for pageout. So pagemap *does* leak very similar data, but it's much harder to use as an attack vector. That said, I think "mincore()" is just the simple one. You *can* (and this was also discussed on the security list) do things like - fault in a single page - the kernel will opportunistically fault in pages that it already has available _around_ that page. - then use pagemap (or just _timing_ - no real kernel support needed) to see if those pages are now mapped in your address space so honestly, mincore is just the "big hammer" and easy way to get at some of this data. But it's probably worth closing exactly because it's easy. There are other ways to get at the "are these pages mapped" information, but they are a lot more combersome to use. Side note: maybe we could just remove the "__mincore_unmapped_range()" thing entirely, and basically make mincore() do what pagemap does, which is to say "are the pages mapped in this VM". That would be nicer than my patch, simply because removing code is always nice. And arguably it's a better semantic anyway. Linus
On Sat, 5 Jan 2019, Jann Horn wrote: > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > > start returning -EPERM in case it's invoked by a process lacking > > CAP_SYS_ADMIN. > > > > The default behavior stays "mincore() can be used by anybody" in order to > > be conservative with respect to userspace behavior. > > > > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ > > Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read() > handler) is less problematic because it only returns data about the > state of page tables, and doesn't query the address_space? In other > words, it permits monitoring evictions, but non-intrusively detecting > that something has been loaded into memory by another process is > harder? So I was just about to immediately reply that we don't expose pagemap anymore due to rowhammer, but apparently that's not true any more (this behavioud was originally introduced by ab676b7d6fbf, but then changed via 1c90308e7a77 (and further adjusted for swap entries in ab6ecf247a, but I guess that's not all that interesting). Hmm. But unless it has been mapped with MAP_POPULATE (whcih is outside the attacker's control), there is no guarantee that the mappings would actually be there, right?
On Sat, Jan 5, 2019 at 3:05 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That would be nicer than my patch, simply because removing code is > always nice. And arguably it's a better semantic anyway. Yeah, I wonder why we did that thing where mincore() walks the page tables, but if they are empty it looks in the page cache. [... goes and looks in history ..] It goes back to forever, it looks like. I can't find a reason. Anyway, a removal patch would look something like the attached, I think. That makes mincore() actually say how many pages are in _this_ mapping, not how many pages could be paged in without doing IO. Hmm. Maybe we should try this first. Simplicity is always good. Again, obviously untested. Linus mm/mincore.c | 74 +++++------------------------------------------------------- 1 file changed, 6 insertions(+), 68 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..317eb64ea4ef 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -42,64 +42,12 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, return 0; } -/* - * Later we can get more picky about what "in core" means precisely. - * For now, simply check to see if the page is in the page cache, - * and is up to date; i.e. that no page-in operation would be required - * at this time if an application were to map and access this page. - */ -static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) -{ - unsigned char present = 0; - struct page *page; - - /* - * When tmpfs swaps out a page from a file, any process mapping that - * file will not get a swp_entry_t in its pte, but rather it is like - * any other file mapping (ie. marked !present and faulted in with - * tmpfs's .fault). So swapped out tmpfs mappings are tested here. - */ -#ifdef CONFIG_SWAP - if (shmem_mapping(mapping)) { - page = find_get_entry(mapping, pgoff); - /* - * shmem/tmpfs may return swap: account for swapcache - * page too. - */ - if (xa_is_value(page)) { - swp_entry_t swp = radix_to_swp_entry(page); - page = find_get_page(swap_address_space(swp), - swp_offset(swp)); - } - } else - page = find_get_page(mapping, pgoff); -#else - page = find_get_page(mapping, pgoff); -#endif - if (page) { - present = PageUptodate(page); - put_page(page); - } - - return present; -} - static int __mincore_unmapped_range(unsigned long addr, unsigned long end, struct vm_area_struct *vma, unsigned char *vec) { unsigned long nr = (end - addr) >> PAGE_SHIFT; - int i; - if (vma->vm_file) { - pgoff_t pgoff; - - pgoff = linear_page_index(vma, addr); - for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { - for (i = 0; i < nr; i++) - vec[i] = 0; - } + memset(vec, 0, nr); return nr; } @@ -144,21 +92,11 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); - if (non_swap_entry(entry)) { - /* - * migration or hwpoison entries are always - * uptodate - */ - *vec = 1; - } else { -#ifdef CONFIG_SWAP - *vec = mincore_page(swap_address_space(entry), - swp_offset(entry)); -#else - WARN_ON(1); - *vec = 1; -#endif - } + /* + * migration or hwpoison entries are always + * uptodate + */ + *vec = !!non_swap_entry(entry); } vec++; }
On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It goes back to forever, it looks like. I can't find a reason. Our man-pages talk abouit the "without doing IO" part. That may be the result of our code, though, not the reason for it. The BSD man-page has other flags, but doesn't describe what "in core" really means: MINCORE_INCORE Page is in core (resident). MINCORE_REFERENCED Page has been referenced by us. MINCORE_MODIFIED Page has been modified by us. MINCORE_REFERENCED_OTHER Page has been referenced. MINCORE_MODIFIED_OTHER Page has been modified. MINCORE_SUPER Page is part of a large (``super'') page. but the fact that it has MINCORE_MODIFIED_OTHER does obviously imply that yes, historically it really did look up the pages elsewhere, not just in the page tables. Still, maybe we can get away with just making it be about our own page tables. That would be lovely. Linus
On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It goes back to forever, it looks like. I can't find a reason. mincore() was originally added in 2.3.52pre3, it looks like. Around 2000 or so. But sadly before the BK history. And that comment about "Later we can get more picky about what "in core" means precisely." that still exists above mincore_page() goes back to the original patch. Linus
On Sat, Jan 05, 2019 at 03:39:10PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It goes back to forever, it looks like. I can't find a reason. > > mincore() was originally added in 2.3.52pre3, it looks like. Around > 2000 or so. But sadly before the BK history. > > And that comment about > > "Later we can get more picky about what "in core" means precisely." > > that still exists above mincore_page() goes back to the original patch. FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!) https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html DESCRIPTION mincore() returns the primary memory residency status of pages in the address space covered by mappings in the range [addr, addr + len). The status is returned as a char-per-page in the character array referenced by *vec (which the system assumes to be large enough to encompass all the pages in the address range). The least significant bit of each character is set to 1 to indicate that the referenced page is in pri- mary memory, 0 if it is not. The settings of other bits in each char- acter is undefined and may contain other information in the future.
On Sat, Jan 5, 2019 at 4:11 PM Matthew Wilcox <willy@infradead.org> wrote: > > FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!) > > https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html > > DESCRIPTION > mincore() returns the primary memory residency status of pages in the > address space covered by mappings in the range [addr, addr + len). It's still not clear that "primary memory residency status" actually means. Does it mean "mapped", or does it mean "exists in caches and doesn't need IO". I don't even know what kind of caches SunOS 4.1.3 had. The Linux implementation depends on the page cache, and wouldn't work (at least not very well) in a system that has a traditional disk buffer cache. Anyway, I guess it's mostly moot. From a "does this cause regressions" standpoint, the only thing that matters is really whatever Linux programs that have used this since it was introduced 18+ years ago. But I think my patch to just rip out all that page lookup, and just base it on the page table state has the fundamental advantage that it gets rid of code. Maybe I should jst commit it, and see if anything breaks? We do have options in case things break, and then we'd at least know who cares (and perhaps a lot more information of _why_ they care). Linus
On Sat, Jan 5, 2019 at 4:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But I think my patch to just rip out all that page lookup, and just > base it on the page table state has the fundamental advantage that it > gets rid of code. Maybe I should jst commit it, and see if anything > breaks? We do have options in case things break, and then we'd at > least know who cares (and perhaps a lot more information of _why_ they > care). Slightly updated patch in case somebody wants to try things out. Also, any comments about the pmd_trans_unstable() case? Linus mm/mincore.c | 94 +++++++++--------------------------------------------------- 1 file changed, 13 insertions(+), 81 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..f0f91461a9f4 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -42,72 +42,14 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, return 0; } -/* - * Later we can get more picky about what "in core" means precisely. - * For now, simply check to see if the page is in the page cache, - * and is up to date; i.e. that no page-in operation would be required - * at this time if an application were to map and access this page. - */ -static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) -{ - unsigned char present = 0; - struct page *page; - - /* - * When tmpfs swaps out a page from a file, any process mapping that - * file will not get a swp_entry_t in its pte, but rather it is like - * any other file mapping (ie. marked !present and faulted in with - * tmpfs's .fault). So swapped out tmpfs mappings are tested here. - */ -#ifdef CONFIG_SWAP - if (shmem_mapping(mapping)) { - page = find_get_entry(mapping, pgoff); - /* - * shmem/tmpfs may return swap: account for swapcache - * page too. - */ - if (xa_is_value(page)) { - swp_entry_t swp = radix_to_swp_entry(page); - page = find_get_page(swap_address_space(swp), - swp_offset(swp)); - } - } else - page = find_get_page(mapping, pgoff); -#else - page = find_get_page(mapping, pgoff); -#endif - if (page) { - present = PageUptodate(page); - put_page(page); - } - - return present; -} - -static int __mincore_unmapped_range(unsigned long addr, unsigned long end, - struct vm_area_struct *vma, unsigned char *vec) -{ - unsigned long nr = (end - addr) >> PAGE_SHIFT; - int i; - - if (vma->vm_file) { - pgoff_t pgoff; - - pgoff = linear_page_index(vma, addr); - for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { - for (i = 0; i < nr; i++) - vec[i] = 0; - } - return nr; -} - static int mincore_unmapped_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { - walk->private += __mincore_unmapped_range(addr, end, - walk->vma, walk->private); + unsigned char *vec = walk->private; + unsigned long nr = (end - addr) >> PAGE_SHIFT; + + memset(vec, 0, nr); + walk->private += nr; return 0; } @@ -127,8 +69,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } + /* We'll consider a THP page under construction to be there */ if (pmd_trans_unstable(pmd)) { - __mincore_unmapped_range(addr, end, vma, vec); + memset(vec, 1, nr); goto out; } @@ -137,28 +80,17 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t pte = *ptep; if (pte_none(pte)) - __mincore_unmapped_range(addr, addr + PAGE_SIZE, - vma, vec); + *vec = 0; else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); - if (non_swap_entry(entry)) { - /* - * migration or hwpoison entries are always - * uptodate - */ - *vec = 1; - } else { -#ifdef CONFIG_SWAP - *vec = mincore_page(swap_address_space(entry), - swp_offset(entry)); -#else - WARN_ON(1); - *vec = 1; -#endif - } + /* + * migration or hwpoison entries are always + * uptodate + */ + *vec = !!non_swap_entry(entry); } vec++; }
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote: > > > > > Who actually _uses_ mincore()? That's probably the best guide to what > > > we should do. Maybe they open the file read-only even if they are the > > > owner, and we really should look at file ownership instead. > > > > Yeah, well > > > > https://codesearch.debian.net/search?q=mincore > > > > is a bit too much mess to get some idea quickly I am afraid. > Anyway, the Debian code search just results in mostly non-present > stuff. It's sad that google code search is no more. It was great for > exactly these kinds of questions. If you select the "Group search results by Debian source package" option on the search results page it makes it a lot easier to skim through. It looks to me like Firefox is expecting mincore() not to fail on libraries that it has mapped: https://sources.debian.org/src/firefox-esr/60.4.0esr-1/mozglue/linker/BaseElf.cpp/?hl=98#L98 - Kevin > > The mono runtime seems to have some mono_pages_not_faulted() function, > but I don't know if people use it for file mappings, and I couldn't > find any interesting users of it. > > I didn't find anything that seems to really care, but I gave up after > a few pages of really boring stuff. > > Linus > >
On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Slightly updated patch in case somebody wants to try things out. I decided to just apply that patch. It is *not* marked for stable, very intentionally, because I expect that we will need to wait and see if there are issues with it, and whether we might have to do something entirely different (more like the traditional behavior with some extra "only for owner" logic). But doing a test patch during the merge window (which is about to close) sounds like the right thing to do. Linus
Linus Torvalds wrote on Sat, Jan 05, 2019: > But I think my patch to just rip out all that page lookup, and just > base it on the page table state has the fundamental advantage that it > gets rid of code. Maybe I should jst commit it, and see if anything > breaks? We do have options in case things break, and then we'd at > least know who cares (and perhaps a lot more information of _why_ they > care). There actually are many tools like fincore which depend on mincore to try to tell whether a file is "loaded in cache" or not (I personally use vmtouch[1], but I know of at least nocache[2] uses it as well to only try to evict used pages) [1] https://hoytech.com/vmtouch/ [2] https://github.com/Feh/nocache I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or prefetch/lock whole files so my "production" use-cases don't actually rely on the mincore part of them; but when playing with these actions it's actually fairly useful to be able to visualize which part of a file ended in cache or monitor how a file's content evolve in cache... There are various non-obvious behaviours where being able to poke around is enlightening (e.g. fadvise dontneed is actually a hint, so even if nothing uses the file linux sometimes keep the data around if it thinks that would be useful and nocache has a mode to call fadvise multiple times and things like that...) Anyway, I agree the use of mincore for this is rather ugly, and frankly some "cache management API" might be better in the long run if only for performance reason (don't try these tools on a hundred TB sparse file...), but until that pipe dream comes true I think mincore as it was is useful for system admins. Linus Torvalds wrote on Sun, Jan 06, 2019: > I decided to just apply that patch. It is *not* marked for stable, > very intentionally, because I expect that we will need to wait and see > if there are issues with it, and whether we might have to do something > entirely different (more like the traditional behavior with some extra > "only for owner" logic). FWIW I personally don't care much about "only for owner" or depending on mmap options; I don't understand much of the security implications honestly so I'm not sure how these limitations actually help. On the other hand, a simple CAP_SYS_ADMIN check making the call take either behaviour should be safe and would cover what I described above. (by the way, while we are discussing permissions, a regular user can use fadvise dontneed on files it doesn't own as well as long as it can open them for reading; I'm not sure if that would need restricting as well in the context of the security issue. Frankly even with mincore someone could likely tell the difference through timing, if they just do it a few times. Do magic, probe, flush out, repeat until satisfied.) Thanks,
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> It goes back to forever, it looks like. I can't find a reason. > > mincore() was originally added in 2.3.52pre3, it looks like. Around > 2000 or so. But sadly before the BK history. Yeah, it's here in the commit titled "Import 2.3.52pre3" (takes a second or two to load): https://github.com/mpe/linux-fullhistory/commit/a1bcda3256956318c95c8da8bee09f79190bb034#diff-fd2d793b8b4760b4887c8c7bbb3451d7R1730 But no further detail. (Instructions for using that tree https://github.com/mpe/linux-fullhistory/wiki) cheers
On 1/7/19 5:32 AM, Dominique Martinet wrote: > Linus Torvalds wrote on Sat, Jan 05, 2019: >> But I think my patch to just rip out all that page lookup, and just >> base it on the page table state has the fundamental advantage that it >> gets rid of code. Maybe I should jst commit it, and see if anything >> breaks? We do have options in case things break, and then we'd at >> least know who cares (and perhaps a lot more information of _why_ they >> care). > > There actually are many tools like fincore which depend on mincore to > try to tell whether a file is "loaded in cache" or not (I personally use > vmtouch[1], but I know of at least nocache[2] uses it as well to only > try to evict used pages) nocache could probably do fine without mincore. IIUC the point is to not evict anything that was already resident prior to running some command wrapped in nocache. Without the mincore checks, posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that others have mapped. That means without mincore() it will drop data that's in cache but not currently in use by anybody, which shouldn't cause large performance regressions? > [1] https://hoytech.com/vmtouch/ > [2] https://github.com/Feh/nocache > > > I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or > prefetch/lock whole files so my "production" use-cases don't actually > rely on the mincore part of them; Ah so you seem to confirm my above point. ... > FWIW I personally don't care much about "only for owner" or depending on > mmap options; I don't understand much of the security implications > honestly so I'm not sure how these limitations actually help. > On the other hand, a simple CAP_SYS_ADMIN check making the call take > either behaviour should be safe and would cover what I described above. So without CAP_SYS_ADMIN, mincore() would return mapping status, and with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy :( Maybe if we introduced mincore2() with flags similar to BSD mentioned earlier in the thread, and the cache residency flag would require CAP_SYS_ADMIN or something similar. > (by the way, while we are discussing permissions, a regular user can use > fadvise dontneed on files it doesn't own as well as long as it can open > them for reading; I'm not sure if that would need restricting as well in > the context of the security issue. Probably not, as I've mentioned it won't evict what's mapped by somebody else. And eviction is also possible via controlling LRU, which is what the paper [1] does anyway (and also mentions that DONTNEED doesn't work). Being able to evict somebody's page is AFAIU not sufficient for attack, the side channel is about knowing that somebody brought that page back to RAM by touching it. > Frankly even with mincore someone > could likely tell the difference through timing, if they just do it a > few times. Do magic, probe, flush out, repeat until satisfied.) That's my bigger concern here. In [1] there's described a remote attack (on webserver) using the page fault timing differences for present/not present page cache pages. Noisy but works, and I expect locally it to be much less noisy. Yet the countermeasures section only mentions restricting mincore() as if it was sufficient (and also how to make evictions harder, but that's secondary IMHO). [1] https://arxiv.org/abs/1901.01161 > > Thanks, >
Vlastimil Babka wrote on Mon, Jan 07, 2019: > On 1/7/19 5:32 AM, Dominique Martinet wrote: > > Linus Torvalds wrote on Sat, Jan 05, 2019: > >> But I think my patch to just rip out all that page lookup, and just > >> base it on the page table state has the fundamental advantage that it > >> gets rid of code. Maybe I should jst commit it, and see if anything > >> breaks? We do have options in case things break, and then we'd at > >> least know who cares (and perhaps a lot more information of _why_ they > >> care). > > > > There actually are many tools like fincore which depend on mincore to > > try to tell whether a file is "loaded in cache" or not (I personally use > > vmtouch[1], but I know of at least nocache[2] uses it as well to only > > try to evict used pages) > > nocache could probably do fine without mincore. IIUC the point is to not > evict anything that was already resident prior to running some command > wrapped in nocache. Without the mincore checks, > posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that > others have mapped. That means without mincore() it will drop data > that's in cache but not currently in use by anybody, which shouldn't > cause large performance regressions? Yes, I sent them a patch to allow this behaviour (which got merged ages ago); but the default nocache usage will try to preserve pages that were mapped before even if the user mapped it themselves so it's a little bit more robust that just trusting linux to do the right thing. I did various tests with fadvise dontneed and honestly the way it works is far from obvious when operating as a single user at least; I didn't test multi-users as that didn't really concern me. With the current mincore change, it will think everything was "in core" and not flush anything unless my option to just fadvise dontneed everything is passed though ; so even if we can make it work it is a change of behaviour that is breaking an existing application, and it has no way of telling it didn't work. Honestly though, as I said, mincore() is much more useful for debugging for me ; the application can be changed if required. I just pointed it out as it'll need changing, and it has no obvious way of testing at runtime if the syscall works (except dumb kernel version check, but that won't work with stable backports); so it's not that obvious. > > FWIW I personally don't care much about "only for owner" or depending on > > mmap options; I don't understand much of the security implications > > honestly so I'm not sure how these limitations actually help. > > On the other hand, a simple CAP_SYS_ADMIN check making the call take > > either behaviour should be safe and would cover what I described above. > > So without CAP_SYS_ADMIN, mincore() would return mapping status, and > with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy > :( Maybe if we introduced mincore2() with flags similar to BSD mentioned > earlier in the thread, and the cache residency flag would require > CAP_SYS_ADMIN or something similar. I agree, that's rather clumsy... Or rather might lead to some unexpected behaviours. I'm open to other ideas. I'm not sure how the BSD flags help though? > > (by the way, while we are discussing permissions, a regular user can use > > fadvise dontneed on files it doesn't own as well as long as it can open > > them for reading; I'm not sure if that would need restricting as well in > > the context of the security issue. > > Probably not, as I've mentioned it won't evict what's mapped by somebody > else. And eviction is also possible via controlling LRU, which is what > the paper [1] does anyway (and also mentions that DONTNEED doesn't > work). Being able to evict somebody's page is AFAIU not sufficient for > attack, the side channel is about knowing that somebody brought that > page back to RAM by touching it. Thanks for the link to the paper, I hadn't taken the time to extract it from the news article but it's much more interesting indeed. Some of what's been said here makes more sense to me now (checking about ownership and similar might indeed be good enough). For evicting page I agree most targets would be different users so I guess that makes sense as well; there seem to be other ways of efficiently evicting a page from cache anyway. (BTW, this was brought up earlier, the paper also mentions /proc/self/pagemap as not being useful.) > > Frankly even with mincore someone > > could likely tell the difference through timing, if they just do it a > > few times. Do magic, probe, flush out, repeat until satisfied.) (I meant "without mincore", obviously, as you understood properly) > That's my bigger concern here. In [1] there's described a remote attack > (on webserver) using the page fault timing differences for present/not > present page cache pages. Noisy but works, and I expect locally it to be > much less noisy. Yet the countermeasures section only mentions > restricting mincore() as if it was sufficient (and also how to make > evictions harder, but that's secondary IMHO). I'd suggest making clock rougher for non-root users but javascript tried that and it wasn't enough... :) Honestly won't be of much help there, good luck? Thanks,
On 1/7/19 12:08 PM, Dominique Martinet wrote: > > With the current mincore change, it will think everything was "in core" > and not flush anything unless my option to just fadvise dontneed > everything is passed though ; so even if we can make it work it is a > change of behaviour that is breaking an existing application, and it has > no way of telling it didn't work. IIUC the current change is commit 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages") which will not pretend everything is "in core", but only pages that the calling process has populated page table mapping for (which implies in core, but the opposite doesn't hold). "nocache" most certainly doesn't populate the mappings before calling mincore(), as that would bring pages to page cache and defeat the purpose of determining if they were already there prior the nocache execution. Instead it will think that nothing was "in core", and thus later call fadvise dontneed or everything, but as I've said earlier that shouldn't matter much. > Honestly though, as I said, mincore() is much more useful for debugging > for me ; the application can be changed if required. I just pointed it > out as it'll need changing, and it has no obvious way of testing at > runtime if the syscall works (except dumb kernel version check, but that > won't work with stable backports); so it's not that obvious. Agree. >>> FWIW I personally don't care much about "only for owner" or depending on >>> mmap options; I don't understand much of the security implications >>> honestly so I'm not sure how these limitations actually help. >>> On the other hand, a simple CAP_SYS_ADMIN check making the call take >>> either behaviour should be safe and would cover what I described above. >> >> So without CAP_SYS_ADMIN, mincore() would return mapping status, and >> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy >> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned >> earlier in the thread, and the cache residency flag would require >> CAP_SYS_ADMIN or something similar. > > I agree, that's rather clumsy... Or rather might lead to some unexpected > behaviours. I'm open to other ideas. > I'm not sure how the BSD flags help though? Definitely it would be a long-term solution, introducing new API, waiting for userspace to use it... and meanwhile we would have to keep the status quo or some kind of the clumsy/subtle approach. >>> (by the way, while we are discussing permissions, a regular user can use >>> fadvise dontneed on files it doesn't own as well as long as it can open >>> them for reading; I'm not sure if that would need restricting as well in >>> the context of the security issue. >> >> Probably not, as I've mentioned it won't evict what's mapped by somebody >> else. And eviction is also possible via controlling LRU, which is what >> the paper [1] does anyway (and also mentions that DONTNEED doesn't >> work). Being able to evict somebody's page is AFAIU not sufficient for >> attack, the side channel is about knowing that somebody brought that >> page back to RAM by touching it. > > Thanks for the link to the paper, I hadn't taken the time to extract it > from the news article but it's much more interesting indeed. It went public only this morning, the article was older :)
On 1/7/19 12:08 PM, Dominique Martinet wrote: >> That's my bigger concern here. In [1] there's described a remote attack >> (on webserver) using the page fault timing differences for present/not >> present page cache pages. Noisy but works, and I expect locally it to be >> much less noisy. Yet the countermeasures section only mentions >> restricting mincore() as if it was sufficient (and also how to make >> evictions harder, but that's secondary IMHO). > > I'd suggest making clock rougher for non-root users but javascript tried > that and it wasn't enough... :) > Honestly won't be of much help there, good luck? Restricting mincore() is sufficient to fix the hardware-agnostic part. If the attack is not hardware-agnostic anymore, an attacker could also just use a hardware cache attack, which has a higher temporal and spatial resolution, so there's no reason why the attacker would use page cache attacks instead then. Cheers, Daniel
On Sun, Jan 06, 2019 at 01:46:37PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Slightly updated patch in case somebody wants to try things out. > > I decided to just apply that patch. It is *not* marked for stable, > very intentionally, because I expect that we will need to wait and see > if there are issues with it, and whether we might have to do something > entirely different (more like the traditional behavior with some extra > "only for owner" logic). So, I read the paper and before I was half way through it I figured there are a bunch of other similar page cache invalidation attacks we can perform without needing mincore. i.e. Focussing on mmap() and mincore() misses the wider issues we have with global shared caches. My first thought: fd = open(some_file, O_RDONLY); iov[0].iov_base = buf; iov[0].iov_len = 1; ret = preadv2(fd, iov, 1, off, RWF_NOWAIT); switch (ret) { case -EAGAIN: /* page not resident in page cache */ break; case 1: /* page resident in page cache */ break; default: /* beyond EOF or some other error */ break; } This is "as documented" in the man page for preadv2: RWF_NOWAIT (since Linux 4.14) Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN. Currently, this flag is meaningful only for preadv2(). IOWs, we've explicitly designed interfaces to communicate whether data is "immediately accessible" or not to the application so they can make sensible decisions about IO scheduling. i.e. IO won't block the main application processing loop and so can be scheduled in the background by the app and the data processed when that IO returns. That just so happens to be exactly the same information about the page cache that mincore is making available to userspace. If we "remove" this information from the interfaces like it has been done for mincore(), it doesn't mean userspace can't get it in other ways. e.g. it now just has to time the read(2) syscall duration and infer whether the data came from the page cache or disk from the timing information. IMO, there's nothing new or novel about this page cache information leak - it was just a matter of time before some researcher put 2 and 2 together and realised that sharing the page cache across a security boundary is little different from sharing deduplicated pages across those same security boundaries. i.e. As long as we shared caches across security boundaries and userspace can control both cache invalidation and instantiation, we cannot prevent userspace from constructing these invalidate+read information exfiltration scenarios. And then there is overlayfs. Overlay is really just a way to efficiently share the page cache of the same underlying read-only directory tree across all containers on a host. i.e. we have been specifically designing our container hosting systems to share the underlying read-only page cache across all security boundaries on the host. If overlay is vulnerable to these shared page cache attacks (seems extremely likely) then we've got much bigger problems than mincore to think about.... > But doing a test patch during the merge window (which is about to > close) sounds like the right thing to do. IMO it seems like the wrong thing to do. It's just a hacky band-aid over a specific extraction method and does nothing to reduce the actual scope of the information leak. Focussing on the band-aid means you've missed all the other avenues that the same information is exposed and all the infrastructure we've build on the core concept of sharing kernel side pages across security boundaries. And that's even without considering whether the change breaks userspace. Which it does. e.g. vmtouch is fairly widely used to manage page cache instantiation for rapid bring-up and migration of guest VMs and containers. They save the hot page cache information from a running container and then using that to instantiate the page cache in new instances running the same workload so they run at full speed right from the start. This use case calls mincore() to pull the page cache information from the running container. If anyone else proposed merging a syscall implementation change that was extremely likely to break userspace you'd be shouting at them that "we don't break userspace".... Cheers, Dave.
On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote: > > > > > Who actually _uses_ mincore()? That's probably the best guide to what > > > we should do. Maybe they open the file read-only even if they are the > > > owner, and we really should look at file ownership instead. > > > > Yeah, well > > > > https://codesearch.debian.net/search?q=mincore > > > > is a bit too much mess to get some idea quickly I am afraid. > > Yeah, heh. > > And the first hit is 'fincore', which probably nobody cares about > anyway, but it does > > fd = open (name, O_RDONLY) > .. > mmap(window, len, PROT_NONE, MAP_PRIVATE, .. > > so if we want to keep that working, we'd really need to actually check > file ownership rather than just looking at f_mode. > > But I don't know if anybody *uses* and cares about fincore, and it's > particularly questionable for non-root users. > ... > I didn't find anything that seems to really care, but I gave up after > a few pages of really boring stuff. I've gone through everything in the Debian code search, and this is the stuff that seems like it would be affected at all by the current patch: util-linux Contains 'fincore' as already noted above. e2fsprogs e4defrag tries to drop pages that it caused to be loaded into the page cache, but it's not clear that this ever worked as designed anyway (it calls mincore() before ioctl(fd, EXT4_IOC_MOVE_EXT ..) but then after the sync_file_range it drops the pages that *were* in the page cache at the time of mincore()). pgfincore postgresql extension used to try to dump/restore page cache status of database backing files across reboots. It uses a fresh mapping with mincore() to try to determine the current page cache status of a file. nocache LD_PRELOAD library that tries to drop any pages that the victim program has caused to be loaded into the page cache, uses mincore on a fresh mapping to see what was resident beforehand. Also includes 'cachestats' command that's basically another 'fincore'. xfsprogs xfs_io has a 'mincore' sub-command that is roughly equivalent to 'fincore'. vmtouch vmtouch is "Portable file system cache diagnostics and control", among other things it implements 'fincore' type functionality, and one of its touted use-cases is "Preserving virtual memory profile when failing over servers". qemu qemu uses mincore() with a fresh PROT_NONE, MAP_PRIVATE mapping to implement the "x-check-cache-dropped" option. ( https://patchwork.kernel.org/patch/10395865/ ) (Everything else I could see was either looking at anonymous VMAs, its own existing mapping that it's been using for actual IO, or was just using mincore() to see if an address was part of any mapping at all). - Kevin
On 05/01/2019 20:38, Vlastimil Babka wrote: [...] > I was thinking about "return true" here, assuming that userspace generally wants > to ensure itself there won't be page faults when it starts doing something > critical, and if it sees a "false" it will try to do some kind of prefaulting, > possibly in a loop. There might be somebody trying to make sure something is out Isn't that racy by design as the pages may get flushed out after the check? Shouldn't the application use e.g. mlock()/.... to guarantee no page faults in the first place? MfG, Bernd
On Tue, 8 Jan 2019, Bernd Petrovitsch wrote: > Shouldn't the application use e.g. mlock()/.... to guarantee no page > faults in the first place? Calling mincore() on pages you've just mlock()ed is sort of pointless though.
On 08/01/2019 12:37, Jiri Kosina wrote: > On Tue, 8 Jan 2019, Bernd Petrovitsch wrote: > >> Shouldn't the application use e.g. mlock()/.... to guarantee no page >> faults in the first place? > > Calling mincore() on pages you've just mlock()ed is sort of pointless > though. Obviously;-) Sorry for being unclear above: If I want my application to avoid suffering from page faults, I use simply mlock() (and/or friends) to nail the relevant pages into physical RAM and not "look if they are out, if yes, get them in" which has also the risk that these important pages are too soon evicted again. But perhaps I'm missing some very special use cases .... MfG, Brend
On Tue, Jan 08, 2019 at 02:53:00PM +0100, Bernd Petrovitsch wrote: > On 08/01/2019 12:37, Jiri Kosina wrote: > > On Tue, 8 Jan 2019, Bernd Petrovitsch wrote: > > > >> Shouldn't the application use e.g. mlock()/.... to guarantee no page > >> faults in the first place? > > > > Calling mincore() on pages you've just mlock()ed is sort of pointless > > though. > > Obviously;-) > > Sorry for being unclear above: If I want my application to > avoid suffering from page faults, I use simply mlock() > (and/or friends) to nail the relevant pages into physical > RAM and not "look if they are out, if yes, get them in" which > has also the risk that these important pages are too soon > evicted again. Note, that mlock() doesn't prevent minor page faults. Mlocked memory is still subject to mechanisms that makes the page temporary unmapped. For instance migration (including NUMA balancing), compaction, khugepaged...
On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <david@fromorbit.com> wrote: > > So, I read the paper and before I was half way through it I figured > there are a bunch of other similar page cache invalidation attacks > we can perform without needing mincore. i.e. Focussing on mmap() and > mincore() misses the wider issues we have with global shared caches. Oh, agreed, and that was discussed in the original report too. The thing is, you can also depend on our pre-faulting of pages in the page fault handler, and use that to get the cached status of nearby pages. So do something like "fault one page, then do mincore() to see how many pages near it were mapped". See our "do_fault_around()" logic. But mincore is certainly the easiest interface, and the one that doesn't require much effort or setup. It's also the one where our old behavior was actually arguably simply stupid and actively wrong (ie "in caches" isn't even strictly speaking a valid question, since the caches in question may be invalid). So let's try to see if giving mincore() slightly more well-defined semantics actually causes any pain. I do think that the RWF_NOWAIT case might also be interesting to look at. Linus
On Tue, Jan 08, 2019 at 09:57:49AM -0800, Linus Torvalds wrote: > On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <david@fromorbit.com> wrote: > > > > So, I read the paper and before I was half way through it I figured > > there are a bunch of other similar page cache invalidation attacks > > we can perform without needing mincore. i.e. Focussing on mmap() and > > mincore() misses the wider issues we have with global shared caches. > > Oh, agreed, and that was discussed in the original report too. > > The thing is, you can also depend on our pre-faulting of pages in the > page fault handler, and use that to get the cached status of nearby > pages. So do something like "fault one page, then do mincore() to see > how many pages near it were mapped". See our "do_fault_around()" > logic. Observing fault-around could help you detect what code an application is running, but it's not necessary (and can be turned off). Also, such an it observation is not dependent on using mincore. neither fault-around nor mincore are required functionality to exploit the information leaks. And, FWIW, fault-around actually destroys the information in the exfiltration channel described in the paper because it perturbs the carefully constructed page cache residency pattern that encodes the message. > But mincore is certainly the easiest interface, and the one that > doesn't require much effort or setup. Off the top of my head, here's a few vectors for reading the page cache residency state without perturbing the page cache residency pattern: - mincore - preadv2(RWF_NOWAIT) - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls - madvise(MADV_RANDOM); timed read of first byte in each page i.e. mincore is a messenger, but it's not the only trivial observation technique available. The only difference between mincore and the others will be the observation latency and hence channel bandwidth. IOWs, the question we need to focus on now is not "does breaking mincore affect anyone", it is "how the hell do we mitigate and isolate an information leak exposed by fundamental OS functionality that *everything* depends on for performance"? > It's also the one where our old > behavior was actually arguably simply stupid and actively wrong (ie > "in caches" isn't even strictly speaking a valid question, since the > caches in question may be invalid). This is irrelevant to the problem reported. Sure, mincore may be an awful interface, but it's semantics are not the cause of the information leak. You're just shooting the messenger... > I do think that the RWF_NOWAIT case might also be interesting to look at. As are all the other mechanisms you can use to observer page cache residency without perturbing it's state. Keep in mind that the researchers documented a remote observation technique that leaked the information across the network to a remote host, so this leak has much, much wider scope than changing mincore can address... Cheers, Dave.
On Wed, 9 Jan 2019, Dave Chinner wrote: > > But mincore is certainly the easiest interface, and the one that > > doesn't require much effort or setup. > > Off the top of my head, here's a few vectors for reading the page > cache residency state without perturbing the page cache residency > pattern: > - mincore > - preadv2(RWF_NOWAIT) > - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls > - madvise(MADV_RANDOM); timed read of first byte in each page While I obviously agree that all those are creating pagecache sidechannel in principle, I think we really should mostly focus on the first two (with mincore() already having been covered). Rationale has been provided by Daniel Gruss in this thread -- if the attacker is left with cache timing as the only available vector, he's going to be much more successful with mounting hardware cache timing attack anyway. Thanks,
On Wed, Jan 09, 2019 at 03:31:35AM +0100, Jiri Kosina wrote: > On Wed, 9 Jan 2019, Dave Chinner wrote: > > > > But mincore is certainly the easiest interface, and the one that > > > doesn't require much effort or setup. > > > > Off the top of my head, here's a few vectors for reading the page > > cache residency state without perturbing the page cache residency > > pattern: > > - mincore > > - preadv2(RWF_NOWAIT) > > - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls > > - madvise(MADV_RANDOM); timed read of first byte in each page > > While I obviously agree that all those are creating pagecache sidechannel > in principle, I think we really should mostly focus on the first two (with > mincore() already having been covered). FWIW, I just realised that the easiest, most reliable way to invalidate the page cache over a file range is simply to do a O_DIRECT read on it. IOWs, all three requirements of this information leak - highly specific, reliable cache invalidation control, controlled cache instantiation and 3rd-party detection of cache residency can all be performed with just the read(2) syscall... > Rationale has been provided by Daniel Gruss in this thread -- if the > attacker is left with cache timing as the only available vector, he's > going to be much more successful with mounting hardware cache timing > attack anyway. No, he said: "Restricting mincore() is sufficient to fix the hardware-agnostic part." That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and provides exactly the same information about the page cache as mincore. Timed read/mmap access loops for cache observation are also hardware agnostic, and on fast SSD based storage will only be marginally slower bandwidth than preadv2(RWF_NOWAIT). Attackers will pick whatever leak vector we don't fix, so we either fix them all (which I think is probably impossible without removing caching altogether) or we start thinking about how we need to isolate the page cache so that information isn't shared across important security boundaries (e.g. page cache contents are per-mount namespace). Cheers, Dave.
On Wed, 9 Jan 2019, Dave Chinner wrote: > FWIW, I just realised that the easiest, most reliable way to invalidate > the page cache over a file range is simply to do a O_DIRECT read on it. Neat, good catch indeed. Still, it's only the invalidation part, but the residency check is the crucial one. > > Rationale has been provided by Daniel Gruss in this thread -- if the > > attacker is left with cache timing as the only available vector, he's > > going to be much more successful with mounting hardware cache timing > > attack anyway. > > No, he said: > > "Restricting mincore() is sufficient to fix the hardware-agnostic > part." > > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and > provides exactly the same information about the page cache as mincore. Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has "just" been overlooked. I can't speak for Daniel, but I believe he might be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT is sufficient ...". > Timed read/mmap access loops for cache observation are also hardware > agnostic, and on fast SSD based storage will only be marginally slower > bandwidth than preadv2(RWF_NOWAIT). > > Attackers will pick whatever leak vector we don't fix, so we either fix > them all (which I think is probably impossible without removing caching > altogether) We can't really fix the fact that it's possible to do the timing on the HW caches though. > or we start thinking about how we need to isolate the page cache so that > information isn't shared across important security boundaries (e.g. page > cache contents are per-mount namespace). Umm, sorry for being dense, but how would that help that particular attack scenario on a system that doesn't really employ any namespacing? (which I still believe is a majority of the systems out there, but I might have just missed the containers train long time ago :) ).
On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <david@fromorbit.com> wrote: > > FWIW, I just realised that the easiest, most reliable way to > invalidate the page cache over a file range is simply to do a > O_DIRECT read on it. If that's the case, that's actually an O_DIRECT bug. It should only invalidate the caches on write. On reads, it wants to either _flush_ any direct caches before the read, or just take the data from the caches. At no point is "invalidate" a valid model. Of course, I'm not in the least bit shocked if O_DIRECT is buggy like this. But looking at least at the ext4 routine, the read just does ret = filemap_write_and_wait_range(mapping, iocb->ki_pos, and I don't see any invalidation. Having read access to a file absolutely should *not* mean that you can flush caches on it. That's a write op. Any filesystem that invalidates the caches on read is utterly buggy. Can you actually point to such a thing? Let's get that fixed, because it's completely wrong regardless of this whole mincore issue. Linus
On Wed, Jan 09, 2019 at 10:25:43AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <david@fromorbit.com> wrote: > > > > FWIW, I just realised that the easiest, most reliable way to > > invalidate the page cache over a file range is simply to do a > > O_DIRECT read on it. > > If that's the case, that's actually an O_DIRECT bug. > > It should only invalidate the caches on write. Sounds nice from a theoretical POV, but reality has taught us very different lessons. FWIW, a quick check of XFS's history so you understand how long this behaviour has been around. It was introduced in the linux port in 2001 as direct IO support was being added: commit e837eac23662afae603aaaef7c94bc839c1b8f67 Author: Steve Lord <lord@sgi.com> Date: Mon Mar 5 16:47:52 2001 +0000 Add bounds checking for direct I/O, do the cache invalidation for data coherency on direct I/O. This was basically a direct port of the flush+invalidation code in the Irix direct IO path, which was introduced in 1995: > revision 1.149 > date: 1995/08/11 20:09:44; author: ajs; state: Exp; lines: +70 -2 > 280514 Adding page cache flusing calls to make direct > I/O coherent with buffered I/O. IOWs, history tells us that invalidation for direct IO reads has been done on XFS for almost 25 years. I know for certain that there have been applications out there that depend on this invalidation-on-read behaviour (another of those "reality bites" lessons) so we can't just remove it because you *think* it is a bug. i.e. we *could* remove the invalidation on read, but this we have a major behavioural change to the XFS direct IO path. This means we need to determine if we've just awoken sleeping data corruption krakens as well as determine if there are any performance regressions that result from the behavioural change. Which brings me to validation. If the recent clone/dedupe/copy_file_range() debacle has taught me anything, it's that validating a "simple" IO path mechanism is going to take months worth of machine time before we have any confidence that the change is not going to expose users to new data corruption problems. That's the difficulty here - it only takes 5 minutes to change the code, but there's months of machine time needed to determine if it's really safe to make that code change. Testing has a nasty habit of finding invalid assumptions; when those are assumptions about data coherency and integrity we can't test them on our users. And, really, this would be just another band-aid over a symptom of the information leak - it doesn't prevent users from being able to control page cache invalidation. It just removes one method, just like hacking mincore only removes one method of observing the page cache. And, like mincore(), there's every chance it impacts on userspace in a negative manner and so we need to be very careful here. > On reads, it wants to either _flush_ any direct caches before the > read, or just take the data from the caches. At no point is > "invalidate" a valid model. > > Of course, I'm not in the least bit shocked if O_DIRECT is buggy like > this. But looking at least at the ext4 routine, the read just does > > ret = filemap_write_and_wait_range(mapping, iocb->ki_pos, > > and I don't see any invalidation. I wouldn't look at ext4 as an example of a reliable, problem free direct IO implementation because, historically speaking, it's been a series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and been far worse than XFS from data integrity, performance and reliability perspectives. IMO, "because ext4" has been a poor reason for justifying anything for a long time, not the least when talking about features that didn't even originate in extN.... > Can you actually point to such a thing? Let's get that fixed, because > it's completely wrong regardless of this whole mincore issue. The biggest problem that remains today is that we have no mechanism for serialising page faults against DIO. If we leave pages cached in memory while we have a AIO+DIO read (or write!) in progress, we can dirty the page and run a buffered read before the AIO+DIO read returns. This now leaves us in the state where where the AIO+DIO read returns different (stale) data to a buffered read that has already completed because it hit the dirty page cache. i.e. we still have nasty page cache vs direct IO coherency problems, and they are largely unsolvable because of the limitations of the core kernel infrastructure architecture. Yes, you can argue that userspace is doing an insane thing, but every so often we come across coherency issues like this that are out of a user's control (e.g. backup scan vs app accesses) and we do our best to ensure that they don't cause problems given the constraints we have. Invalidating the page cache on dio reads mostly mitigates these coherency race conditions and that's why it's still there in the XFS code paths... Cheers, Dave.
On Wed, Jan 09, 2019 at 11:08:57AM +0100, Jiri Kosina wrote: > On Wed, 9 Jan 2019, Dave Chinner wrote: > > > FWIW, I just realised that the easiest, most reliable way to invalidate > > the page cache over a file range is simply to do a O_DIRECT read on it. > > Neat, good catch indeed. Still, it's only the invalidation part, but the > residency check is the crucial one. > > > > Rationale has been provided by Daniel Gruss in this thread -- if the > > > attacker is left with cache timing as the only available vector, he's > > > going to be much more successful with mounting hardware cache timing > > > attack anyway. > > > > No, he said: > > > > "Restricting mincore() is sufficient to fix the hardware-agnostic > > part." > > > > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and > > provides exactly the same information about the page cache as mincore. > > Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has > "just" been overlooked. I can't speak for Daniel, but I believe he might > be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT > is sufficient ...". Good luck with restricting RWF_NOWAIT. I eagerly await all the fstests that exercise both the existing and new behaviours to demonstrate they work correctly. > > Timed read/mmap access loops for cache observation are also hardware > > agnostic, and on fast SSD based storage will only be marginally slower > > bandwidth than preadv2(RWF_NOWAIT). > > > > Attackers will pick whatever leak vector we don't fix, so we either fix > > them all (which I think is probably impossible without removing caching > > altogether) > > We can't really fix the fact that it's possible to do the timing on the HW > caches though. We can't really fix the fact that it's possible to do the timing on the page cache, either. > > or we start thinking about how we need to isolate the page cache so that > > information isn't shared across important security boundaries (e.g. page > > cache contents are per-mount namespace). > > Umm, sorry for being dense, but how would that help that particular attack > scenario on a system that doesn't really employ any namespacing? What's your security boundary? The "detect what code an app is running" exploit is based on invalidating and then observing how shared, non-user-owned files mapped with execute privileges change cache residency. If the security boundary is within the local container, should users inside that container be allowed to invalidate the cache of executable files and libraries they don't own? In this case, we can't stop observation, because that only require read permissions and high precision timing, hence the only thing that can be done here is prevent non-owners from invalidating the page cache. If the security boundary is a namespace or guest VM, then permission checks don't work - the user may own the file within that container. This problem now is that the page cache is observable and controllable from both sides of the fence. Hence the only way to prevent observation of the code being run in a different namespace is to prevent the page being shared across both containers. The exfiltration exploit requires the page cache to be observable and controllable on both sides of the security boundary. Should users be able to observe and control the cached pages accessed by a different container? KSM page deduplication lessons say no. This is an even harder problem, because page cache residency can be observed from remote machines.... What scares me is that new features being proposed could make our exposure a whole lot worse. e.g. the recent virtio-pmem ("fake-dax") proposal will directly share host page cache pages into guest VMs w/ DAX capability. i.e. the guest directly accesses the host page cache. This opens up the potential for host page cache timing attacks from the guest VMs, and potential guest to guest observation/exploitation is possible if the same files are mapped into multiple guests.... IOws the two questions here are simply: "What's your security boundary?" and "Is the page cache visible and controllable on both sides?". Cheers, Dave.
On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote: > > I wouldn't look at ext4 as an example of a reliable, problem free > direct IO implementation because, historically speaking, it's been a > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and > been far worse than XFS from data integrity, performance and > reliability perspectives. That's some big words from somebody who just admitted to much worse hacks. Seriously. XFS is buggy in this regard, ext4 apparently isn't. Thinking that it's better to just invalidate the cache for direct IO reads is all kinds of odd. Linus
On Wed, Jan 9, 2019 at 5:18 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote: > > > > I wouldn't look at ext4 as an example of a reliable, problem free > > direct IO implementation because, historically speaking, it's been a > > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and > > been far worse than XFS from data integrity, performance and > > reliability perspectives. > > That's some big words from somebody who just admitted to much worse hacks. > > Seriously. XFS is buggy in this regard, ext4 apparently isn't. > > Thinking that it's better to just invalidate the cache for direct IO > reads is all kinds of odd. > This whole discussion seems to have gone a little bit off the rails... Linus, I think I agree with Dave's overall sentiment, though, and I think you should consider reverting your patch. Here's why. The basic idea behind the attack is that the authors found efficient ways to do two things: evict a page from page cache and detect, *without filling the cache*, whether a page is cached. The combination lets an attacker efficiently tell when another process reads a page. We need to keep in mind that this attack is a sophisticated attack, and anyone using it won't have any problem using a nontrivial way to detect whether a page is in page cache. So, unless we're going to try for real to make it hard to tell whether a page is cached without causing that page to become cached, it's not worth playing whack-a-mole. And, right now, mincore is whacking a mole. RWF_NOWAIT appears to do essentially the same thing at very little cost. I haven't really dug in, but I assume that various prefaulting tricks combined with various pagetable probing tricks can do similar things, but that's at least a *lot* more complicated. So unless we're going to lock down RWF_NOWAIT as well, I see no reason to lock down mincore(). Direct IO is a red herring -- O_DIRECT is destructive enough that it seems likely to make the attacks a lot less efficient. --- begin digression --- Since direct IO has been brought up, I have a question. I've wondered for years why direct IO works the way it does. If I were implementing it from scratch, my first inclination would be to use the page cache instead of fighting it. To do a single-page direct read, I would look that page up in the page cache (i.e. i_pages these days). If the page is there, I would do a normal buffered read. If the page is not there, I would insert a record into i_pages indicating that direct IO is in progress and then I would do the IO into the destination page. If any other read, direct or otherwise, sees a record saying "under direct IO", it would wait. To do a single-page direct write, I would look it up in i_pages. If it's there, I would do a buffered write followed by a sync (because applications expect a sync). If it's not there, I would again add a record saying "under direct IO" and do the IO. The idea is that this works as much like buffered IO as possible, except that the pages backing the IO aren't normal sharable page cache pages. The multi-page case would be just an optimization on top of the single-page case. The idea would be to somehow mark i_pages with entire extents under direct IO. It's a radix tree -- this can, at least in theory, be done efficiently. As long as all direct IO operations run in increasing order of offset, there shouldn't be lock ordering problems. Other than history and possibly performance, is there any reason that direct IO doesn't work this way? P.S. What, if anything, prevents direct writes from causing trouble when the underlying FS or backing store needs stable pages? Similarly, what, if anything, prevents direct reads from temporarily exposing unintended data to user code if the fs or underlying device transforms the data during the read process (e.g. by decrypting something)? --- end digression ---
On Wed, Jan 09, 2019 at 05:18:21PM -0800, Linus Torvalds wrote: > On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote: > > > > I wouldn't look at ext4 as an example of a reliable, problem free > > direct IO implementation because, historically speaking, it's been a > > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and > > been far worse than XFS from data integrity, performance and > > reliability perspectives. > > That's some big words from somebody who just admitted to much worse hacks. Sorry, what hacks did I just admit to making? This O_DIRECT behaviour long predates me - I'm just the messenger and you are shooting from the hip. Linus, the point I was making is that there are many, many ways to control page cache invalidation and measure page cache residency, and that trying to address them one-by-one is just a game of whack-a-mole. In future, can you please try not to go off the rails when someone mentions O_DIRECT? You have a terrible habit of going off on misdirected rants about O_DIRECT and/or XFS at any opportunity you can get, and all it does is derail whatever useful conversation was taking place. > Seriously. XFS is buggy in this regard, ext4 apparently isn't. So you keep asserting despite being presented with evidence that it mitigates other longstanding bugs that are really hard to solve. Ignoring all the evidence you've been presented with and re-asserting your original statement doesn't make it correct. Did you not think to ask "what are those problems, and what can do to solve them so we can remove the invalidation mitigations that XFS uses?". That would be a useful contribution, whereas shouting about how O_DIRECT is broken just pisses off the people working their asses off to fix the problems you just heard about and are ranting about. > Thinking that it's better to just invalidate the cache for direct IO > reads is all kinds of odd. No, it's practicality. If we can't fix the problem, we have to mitigate it. When we fix the underlying problem we can remove the mitigation code. having you assert that it's broken and demand that it be removed doesn't change the fact that we haven't fixed the underlying problems. It's being worked on, but we're not there yet. -Dave.
On Thu, 10 Jan 2019, Dave Chinner wrote: > > Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has > > "just" been overlooked. I can't speak for Daniel, but I believe he might > > be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT > > is sufficient ...". > > Good luck with restricting RWF_NOWAIT. I eagerly await all the > fstests that exercise both the existing and new behaviours to > demonstrate they work correctly. Well, we can still resurrect my original aproach of doing this opt-in based on a sysctl setting, and letting the admin choose his poison. If 'secure' mode is selected, RWF_NOWAIT will then probably just always fail wit EAGAIN.
On Wed, Jan 9, 2019 at 11:04 PM Dave Chinner <david@fromorbit.com> wrote: > > Sorry, what hacks did I just admit to making? This O_DIRECT > behaviour long predates me - I'm just the messenger and you are > shooting from the hip. Sure, sorry. I find this whole thing annoying. > Linus, the point I was making is that there are many, many ways to > control page cache invalidation and measure page cache residency, > and that trying to address them one-by-one is just a game of > whack-a-mole. .. and I agree. But let's a step back.Because there are different issues. First off, the whole page cache attack is not necessarily something many people will care about. As has been pointed out, it's often a matter of convenience and (relative) portability. And no, we're *never* going to stop all side channel leaks. Some parts of caching (notably the timing effects of it) are pretty fundamental. So at no point is this going to be some kind of absolute line in the sand _anyway_. There is no black-and-white "you're protected", there's only levels of convenience. A remote attacker is hopefully going to be limited by the interfaces to just timing attacks, although who knows what something like JS might expose. Presumably neither mincore() nor arbitrary O_DIRECT or pread2() flags. Anyway, the reason I was trying to plug mincore() is largely that that code didn't make much sense to begin with, and simply this: mm/mincore.c | 94 +++++++++--------------------------------------------------- 1 file changed, 13 insertions(+), 81 deletions(-) if we can make people happier by removing lines of code and making the semantics more clear anyway, it's worth trying. No? Is that everything? No. As mentioned, you'll never get to that "ok, we plugged everything" point anyway. But removing a fairly easy way to probe the cache that has no real upsides should be fairly non-controversial. But I do have to say that in many ways the page cache is *not* a great attack vector because there's often lots of it, and it's fairly hard to control. Once something is in the page cache for whatever reason, it tends to be pretty sticky, and flushing it tends to be fairly hard to predict. And a cheap and residency (whether a simple probe like mincore of or a NOWAIT flag) check is actually important just to try to control the flushing part. Brute-forcing the flushing is generally very expensive, but if you can't even see if you flushed it, it's way more so. If there's a way to control the cache residency directly, that's actually a much bigger hole than any residency check ever were. Because once you can flush caches by reading, at that point you can just flush a particular page and look at the IO stats for the root partition or something. No residency check even needed. So I do think that yes, as long as you can do a directed cache flush, mincore is *entirely* immaterial. Still, giving mincore clearer semantics and simpler code? Win-win. (Except, of course, if somebody actually notices outside of tests. Which may well happen and just force us to revert that commit. But that's a separate issue entirely). But I do think that we should strive to *never* invalidate caches on read accesses. I don't actually see where you are doing that, honestly: at least dio_complete() only does it for writes. So I'm actually hoping that you are mis-remembering this and it turns out that O_DIRECT reads don't invalidate caches. Linus
Linus Torvalds wrote on Thu, Jan 10, 2019: > (Except, of course, if somebody actually notices outside of tests. > Which may well happen and just force us to revert that commit. But > that's a separate issue entirely). Both Dave and I pointed at a couple of utilities that break with this. nocache can arguably work with the new behaviour but will behave differently; vmtouch on the other hand is no longer able to display what's in cache or not - people use that for example to "warm up" a container in page cache based on how it appears after it had been running for a while is a pretty valid usecase to me. From the list Kevin harvested out of the debian code search, the postgresql use case is pretty similar - probe what pages of the database were in cache at shutdown so when you restart it you can preload these and reach "cruse speed" faster. Sure that's probably not billions of users but this all looks fairly valid to me...
On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote: > Since direct IO has been brought up, I have a question. I've wondered > for years why direct IO works the way it does. If I were implementing > it from scratch, my first inclination would be to use the page cache > instead of fighting it. To do a single-page direct read, I would look > that page up in the page cache (i.e. i_pages these days). If the page > is there, I would do a normal buffered read. If the page is not > there, I would insert a record into i_pages indicating that direct IO > is in progress and then I would do the IO into the destination page. > If any other read, direct or otherwise, sees a record saying "under > direct IO", it would wait. OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out that what you want to do here is great for the mixed case, but it's pretty inefficient for IOs to files which are wholly uncached. So what I'm currently thinking about is an rwsem which works like this: O_DIRECT task: if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO, drop rwsem. if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index). buffered IO: if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem. if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on waitqueue. otherwise proceed as now.
On Thu, Jan 10, 2019 at 11:44:24AM +1100, Dave Chinner wrote: > And, really, this would be just another band-aid over a symptom of > the information leak - it doesn't prevent users from being able to > control page cache invalidation. It just removes one method, just > like hacking mincore only removes one method of observing the page > cache. And, like mincore(), there's every chance it impacts on > userspace in a negative manner and so we need to be very careful > here. Putting the mincore() / cache timing information leak aside though, the current behaviour of XFS means that an attacker can screw up the performance of random applications just by repeatedly doing O_DIRECT reads of libc.so. Maybe O_DIRECT reads should be forbidden from files on XFS unless you also have write access to them? (eg owner).
On Thu, Jan 10, 2019 at 06:47:11AM -0800, Matthew Wilcox wrote: > On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote: > > Since direct IO has been brought up, I have a question. I've wondered > > for years why direct IO works the way it does. If I were implementing > > it from scratch, my first inclination would be to use the page cache > > instead of fighting it. To do a single-page direct read, I would look > > that page up in the page cache (i.e. i_pages these days). If the page > > is there, I would do a normal buffered read. If the page is not > > there, I would insert a record into i_pages indicating that direct IO > > is in progress and then I would do the IO into the destination page. > > If any other read, direct or otherwise, sees a record saying "under > > direct IO", it would wait. > > OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out > that what you want to do here is great for the mixed case, but it's > pretty inefficient for IOs to files which are wholly uncached. > > So what I'm currently thinking about is an rwsem which works like this: > > O_DIRECT task: > if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO, > drop rwsem. GUP does page fault on user buffer which is a mmapped region of same file. page fault sets up for buffered IO, tries to take rwsem for write, deadlocks. Most of the schemes we come up with fall down at this point - you can't hold a lock over gup that is also used in the buffered IO path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK for truncation serialisation - we can't lock out both read()/write() and mmap IO paths with the same lock... > if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index). I assume you really mean add a tag to the entry? But this means there is no record ofthe direct IO being in flight except for the rwsem being held across the IO. Even if we did insert a flag to say "DIO in progress" and not rely on the lock.... > buffered IO: > if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem. > if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on > waitqueue. otherwise proceed as now. ... we'll sleep on that flags in the page fault and deadlock anyway. I'm pretty sure we explored this "record DIO state in the radix tree" 2 or 3 years ago and came to the conclusion that it didn't work for reasons like the above. i.e. it doesn't solve the problems we currently have with locking and serialisation between DIO and mmap... Cheers, Dave.
On Thu, Jan 10, 2019 at 1:44 PM Dave Chinner <david@fromorbit.com> wrote: > > GUP does page fault on user buffer which is a mmapped region of same > file. page fault sets up for buffered IO, tries to take rwsem for > write, deadlocks. > > Most of the schemes we come up with fall down at this point - you > can't hold a lock over gup that is also used in the buffered IO > path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK > for truncation serialisation - we can't lock out both read()/write() > and mmap IO paths with the same lock... Side note: a somewhat similar version of is true even in the absence of GUP and dio, for the case of doing a mmap of a file, and then reading or writing from the mapped region into the file itself. There are "interesting" locking scenarios wrt just holding the page locked, and trying to then fill that page with information with just a regular "copy_from_user()". Page fault -> try to read the file -> oops, the page we're trying to read from is locked because we're trying to write to it. So we have that odd dance in generic_perform_write() which does - touch the first user byte without holding any lock - do write_begin() (which gets the page lock) - copy from user space using the "atomic" copy (which just gives up on fault) - if nothing got copied, go back and try again with a smaller copy that can't cross a page. We might have raced with pageout. It might be possible to do something similar for direct IO, although simpler: just do the GUP entirely atomically (and in the fault case just fall back to non-direct IO). Linus
On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet <asmadeus@codewreck.org> wrote: > > Linus Torvalds wrote on Thu, Jan 10, 2019: > > (Except, of course, if somebody actually notices outside of tests. > > Which may well happen and just force us to revert that commit. But > > that's a separate issue entirely). > > Both Dave and I pointed at a couple of utilities that break with > this. nocache can arguably work with the new behaviour but will behave > differently; vmtouch on the other hand is no longer able to display > what's in cache or not - people use that for example to "warm up" a > container in page cache based on how it appears after it had been > running for a while is a pretty valid usecase to me. So honestly, the main reason I'm loath to revert is that yes, we know of theoretical differences, but they seem to all be performance-related. It would be really good to hear numbers. Is the warm-up optimization something that changes things from 3ms to 3.5ms? Or does it change things from 3ms to half a second? Because Dave is absolutely correct that mincore() isn't really even all that interesting an information leak if you can do the same with RWF_NOWAIT. But the other side of that same coin is that if we're not able to block mincore() sanely, then there's no point at looking at RWF_NOWAIT either. And we *can* do sane things about RWF_NOWAIT. For example, we could start async IO on RWF_NOWAIT, and suddenly it would go from "probe the page cache" to "probe and fill", and be much harder to use as an attack vector.. Do we want to do that? Maybe, maybe not. But if mincore() can't be fixed, there's no point in even trying. Now, if the mincore() change results in a big performance hit for people who use it as a heuristic for filling caches etc, then reverting the trial balloon is obviously something we must do, but at that point I'd also like to know which load it was that cared so much, and just what it did. Because we did have an alternate patch that just said "was the file writably opened, then we can do the page cache probing". But at least one user (fincore) didn't do even that. So right now, I consider the mincore change to be a "try to probe the state of mincore users", and we haven't really gotten a lot of information back yet. Linus
On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote: > Since direct IO has been brought up, I have a question. I've wondered > for years why direct IO works the way it does. If I were implementing > it from scratch, my first inclination would be to use the page cache > instead of fighting it. To do a single-page direct read, I would look > that page up in the page cache (i.e. i_pages these days). If the page > is there, I would do a normal buffered read. If the page is not Therein lies the problem. Copying data is prohibitively expensive, and that's the primary reason for O_DIRECT existing. i.e. O_DIRECT is a low-overhead, zero-copy data movement interface. The moment we switch from using CPU to dispatch IO to copying data, performance goes down because we will be unable to keep storage pipelines full. IOWs, any rework of O_DIRECT that involves copying data is a non-starter. But let's bring this back to the issue at hand - observability of page cache residency of file pages. If th epage is caceh resident, then it will have a latency of copying that data out of the page (i.e. very low latency). If the page is not resident, then it will do IO and take much, much longer to complete. i.e. we have clear timing differences between cachce hit and cache miss IO. This is exactly the timing information needed for observing page cache residency. We need to work out how to make page cache residency less observable, not add new, near perfect observation mechanisms that third parties can easily exploit... Cheers, Dave.
On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: > And we *can* do sane things about RWF_NOWAIT. For example, we could > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the > page cache" to "probe and fill", and be much harder to use as an > attack vector.. We can only do that if the application submits the read via AIO and has an async IO completion reporting mechanism. Otherwise we have to wait for the IO to complete to copy the data into the user's buffer. And given that the app is using RWF_NOWAIT to explicitly avoid blocking on the IO.... Also, keep in mind that RWF_NOWAIT also prevents blocking on filesystem locks and full request queues. One of the prime drivers of RWF_NOWAIT was to prevent AIO submission from blocking on filesystem locks - it allows userspace to submit other IO while it can't get all the access it requires to a single file or a single block device is congested. Hence I don't think there's a such a simple answer here - blocking for IO breaks RWF_NOWAIT. Cheers, Dave.
On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: > > And we *can* do sane things about RWF_NOWAIT. For example, we could > > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the > > page cache" to "probe and fill", and be much harder to use as an > > attack vector.. > > We can only do that if the application submits the read via AIO and > has an async IO completion reporting mechanism. Oh, no, you misunderstand. RWF_NOWAIT has a lot of situations where it will potentially return early (the DAX and direct IO ones have their own), but I was thinking of the one in generic_file_buffered_read(), which triggers when you don't find a page mapping. That looks like the obvious "probe page cache" case. But we could literally move that test down just a few lines. Let it start read-ahead. .. and then it will actually trigger on the *second* case instead, where we have if (!PageUptodate(page)) { if (iocb->ki_flags & IOCB_NOWAIT) { put_page(page); goto would_block; } and that's where RWF_MNOWAIT would act. It would still return EAGAIN. But it would have started filling the page cache. So now the act of probing would fill the page cache, and the attacker would be left high and dry - the fact that the page cache now exists is because of the attack, not because of whatever it was trying to measure. See? But obviously this kind of change only matters if we also have mincore() not returning the probe data. mincore() obviously can't do the same kind of read-ahead to defeat things. Linus
On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote: > On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: > > > And we *can* do sane things about RWF_NOWAIT. For example, we could > > > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the > > > page cache" to "probe and fill", and be much harder to use as an > > > attack vector.. > > > > We can only do that if the application submits the read via AIO and > > has an async IO completion reporting mechanism. > > Oh, no, you misunderstand. > > RWF_NOWAIT has a lot of situations where it will potentially return > early (the DAX and direct IO ones have their own), but I was thinking > of the one in generic_file_buffered_read(), which triggers when you > don't find a page mapping. That looks like the obvious "probe page > cache" case. > > But we could literally move that test down just a few lines. Let it > start read-ahead. > > .. and then it will actually trigger on the *second* case instead, where we have > > if (!PageUptodate(page)) { > if (iocb->ki_flags & IOCB_NOWAIT) { > put_page(page); > goto would_block; > } > > and that's where RWF_MNOWAIT would act. > > It would still return EAGAIN. > > But it would have started filling the page cache. So now the act of > probing would fill the page cache, and the attacker would be left high > and dry - the fact that the page cache now exists is because of the > attack, not because of whatever it was trying to measure. > > See? Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in page_cache_sync_readahead(): /* be dumb */ if (filp && (filp->f_mode & FMODE_RANDOM)) { force_page_cache_readahead(mapping, filp, offset, req_size); return; } So it will only read the single page we tried to access and won't perturb the rest of the message encoded into subsequent pages in file. Cheers, Dave.
> On Jan 10, 2019, at 8:04 PM, Dave Chinner <david@fromorbit.com> wrote: > >> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote: >>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote: >>> >>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: >>>> And we *can* do sane things about RWF_NOWAIT. For example, we could >>>> start async IO on RWF_NOWAIT, and suddenly it would go from "probe the >>>> page cache" to "probe and fill", and be much harder to use as an >>>> attack vector.. >>> >>> We can only do that if the application submits the read via AIO and >>> has an async IO completion reporting mechanism. >> >> Oh, no, you misunderstand. >> >> RWF_NOWAIT has a lot of situations where it will potentially return >> early (the DAX and direct IO ones have their own), but I was thinking >> of the one in generic_file_buffered_read(), which triggers when you >> don't find a page mapping. That looks like the obvious "probe page >> cache" case. >> >> But we could literally move that test down just a few lines. Let it >> start read-ahead. >> >> .. and then it will actually trigger on the *second* case instead, where we have >> >> if (!PageUptodate(page)) { >> if (iocb->ki_flags & IOCB_NOWAIT) { >> put_page(page); >> goto would_block; >> } >> >> and that's where RWF_MNOWAIT would act. >> >> It would still return EAGAIN. >> >> But it would have started filling the page cache. So now the act of >> probing would fill the page cache, and the attacker would be left high >> and dry - the fact that the page cache now exists is because of the >> attack, not because of whatever it was trying to measure. >> >> See? > > Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in > page_cache_sync_readahead(): > > /* be dumb */ > if (filp && (filp->f_mode & FMODE_RANDOM)) { > force_page_cache_readahead(mapping, filp, offset, req_size); > return; > } > > So it will only read the single page we tried to access and won't > perturb the rest of the message encoded into subsequent pages in > file. > There are two types of attacks. One is an intentional side channel where two cooperating processes communicate. This is, under some circumstances, a problem, but it’s not one we’re about to solve in general. The other is an attacker monitoring an unwilling process. I think we care a lot more about that, and Linus’ idea will help.
Linus Torvalds wrote on Thu, Jan 10, 2019: > On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet > <asmadeus@codewreck.org> wrote: > > Linus Torvalds wrote on Thu, Jan 10, 2019: > > > (Except, of course, if somebody actually notices outside of tests. > > > Which may well happen and just force us to revert that commit. But > > > that's a separate issue entirely). > > > > Both Dave and I pointed at a couple of utilities that break with > > this. nocache can arguably work with the new behaviour but will behave > > differently; vmtouch on the other hand is no longer able to display > > what's in cache or not - people use that for example to "warm up" a > > container in page cache based on how it appears after it had been > > running for a while is a pretty valid usecase to me. > > So honestly, the main reason I'm loath to revert is that yes, we know > of theoretical differences, but they seem to all be > performance-related. I don't see what other use mincore could have, yes - even the "debugging" use I gave is performance investigations and not hard problems (and I probably would go straight to perf nowadays, you'd get the info that the program doesn't use cache from the call graphs) > It would be really good to hear numbers. Is the warm-up optimization > something that changes things from 3ms to 3.5ms? Or does it change > things from 3ms to half a second? This is heavily workload and storage hardware dependant, so hard to give some absolute value. Trying with some big server, fast SSD, mysql and doing: # echo 3 > /proc/sys/vm/drop_caches # (optional) prefetch table and innodb files # systemctl restart mariadb # time mysql -q db "select * from mytable where id in $ENTRIES" > /dev/null # time mysql -q db "select * from mytable where id in $ENTRIES2" > /dev/null # time mysql -q db "select * from mytable where id in $ENTRIES3" > /dev/null (where ENTRIES* are lists of 1000 id, and id is indexed; the table is 8GB for 62590661 entries so 1000 entries is approx 128KB of data out of that file) I get on average over a few queries approximately a real time of 350ms, 230ms and 220ms immediately after drop cache and service restart, and 150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I didn't have the patience to do proper testing). (In both cases, user/sys are less than 10ms; I don't see much difference there) If I restart the service without dropping caches and redo the query I get 60ms from the first query onwards so I must not be preloading everything properly, some real script that would look all over a container to properly restore the page cache would do better than me blindly preloading a few files. Either way, we're talking about a factor of 2-3 until the application has been looking at most of the entries, and I didn't try to see how that would look like on spinning disks or the kind of slow storage one would get on VPS somewhere in the cloud - I'm sure someone with time to waste could get much more impressive figures, but this already look pretty worthwhile to me.
On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <david@fromorbit.com> wrote: > > So it will only read the single page we tried to access and won't > perturb the rest of the message encoded into subsequent pages in > file. Dave, you're being intentionally obtuse, aren't you? It's only that single page that *matters*. That's the page that the probe reveals the status of - but it's also the page that the probe then *changes* the status of. See? Think of it as "the act of measurement changes that which is measured". And that makes the measurement pointless. Linus
On Thu, Jan 10, 2019 at 8:58 PM Dominique Martinet <asmadeus@codewreck.org> wrote: > > I get on average over a few queries approximately a real time of 350ms, > 230ms and 220ms immediately after drop cache and service restart, and > 150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I > didn't have the patience to do proper testing). > (In both cases, user/sys are less than 10ms; I don't see much difference > there) But those numbers aren't about the mincore() change. That's just from dropping caches. Now, what's the difference with the mincore change, and without? Is it actually measurable? Because that's all that matters: is the mincore change something you can even notice? Is it a big regression? The fact that things are slower when they are cold in the cache isn't the issue. The issue is whether the change to mincore semantics makes any difference to real loads. Linus
On Thu, Jan 10, 2019 at 08:08:37PM -0800, Andy Lutomirski wrote: > > On Jan 10, 2019, at 8:04 PM, Dave Chinner <david@fromorbit.com> > > wrote: > > > >> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds > >> wrote: > >>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner > >>> <david@fromorbit.com> wrote: > >>> > >>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds > >>>> wrote: And we *can* do sane things about RWF_NOWAIT. For > >>>> example, we could start async IO on RWF_NOWAIT, and suddenly > >>>> it would go from "probe the page cache" to "probe and fill", > >>>> and be much harder to use as an attack vector.. > >>> > >>> We can only do that if the application submits the read via > >>> AIO and has an async IO completion reporting mechanism. > >> > >> Oh, no, you misunderstand. > >> > >> RWF_NOWAIT has a lot of situations where it will potentially > >> return early (the DAX and direct IO ones have their own), but I > >> was thinking of the one in generic_file_buffered_read(), which > >> triggers when you don't find a page mapping. That looks like > >> the obvious "probe page cache" case. > >> > >> But we could literally move that test down just a few lines. > >> Let it start read-ahead. > >> > >> .. and then it will actually trigger on the *second* case > >> instead, where we have > >> > >> if (!PageUptodate(page)) { if (iocb->ki_flags & > >> IOCB_NOWAIT) { put_page(page); goto would_block; > >> } > >> > >> and that's where RWF_MNOWAIT would act. > >> > >> It would still return EAGAIN. > >> > >> But it would have started filling the page cache. So now the > >> act of probing would fill the page cache, and the attacker > >> would be left high and dry - the fact that the page cache now > >> exists is because of the attack, not because of whatever it was > >> trying to measure. > >> > >> See? > > > > Except for fadvise(POSIX_FADV_RANDOM) which triggers this code > > in page_cache_sync_readahead(): > > > > /* be dumb */ if (filp && (filp->f_mode & FMODE_RANDOM)) > > { force_page_cache_readahead(mapping, filp, offset, > > req_size); return; } > > > > So it will only read the single page we tried to access and > > won't perturb the rest of the message encoded into subsequent > > pages in file. > > There are two types of attacks. One is an intentional side > channel where two cooperating processes communicate. This is, > under some circumstances, a problem, Yes, that's the covert communication channel that can cross container and machine boundaries without any required privileges. > but it’s not one > we’re about to solve in general. The other is an attacker > monitoring an unwilling process. Which uses exactly the same mechanisms as the first case. i.e. controlled invalidation and page cache residency monitoring.If we aren't going to solve the first problem case, the we aren't going to solve the second because they are one and the same problem... However, I suspect you have misunderstood the monitoring mechanism here - dispatch IO for this page doesn't prevent the information leak about that page. It's when we return EAGAIN that we leak information about page cache residency. What Linus is attempting to do is perturb the nearby state of the page cache by triggering async readahead in the EAGAIN case. Async readahead will fill all the holes in readahead window and hence destroy the information about where the page fault landed and instantiated the page cache. That would prevent the attacker from determining what code the executable is running as they would only be able to check a single page in an executable at a time and that makes the attack highly impractical. But if the attacker uses FADV_RANDOM, readahead is only triggered for the page the attacker is trying to read. Hence it does not disturb the nearby page cache residency pattern the executable's page faults left behind and so doesn't destroy the information that they are trying to extract from the unwilling process. Sure, Linus's change makes monitoring the executable file after FADV_RANDOM a "read-once" mechanism. However, detection of what code is executing is a repeated invalidate-and-sweep exercise to begin with, so it basically doesn't change the information or the rate at which the monitoring process can extract information from the file. /me hasn't thought about this sort of stuff since he was running page cache invalidation attacks on Irix system libraries way back in 2002 when he found a libc bug that killed the init process and paniced the kernel. This isn't my first rodeo - it's been well known for a long, long time that the system page cache can be exploited to monitor executing code... Cheers, Dave.
Linus Torvalds wrote on Thu, Jan 10, 2019: > But those numbers aren't about the mincore() change. That's just from > dropping caches. > > Now, what's the difference with the mincore change, and without? Is it > actually measurable? > > Because that's all that matters: is the mincore change something you > can even notice? Is it a big regression? > > The fact that things are slower when they are cold in the cache isn't > the issue. The issue is whether the change to mincore semantics makes > any difference to real loads. mincore itself isn't used to reload the data, but is necessary to know *what* you need to reload. If you don't know what pages are hot, how can you preload them? For small enough a database and with enough memory you can act stupid and load the whole tables in cache, that's what I did because I was lazy and only had a big mysql data set around. But the container warming up automaton Dave mentioned and postgresql db preloading with pgfincore explicitely depend on being able to tell what they need to preload. pgfincore documentation states: ---- set of PostgreSQL functions to manage blocks in memory Those functions let you know which and how many disk block from a relation are in the page cache of the operating system, and eventually write the result to a file. Then using this file, it is possible to restore the page cache state for each block of the relation. ---- If you cannot dump an arbitrary "hot state" x, you cannot restore it. This is all basically a repeat of the other subthread; sure precaching itself doesn't need mincore and if you're all-knowing the syscall isn't needed, but mere mortals need it. If it's about the commit itself, vmtouch tells me 0 page in these database files are in cache when I reboot to a 5.0-rc1 kernel and run some queries, so the difference after a fresh boot is exactly what I stated. I'm probably missing something but I'm not quite sure in what situation the "new mincore" has any use right now.
On Thu, Jan 10, 2019 at 11:08:07PM -0800, Linus Torvalds wrote: > On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <david@fromorbit.com> wrote: > > > > So it will only read the single page we tried to access and won't > > perturb the rest of the message encoded into subsequent pages in > > file. > > Dave, you're being intentionally obtuse, aren't you? > > It's only that single page that *matters*. That's the page that the > probe reveals the status of - but it's also the page that the probe > then *changes* the status of. It changes the state of it /after/ we've already got the information we need from it. It's not up to date, it has to come from disk, we return EAGAIN, which means it was not in the cache. i.e. if we return EAGAIN, we've leaked the inforation the attacker wants regardless of how the act of initiating readahead on the page change the state of the page. Yes, it raises the complexity bar a bit, and lowers the monitoring frequency somewhat, but that's about it. Cheers, Dave.
On Thu, 10 Jan 2019, Dave Chinner wrote: > Sounds nice from a theoretical POV, but reality has taught us very > different lessons. > > FWIW, a quick check of XFS's history so you understand how long this > behaviour has been around. It was introduced in the linux port in 2001 > as direct IO support was being added: > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > Author: Steve Lord <lord@sgi.com> > Date: Mon Mar 5 16:47:52 2001 +0000 > > Add bounds checking for direct I/O, do the cache invalidation for > data coherency on direct I/O. Out of curiosity, which repository is this from please? Even google doesn't seem to know about this SHA. Thanks,
On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <david@fromorbit.com> wrote: > > > It's only that single page that *matters*. That's the page that the > > probe reveals the status of - but it's also the page that the probe > > then *changes* the status of. > > It changes the state of it /after/ we've already got the information > we need from it. It's not up to date, it has to come from disk, we > return EAGAIN, which means it was not in the cache. Oh, I see the confusion. Yes, you get the information about whether something was in the cache or not, so the side channel does exist to some degree. But it's actually hugely reduced for a rather important reason: the _primary_ reason for needing to know whether some page is in the cache or not is not actually to see if it was ever accessed - it's to see that the cache has been scrubbed (and to _guide_ the scrubbing), and *when* it was accessed. Think of it this way: the buffer cache residency is actually a horribly bad signal on its own mainly because you generally have a very high hit-rate. In most normal non-streaming situations with sufficient amounts of memory you have pretty much everything cached. So in order to use it as a signal, first you have to first scrub the cache (because if the page was already there, there's no signal at all), and then for the signal to be as useful as possible, you're also going to want to try to get out more than one bit of information: you are going to try to see the patterns and the timings of how it gets filled. And that's actually quite painful. You don't know the initial cache state, and you're not (in general) controlling the machine entirely, because there's also that actual other entity that you're trying to attack and see what it does. So what you want to do is basically to first make sure the cache is scrubbed (only for the pages you're interested in!), then trigger whatever behavior you are looking for, and then look how that affected the cache. In other words, you want *multiple* residency status check - first to see what the cache state is (because you're going to want that for scrubbing), then to see that "yes, it's gone" when doing the scrubbing, and then to see the *pattern* and timings of how things are brought in. And then you're likely to want to do this over and over again, so that you can get real data out of the signal. This is why something that doesn't perturb what you measure is really important. If the act of measurement brings the page in, then you can't use it for that "did I successfully scrub it" phase at all, and you can't use it for measurement but once, so your view into patterns and timings is going to be *much* worse. And notice that this is true even if the act of measurement only affects the *one* page you're measuring. Sure, any additional noise around it would likely be annoying too, but it's not really necessary to make the attack much harder to carry out. In fact, it's almost irrelevant, since the signal you're trying to *see* is going to be affected by prefetching etc too, so the patterns and timings you need to look at are in bigger chunks than the readahead thing. So yes, you as an attacker can remove the prefetching from *your* load, but you can't remove it from the target load anyway, so you'll just have to live with it. Can you brute-force scrubbing? Yes. For something like an L1 cache, that's easy (well, QoS domains make it harder). For something like a disk cache, it's much harder, and makes any attempt to read out state a lot slower. The paper that started this all uses mincore() not just to see "is the page now scrubbed", but also to guide the scrubbing itself (working set estimation etc). And note that in many ways, the *scrubbing* is really the harder part. Populating the cache is really easy: just read the data you want to populate. So if you are looking for a particular signal, say "did this error case trigger so that it faulted in *that* piece of information", you'd want to scrub the target, populate everything else, and then try to measure at "did I trigger that target". Except you wouldn't want to do it one page at a time but see as much pattern of "they were touched in this order" as you can, and you'd like to get timing information of how the pages you are interested were populated too. And you'd generally do this over and over and over again because you're trying to read out some signal. Notice what the expensive operation was? It's the scrubbing.The "did the target do IO" you might actually even see other ways for the trivial cases, like even just look at iostat: just pre-populate everything but the part you care about, then try to trigger whatever you're searching for, and see if it caused IO or not. So it's a bit like a chalkboard: in order to read out the result, you need to erase it first, and doing that blindly is nasty. And you want to look at timings, which is also really nasty if every time you look, you smudge the very place you looked at. It makes it hard to see what somebody else is writing on the board if you're always overwriting what you just looked at. Did you get some new information? If not, now you have to go back and do that scrubbing again, and you'll likely be missing what *else* the person wrote. Ans as always: there is no "black and white". There is no "absolute security", and similarly, there is no "absolute leak proof". It's all about making it inconvenient enough that it's not really practical. Linus
On Fri, Jan 11, 2019 at 08:26:14AM -0800, Linus Torvalds wrote: > On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > It's only that single page that *matters*. That's the page that the > > > probe reveals the status of - but it's also the page that the probe > > > then *changes* the status of. > > > > It changes the state of it /after/ we've already got the information > > we need from it. It's not up to date, it has to come from disk, we > > return EAGAIN, which means it was not in the cache. > > Oh, I see the confusion. > > Yes, you get the information about whether something was in the cache > or not, so the side channel does exist to some degree. > > But it's actually hugely reduced for a rather important reason: the > _primary_ reason for needing to know whether some page is in the cache > or not is not actually to see if it was ever accessed - it's to see > that the cache has been scrubbed (and to _guide_ the scrubbing), and > *when* it was accessed. Oh, you're assuming that you need to probe the page cache to determine if brute force invalidation has progressed far enough to invalidate the page in question. I'm assuming that you can invalidate the page cache reliably by a means that does not repeated require probing to detect invalidation has occurred. I've mentioned one method in this discussion already... IOWs, just because the specific brute force attack documented in the paper required repeated probing it doesn't mean all future invalidation attacks will require repeated probing. Hence a robust defense mechanism should not rely on the attacker requiring multiple observations of the page cache to extract the information they are seeking... Cheers, Dave.
Linus Torvalds wrote on Thu, Jan 10, 2019: > So right now, I consider the mincore change to be a "try to probe the > state of mincore users", and we haven't really gotten a lot of > information back yet. <raises hand> For Netflix, losing accurate information from the mincore syscall would lengthen database cluster maintenance operations from days to months. We rely on cross-process mincore to migrate the contents of a page cache from machine to machine, and across reboots. To do this, I wrote and maintain happycache [1], a page cache dumper/loader tool. It is quite similar in architecture to pgfincore, except that it is agnostic to workload. The gist of happycache's operation is "produce a dump of residence status for each page, do some operation, then reload exactly the same pages which were present before." happycache is entirely dependent on accurate reporting of the in-core status of file-backed pages, as accessed by another process. We primarily use happycache with Cassandra, which (like Postgres + pgfincore) relies heavily on OS page cache to reduce disk accesses. Because our workloads never experience a cold page cache, we are able to provision hardware for a peak utilization level that is far lower than the hypothetical "every query is a cache miss" peak. A database warmed by happycache can be ready for service in seconds (bounded only by the performance of the drives and the I/O subsystem), with no period of in-service degradation. By contrast, putting a database in service without a page cache entails a potentially unbounded period of degradation (at Netflix, the time to populate a single node's cache via natural cache misses varies by workload from hours to weeks). If a single node upgrade were to take weeks, then upgrading an entire cluster would take months. Since we want to apply security upgrades (and other things) on a somewhat tighter schedule, we would have to develop more complex solutions to provide the same functionality already provided by mincore. At the bottom line, happycache is designed to benignly exploit the same information leak documented in the paper [2]. I think it makes perfect sense to remove cross-process mincore functionality from unprivileged users, but not to remove it entirely. Josh Snyder Netflix Cloud Database Engineering [1] https://github.com/hashbrowncipher/happycache [2] https://arxiv.org/abs/1901.01161
On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote: > > I'm assuming that you can invalidate the page cache reliably by a > means that does not repeated require probing to detect invalidation > has occurred. I've mentioned one method in this discussion > already... Yes. And it was made clear to you that it was a bug in xfs dio and what the right thing to do was. And you ignored that, and claimed it was a feature. Why do you then bother arguing this thing? We absolutely agree that xfs has an information leak. If you don't care, just _say_ so. Don't try to argue against other people who are trying to fix things. We can easily just say "ok, xfs people don't care", and ignore the xfs invalidation issue. That's fine. But don't try to make it a big deal for other filesystems that _don't_ have the bug. I even pointed out how ext4 does the page cache flushing correcrly. You pooh-poohed it. You can't have it both ways. Either you care or you don't. If you don't care (and so far everything you said seems to imply you don't), then why are you even discussing this? Just admit you don't care, and we're done. Linus
On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <joshs@netflix.com> wrote: > > For Netflix, losing accurate information from the mincore syscall would > lengthen database cluster maintenance operations from days to months. We > rely on cross-process mincore to migrate the contents of a page cache from > machine to machine, and across reboots. Ok, this is the kind of feedback we need, and means I guess we can't just use the mapping existence for mincore. The two other ways that we considered were: (a) owner of the file gets to know cache information for that file. (b) having the fd opened *writably* gets you cache residency information. Sadly, taking a look at happycache, you open the file read-only, so (b) doesn't work. Judging just from the source code, I can't tell how the user ownership works. Any input on that? And if you're not the owner of the file, do you have another suggestion for that "Yes, I have the right to see what's in-core for this file". Because the problem is literally that if it's some random read-only system file, the kernel shouldn't leak access patterns to it.. Linus
> On Jan 15, 2019, at 9:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <joshs@netflix.com> wrote: >> >> For Netflix, losing accurate information from the mincore syscall would >> lengthen database cluster maintenance operations from days to months. We >> rely on cross-process mincore to migrate the contents of a page cache from >> machine to machine, and across reboots. > > Ok, this is the kind of feedback we need, and means I guess we can't > just use the mapping existence for mincore. > > The two other ways that we considered were: > > (a) owner of the file gets to know cache information for that file. > > (b) having the fd opened *writably* gets you cache residency information. > > Sadly, taking a look at happycache, you open the file read-only, so > (b) doesn't work. > > Judging just from the source code, I can't tell how the user ownership > works. Any input on that? > > And if you're not the owner of the file, do you have another > suggestion for that "Yes, I have the right to see what's in-core for > this file". Because the problem is literally that if it's some random > read-only system file, the kernel shouldn't leak access patterns to > it.. > > Something like CAP_DAC_READ_SEARCH might not be crazy.
On Wed, Jan 16, 2019 at 5:25 PM Andy Lutomirski <luto@amacapital.net> wrote: > > Something like CAP_DAC_READ_SEARCH might not be crazy. I agree that it would work. In fact' it's what Jiri's patch basically did. Except Jiri used CAP_SYS_ADMIN instead. But that then basically limits it to root (or root-like with capability masks), which is quite likely to not work in practice all that well. That's why I wanted to find alternatives. *Very* few people want to run their databases as root. Jiri's original patch kind of acknowledged that by making the new test be conditional, and off by default. So then it's a "only do this for lockdown mode, because normal people won't find it acceptable". And I'm not a huge fan of that approach. If you don't protect normal people, then what's the point, really? Linus
Linus Torvalds wrote on Wed, Jan 16, 2019: > *Very* few people want to run their databases as root. In the case of happycache, this isn't the database doing the dump/restore, but a separate process that could have the cap - it's better if we can do without though, and from his readme he runs as user cassandra in the /var/lib/cassandra directory for example so that'd match the file owner. For pgfincore, it's a postgres extension so the main process does it - but it does have files open as write as well as being the owner. > Jiri's original patch kind of acknowledged that by making the new test > be conditional, and off by default. So then it's a "only do this for > lockdown mode, because normal people won't find it acceptable". > > And I'm not a huge fan of that approach. If you don't protect normal > people, then what's the point, really? I agree with that. "Being owner or has cap" (whichever cap) is probably OK. On the other hand, writeability check makes more sense in general - could we somehow check if the user has write access to the file instead of checking if it currently is opened read-write?
On Wed, Jan 16, 2019 at 4:54 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote: > > > > I'm assuming that you can invalidate the page cache reliably by a > > means that does not repeated require probing to detect invalidation > > has occurred. I've mentioned one method in this discussion > > already... > > Yes. And it was made clear to you that it was a bug in xfs dio and > what the right thing to do was. Side note: I actually think we *do* the right thing. Even for xfs. I couldn't find the alleged place that invalidates the page cache on dio reads. The *generic* dio code only does it for writes (which is correct and fine). And maybe xfs has some extra invalidation, but I don't see it. So I actually hope your "you can use direct-io read to do directed invalidating of the page cache" isn't true. I admittedly did *not* try to delve very deeply into it, but the invalidates I found looked correct. The generic code does it for writes, and at least ext4 does the "writeback and wait" for reads. There *does* seem to be a 'invalidate_inode_pages2_range()' call in iomap_dio_rw(). That has a *comment* that says it only is for writes, but it looks to me like it would trigger for reads too. Just a plain bug/oversight? Or me misreading things. So yes, maybe xfs does that "invalidate on read", but it really seems to be just a bug. If the xfs people insist on keeping the bug, fine (looks like gfs2 and xfs are the only users), but it seems kind of sad. Linus
On Wed, Jan 16, 2019 at 5:46 PM Dominique Martinet <asmadeus@codewreck.org> wrote: > > "Being owner or has cap" (whichever cap) is probably OK. > On the other hand, writeability check makes more sense in general - > could we somehow check if the user has write access to the file instead > of checking if it currently is opened read-write? That's likely the best option. We could say "is it open for write, or _could_ we open it for writing?" It's a slightly annoying special case, and I'd have preferred to avoid it, but it doesn't sound *compilcated*. I'm on the road, but I did send out this: https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/ originally. The "let's try to only do the mmap residency" was the optimistic "maybe we can just get rid of this complexity entirely" version.. Anybody willing to test the above patch instead? And replace the || capable(CAP_SYS_ADMIN) check with something like || inode_permission(inode, MAY_WRITE) == 0 instead? (This is obviously after you've reverted the "only check mmap residency" patch..) Linus
Linus Torvalds wrote on Wed, Jan 16, 2019: > Anybody willing to test the above patch instead? And replace the > > || capable(CAP_SYS_ADMIN) > > check with something like > > || inode_permission(inode, MAY_WRITE) == 0 > > instead? > > (This is obviously after you've reverted the "only check mmap > residency" patch..) That seems to work on an x86_64 vm. I've tested with the attached patch: - root can lookup pages on any file I tried; - user can lookup page on file it owns, assuming it can write to it (e.g. it won't work on a 0400 file you own) - user cannot lookup pages on e.g. /lib64/libc-2.28.so There is a difference with your previous patch though, that used to list no page in core when it didn't know; this patch lists pages as in core when it refuses to tell. I don't think that's very important, though. If anything, the 0400 user-owner file might be a problem in some edge case (e.g. if you're preloading git directories, many objects are 0444); should we *also* check ownership?...
On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org> wrote: > > There is a difference with your previous patch though, that used to list no > page in core when it didn't know; this patch lists pages as in core when it > refuses to tell. I don't think that's very important, though. Is there a reason not to return -EPERM in this case? > > If anything, the 0400 user-owner file might be a problem in some edge > case (e.g. if you're preloading git directories, many objects are 0444); > should we *also* check ownership?... Yes, this seems valuable. Some databases with immutable files (e.g. git, as you've mentioned) conceivably operate this way. Josh
On Tue, Jan 15, 2019 at 11:52:25PM -0800, Josh Snyder wrote: > On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org> > wrote: > > > > There is a difference with your previous patch though, that used to list no > > page in core when it didn't know; this patch lists pages as in core when it > > refuses to tell. I don't think that's very important, though. > > Is there a reason not to return -EPERM in this case? When I was looking through the Debian Code Search results, quite a few of the hits were for code that uses mincore() as a way to check if _anything_ is mapped at an address or not. This code doesn't care about the in core / not in core result of mincore(), just whether it returned an error or not. I think a new error return would break most of the instances of that code I saw. - Kevin
On Wed, Jan 16, 2019 at 05:00:25PM +1200, Linus Torvalds wrote: > And if you're not the owner of the file, do you have another > suggestion for that "Yes, I have the right to see what's in-core for > this file". Because the problem is literally that if it's some random > read-only system file, the kernel shouldn't leak access patterns to > it.. This probably isn't a good heuristic, but thought I'd mention it anyway ... if the file is executable and you're not the owner, mincore always/never says its pages are resident. That'd fix all library leaks, but then there's probably a smart way of figuring out something from access patterns to a data file of some kind (/etc/passwd?)
On Wed, 16 Jan 2019, Linus Torvalds wrote: > > "Being owner or has cap" (whichever cap) is probably OK. On the other > > hand, writeability check makes more sense in general - could we > > somehow check if the user has write access to the file instead of > > checking if it currently is opened read-write? > > That's likely the best option. We could say "is it open for write, or > _could_ we open it for writing?" > > It's a slightly annoying special case, and I'd have preferred to avoid > it, but it doesn't sound *compilcated*. > > I'm on the road, but I did send out this: > > https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/ > > originally. The "let's try to only do the mmap residency" was the > optimistic "maybe we can just get rid of this complexity entirely" > version.. > > Anybody willing to test the above patch instead? And replace the > > || capable(CAP_SYS_ADMIN) > > check with something like > > || inode_permission(inode, MAY_WRITE) == 0 > > instead? > > (This is obviously after you've reverted the "only check mmap > residency" patch..) So that seems to deal with mincore() in a reasonable way indeed. It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it provide any good answer what to do about it, does it? Thanks,
On Thu, Jan 17, 2019 at 4:12 AM Jiri Kosina <jikos@kernel.org> wrote: > > So that seems to deal with mincore() in a reasonable way indeed. > > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it > provide any good answer what to do about it, does it? As I suggested earlier in the thread, the fix for RWF_NOWAIT might be to just move the test down to after readahead. We could/should be smarter than this,but it *might* be as simple as just diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..7bcdd36e629d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read( page = find_get_page(mapping, index); if (!page) { - if (iocb->ki_flags & IOCB_NOWAIT) - goto would_block; page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); which starts readahead even if IOCB_NOWAIT is set and will then test IOCB_NOWAIT _later_ and not actually wait for it. Linus
On Thu, 17 Jan 2019, Linus Torvalds wrote: > > So that seems to deal with mincore() in a reasonable way indeed. > > > > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it > > provide any good answer what to do about it, does it? > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > to just move the test down to after readahead. So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the kernel with the three topmost patches from https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel applied (also attaching to this mail), and no obvious breakage popped up. So if noone sees any principal problem there, I'll happily submit it with proper attribution etc. Thanks,
On Wed, Jan 16, 2019 at 09:23:04PM +0100, Jiri Kosina wrote: > On Thu, 17 Jan 2019, Linus Torvalds wrote: > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > > to just move the test down to after readahead. Your patch 3/3 just removes the test. Am I right in thinking that it doesn't need to be *moved* because the existing test after !PageUptodate catches it? Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there any in LTP? Some typos in the commit messages: > Another aproach (checking file access permissions in order to decide "approach" > Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative > > The semantics of what mincore() considers to be resident is not completely > clearar, but Linux has always (since 2.3.52, which is when mincore() was "clear" > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache information > for non-anonymous mappings that belog to files that the calling process could "belong"
On Wed, 16 Jan 2019, Matthew Wilcox wrote: > > On Thu, 17 Jan 2019, Linus Torvalds wrote: > > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > > > to just move the test down to after readahead. > > Your patch 3/3 just removes the test. Am I right in thinking that it > doesn't need to be *moved* because the existing test after !PageUptodate > catches it? Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and if it's actually set, it'll be handled by the !PageUpdtodate case. > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > any in LTP? Not in the released version AFAIK. I've asked the LTP maintainer (in our internal bugzilla) to take care of this thread a few days ago, but not sure what came out of it. Adding him (Cyril) to CC. > Some typos in the commit messages: > > > Another aproach (checking file access permissions in order to decide > "approach" > > > Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative > > > > The semantics of what mincore() considers to be resident is not completely > > clearar, but Linux has always (since 2.3.52, which is when mincore() was > "clear" > > > initially done) treated it as "page is available in page cache". > > > > That's potentially a problem, as that [in]directly exposes meta-information > > about pagecache / memory mapping state even about memory not strictly belonging > > to the process executing the syscall, opening possibilities for sidechannel > > attacks. > > > > Change the semantics of mincore() so that it only reveals pagecache information > > for non-anonymous mappings that belog to files that the calling process could > "belong" Thanks.
On Wed, Jan 16, 2019 at 04:54:49PM +1200, Linus Torvalds wrote: > On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote: > > > > I'm assuming that you can invalidate the page cache reliably by a > > means that does not repeated require probing to detect invalidation > > has occurred. I've mentioned one method in this discussion > > already... > > Yes. And it was made clear to you that it was a bug in xfs dio and > what the right thing to do was. > > And you ignored that, and claimed it was a feature. Linus, either you aren't listening or you're being intentionally provocative. So, for the *third* time this thread: we can probably remove this code but first we need to be sure it doesn't cause unexpected regressions before we commit such a change. We are not cowboys who test userspace behavioural changes on users without review or discussion. Indeed, I wrote a patch to remove the invalidation /several days ago/ and put it into my test trees, and it's been there since. Just because you don't see immediate changes doesn't mean it isn't happening. > Either you care or you don't. If you don't care (and so far everything > you said seems to imply you don't), Linus, this is just a personal attack and IMO a violation of the CoC. It's straight out wrong, insulting, totally unprofessional and completely uncalled for. This is most definitely not a useful technical response to the issues I raised. i.e you cut out all the context of my response about whether "no probing necessary" page cache invalidation attacks are something we need to care about in the future. We don't need you to shout about existing "no probing necessary" page cache invalidation attacks that are already being addressed, we need to determine if it's going to be a recurring problem in future because that directly affects the mitigation strategies we can implement. -Dave.
Jiri Kosina wrote on Wed, Jan 16, 2019: > So if noone sees any principal problem there, I'll happily submit it with > proper attribution etc. I'm not convinced just the write permission check is enough for mincore(), as Josh also seems to share the concern I raised (e.g. map a git directory "hot" pages) We probably need to add an inode_owner_or_capable() or similar, the open question is do we still need the write access check after that - I don't really know how expensive these calls are. Thanks,
On Fri, Jan 11, 2019 at 08:36:55AM +0100, Jiri Kosina wrote: > On Thu, 10 Jan 2019, Dave Chinner wrote: > > > Sounds nice from a theoretical POV, but reality has taught us very > > different lessons. > > > > FWIW, a quick check of XFS's history so you understand how long this > > behaviour has been around. It was introduced in the linux port in 2001 > > as direct IO support was being added: > > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > > Author: Steve Lord <lord@sgi.com> > > Date: Mon Mar 5 16:47:52 2001 +0000 > > > > Add bounds checking for direct I/O, do the cache invalidation for > > data coherency on direct I/O. > > Out of curiosity, which repository is this from please? Even google > doesn't seem to know about this SHA. because oss.sgi.com is no longer with us, it's fallen out of all the search engines. It was from the "archive/xfs-import.git" tree on oss.sgi.com: https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi but archive.org doesn't have a copy of the git tree. It contained the XFS history right back to the first Irix commit in 1993. Some of us still have copies of it sitting around.... Cheers, Dave.
On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote: > > Your patch 3/3 just removes the test. Am I right in thinking that it > doesn't need to be *moved* because the existing test after !PageUptodate > catches it? That's the _hope_. That's the simplest patch I can come up with as a potential solution. But it's possible that there's some nasty performance regression because somebody really relies on not even triggering read-ahead, and we might need to do some totally different thing. So it may be that somebody has a case that really wants something else, and we'd need to move the RWF_NOWAIT test elsewhere and do something slightly more complicated. As with the mincore() change, maybe reality doesn't like the simplest fix... > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > any in LTP? RWF_NOWAIT is actually _fairly_ new. It was introduced "only" about 18 months ago and made it into 4.13. Which makes me hopeful there aren't a lot of people who care deeply. And starting readahead *may* actually be what a RWF_NOWAIT read user generally wants, so for all we know it might even improve performance and/or allow new uses. With the "start readahead but don't wait for it" semantics, you can have a model where you try to handle all the requests that can be handled out of cache first (using RWF_NOWAIT) and then when you've run out of cached cases you clear the RWF_NOWAIT flag, but now the IO has been started early (and could overlap with the cached request handling), so then when you actually do a blocking version, you get much better performance. So there is an argument that removing that one RWF_NOWAIT case might actually be a good thing in general, outside of the "don't allow probing the cache without changing the state of it" issue. But that's handwavy and optimistic. Reality is often not as accomodating ;) Linus
On Thu, 17 Jan 2019, Dave Chinner wrote: > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > > > Author: Steve Lord <lord@sgi.com> > > > Date: Mon Mar 5 16:47:52 2001 +0000 > > > > > > Add bounds checking for direct I/O, do the cache invalidation for > > > data coherency on direct I/O. > > > > Out of curiosity, which repository is this from please? Even google > > doesn't seem to know about this SHA. > > because oss.sgi.com is no longer with us, it's fallen out of all the > search engines. It was from the "archive/xfs-import.git" tree on > oss.sgi.com: > > https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi > > but archive.org doesn't have a copy of the git tree. It contained > the XFS history right back to the first Irix commit in 1993. Some of > us still have copies of it sitting around.... For cases like this, would it be worth pushing it to git.kernel.org as an frozen historical reference archive? Thanks,
Hi! > > Your patch 3/3 just removes the test. Am I right in thinking that it > > doesn't need to be *moved* because the existing test after !PageUptodate > > catches it? > > Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and > if it's actually set, it'll be handled by the !PageUpdtodate case. > > > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > > any in LTP? > > Not in the released version AFAIK. I've asked the LTP maintainer (in our > internal bugzilla) to take care of this thread a few days ago, but not > sure what came out of it. Adding him (Cyril) to CC. So far not much, I've looked over our mincore() tests and noted down how to improve them here: https://github.com/linux-test-project/ltp/issues/461 We do plan to test the final mincore() fix: https://github.com/linux-test-project/ltp/issues/460 And we do have RWF_NOWAIT tests on our TODO for some time as well: https://github.com/linux-test-project/ltp/issues/286 I guess I can raise priority for that one so that we have basic functional tests in a week or so. Also if anyone has some RWF_NOWAIT tests already it would be nice if these could be shared with us. [A bit off topic rant] I've been telling kernel developers for years that if they have a test code they used when developing a kernel feature that they should share it with us (LTP community) and we will turn these into automated tests and maintain them for free. LTP is also used in many QA departements around the word so such tests will end up executed in different environments also for free. Sadly this does not happen much and there are only few exceptions so far. But maybe I wasn't shouting loudly enough.
On Thu, Jan 17, 2019 at 09:18:41AM +0100, Jiri Kosina wrote: > On Thu, 17 Jan 2019, Dave Chinner wrote: > > > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > > > > Author: Steve Lord <lord@sgi.com> > > > > Date: Mon Mar 5 16:47:52 2001 +0000 > > > > > > > > Add bounds checking for direct I/O, do the cache invalidation for > > > > data coherency on direct I/O. > > > > > > Out of curiosity, which repository is this from please? Even google > > > doesn't seem to know about this SHA. > > > > because oss.sgi.com is no longer with us, it's fallen out of all the > > search engines. It was from the "archive/xfs-import.git" tree on > > oss.sgi.com: > > > > https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi > > > > but archive.org doesn't have a copy of the git tree. It contained > > the XFS history right back to the first Irix commit in 1993. Some of > > us still have copies of it sitting around.... > > For cases like this, would it be worth pushing it to git.kernel.org as an > frozen historical reference archive? I'm not sure we should be putting code from Irix on kernel.org. I uploaded a copy to github a few months ago so XFS devs could easily reference relevant commits in email. The reasons we decided not to upload it to kernel.org should be clear from the readme... https://github.com/dchinner/xfs-history Cheers, Dave.
On 1/16/2019 8:52 AM, Josh Snyder wrote: > On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org> > wrote: >> >> There is a difference with your previous patch though, that used to list no >> page in core when it didn't know; this patch lists pages as in core when it >> refuses to tell. I don't think that's very important, though. I've argued previously that reporting false positives (as your patch does) should be better, otherwise there might be somebody trying to fault in their pages in a loop until mincore reports positive, which would become an endless loop. So agreed with your change. Or maybe we could resort to the 5.0-rc1 page table check (that is now being reverted) but only in cases when we are not allowed the page cache residency check? Or would that be needlessly complicated? And it would be able to leak if a page was evicted from the page cache... > Is there a reason not to return -EPERM in this case? That would definitely break somebody. >> >> If anything, the 0400 user-owner file might be a problem in some edge >> case (e.g. if you're preloading git directories, many objects are 0444); >> should we *also* check ownership?... > > Yes, this seems valuable. Some databases with immutable files (e.g. git, as > you've mentioned) conceivably operate this way. > > Josh >
On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > Or maybe we could resort to the 5.0-rc1 page table check (that is now being > reverted) but only in cases when we are not allowed the page cache residency > check? Or would that be needlessly complicated? I think it would be good fallback semantics, but I'm not sure it's worth it. Have you tried writing a patch for it? I don't think you'd want to do the check *when* you find a hole, so you'd have to do it upfront and then pass the cached data down with the private pointer (or have a separate "struct mm_walk" structure, perhaps? So I suspect we're better off with the patch we have. But if somebody *wants* to try to do that fancier patch, and it doesn't look horrendous, I think it might be the "quality" solution. Linus
On Thu, Jan 17, 2019 at 4:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > Your patch 3/3 just removes the test. Am I right in thinking that it > > doesn't need to be *moved* because the existing test after !PageUptodate > > catches it? > > That's the _hope_. > > That's the simplest patch I can come up with as a potential solution. > But it's possible that there's some nasty performance regression > because somebody really relies on not even triggering read-ahead, and > we might need to do some totally different thing. Oh, and somebody should probably check that there isn't some simple way to just avoid that readahead code entirely. In particular, right now we skip readahead for at least these cases: /* no read-ahead */ if (!ra->ra_pages) return; if (blk_cgroup_congested()) return; and I don't think we need to worry about the cgroup congestion case - if the attack has to also congest its cgroup with IO, I think they have bigger problems. And I think 'ra_pages' can be zero only in the presence of IO errors, but I might be wrong. It would be good if somebody double-checks that. Linus
Hello, Linus. On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote: > And the first hit is 'fincore', which probably nobody cares about > anyway, but it does > > fd = open (name, O_RDONLY) > .. > mmap(window, len, PROT_NONE, MAP_PRIVATE, .. So, folks here have been using fincore(1) for diagnostic purposes and are also looking to expand on it to investigate per-cgroup cache usages (mmap -> mincore -> /proc/self/pagemap -> /proc/kpagecgroup -> cgroup path). These are all root-only usages to find out what's going on with the whole page cache. We aren't attached to doing things this particular way but it'd suck if there's no way. Thanks.
On 1/18/19 5:49 AM, Linus Torvalds wrote: > On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> Or maybe we could resort to the 5.0-rc1 page table check (that is now being >> reverted) but only in cases when we are not allowed the page cache residency >> check? Or would that be needlessly complicated? > > I think it would be good fallback semantics, but I'm not sure it's > worth it. Have you tried writing a patch for it? I don't think you'd > want to do the check *when* you find a hole, so you'd have to do it > upfront and then pass the cached data down with the private pointer > (or have a separate "struct mm_walk" structure, perhaps? > > So I suspect we're better off with the patch we have. But if somebody > *wants* to try to do that fancier patch, and it doesn't look > horrendous, I think it might be the "quality" solution. I thought to drop the idea because of leaking that page has been evicted, but then I realized there are other ways to check for that anyway in /proc. So I'll try, but probably not until after next week. If somebody else wants to, they are welcome. As you say, the current solution should be ok, so that would be a patch on top anyway, for bisectability etc. Vlastimil > Linus >
On Thu, Jan 17, 2019 at 9:23 AM Jiri Kosina <jikos@kernel.org> wrote: > > So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the > kernel with the three topmost patches from > > https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel > > applied (also attaching to this mail), and no obvious breakage popped up. > > So if noone sees any principal problem there, I'll happily submit it with > proper attribution etc. So this seems to have died down, and a week later we seem to not have a lot of noise here any more. I think it means people either weren't testing it, or just didn't find any real problems. I've reverted the 'let's try to just remove the code' part in my tree. But I didn't apply the two other patches yet. Any final comments before that should happen? Linus
On Thu, Jan 24, 2019 at 9:27 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I've reverted the 'let's try to just remove the code' part in my tree. > But I didn't apply the two other patches yet. Any final comments > before that should happen? Side note: the inode_permission() addition to can_do_mincore() in that patch 0002, seems to be questionable. We do +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} note how it tests whether vma->vm_file is NULL for the FMODE_WRITE test, but not for the inode_permission() test. So either we test unnecessarily in the second line, or we don't properly test it in the third one. I think the "test vm_file" thing may be unnecessary, because a non-anonymous mapping should always have a file pointer and an inode. But I could imagine some odd case (vdso mapping, anyone?) that doesn't have a vm_file, but also isn't anonymous. Anybody? Anyway, it's one reason why I didn't actually apply those other two patches yet. This may be a 5.1 issue.. Linus
On Thu, 24 Jan 2019, Linus Torvalds wrote: > Side note: the inode_permission() addition to can_do_mincore() in that > patch 0002, seems to be questionable. We do > > +static inline bool can_do_mincore(struct vm_area_struct *vma) > +{ > + return vma_is_anonymous(vma) > + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) > + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; > +} > > note how it tests whether vma->vm_file is NULL for the FMODE_WRITE > test, but not for the inode_permission() test. > > So either we test unnecessarily in the second line, or we don't > properly test it in the third one. > > I think the "test vm_file" thing may be unnecessary, because a > non-anonymous mapping should always have a file pointer and an inode. > But I could imagine some odd case (vdso mapping, anyone?) that > doesn't have a vm_file, but also isn't anonymous. Hmm, good point. So dropping the 'vma->vm_file' test and checking whether given vma is special mapping should hopefully provide the desired semantics, shouldn't it?
On Thu, Jan 24, 2019 at 12:12 PM Jiri Kosina <jikos@kernel.org> wrote: > > > > > I think the "test vm_file" thing may be unnecessary, because a > > non-anonymous mapping should always have a file pointer and an inode. > > But I could imagine some odd case (vdso mapping, anyone?) that > > doesn't have a vm_file, but also isn't anonymous. > > Hmm, good point. > > So dropping the 'vma->vm_file' test and checking whether given vma is > special mapping should hopefully provide the desired semantics, shouldn't > it? Maybe. But on the whole I think it would be simpler and more straightforward to just instead add a vm_file test for the inode_permission() case. That way you at least know that you aren't following a NULL pointer. If the file then turns out to be some special thing, it doesn't really _matter_, I think. It won't have anything in the page cache etc, but the code should "work". Linus
On Thu, 24 Jan 2019, Dominique Martinet wrote: > Jiri, you've offered resubmitting the last two patches properly, can you > incorporate this change or should I just send this directly? (I'd take > most of your commit message and add your name somewhere) I've been running some basic smoke testing with the kernel from https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2 (attaching the respective two patches to apply on top of latest Linus' tree to this mail as well), and everything looks good so far. Thanks,
On Thu, 24 Jan 2019, Jiri Kosina wrote: > > Jiri, you've offered resubmitting the last two patches properly, can you > > incorporate this change or should I just send this directly? (I'd take > > most of your commit message and add your name somewhere) > > I've been running some basic smoke testing with the kernel from > > https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2 > > (attaching the respective two patches to apply on top of latest Linus' > tree to this mail as well), and everything looks good so far. So, any objections to aproaching it this way? I've not been able to spot any obvious breakage so far with it. Thanks,
Jiri Kosina wrote on Sun, Jan 27, 2019:
> So, any objections to aproaching it this way?
I'm not sure why I'm the main recipient of that mail but answering
because I am -- let's get these patches in through the regular -mm tree
though
Hi! > > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > > any in LTP? Just FYI I've send a patch with basic RWF_NOWAIT test for review to LTP ML and also CCed mailing lists from this thread. https://lkml.org/lkml/2019/1/28/416
On Mon, 28 Jan 2019, Dominique Martinet wrote: > > So, any objections to aproaching it this way? > > I'm not sure why I'm the main recipient of that mail but answering > because I am -- let's get these patches in through the regular -mm tree > though *prod to mm maintainers* (at least for an opinion)
On Wed 30-01-19 00:52:02, Jiri Kosina wrote: > On Mon, 28 Jan 2019, Dominique Martinet wrote: > > > > So, any objections to aproaching it this way? > > > > I'm not sure why I'm the main recipient of that mail but answering > > because I am -- let's get these patches in through the regular -mm tree > > though > > *prod to mm maintainers* (at least for an opinion) Could you repost those patches please? The thread is long and it is not really clear what is the most up-to-date state of patches (at least to me).
On Wed, 30 Jan 2019, Michal Hocko wrote: > > > I'm not sure why I'm the main recipient of that mail but answering > > > because I am -- let's get these patches in through the regular -mm > > > tree though > > > > *prod to mm maintainers* (at least for an opinion) > > Could you repost those patches please? The thread is long and it is not > really clear what is the most up-to-date state of patches (at least to > me). Vlastimil seems to have one extra patch to go on top, so we agreed that he'll be sending that as a complete self-contained series (either as a followup to the very first e-mail in this monsterthread, or completely separately) shortly. Thanks,
Linus Torvalds <torvalds@linux-foundation.org> writes: <snip> > So in order to use it as a signal, first you have to first scrub the > cache (because if the page was already there, there's no signal at > all), and then for the signal to be as useful as possible, you're also > going to want to try to get out more than one bit of information: you > are going to try to see the patterns and the timings of how it gets > filled. > > And that's actually quite painful. You don't know the initial cache > state, and you're not (in general) controlling the machine entirely, > because there's also that actual other entity that you're trying to > attack and see what it does. > > So what you want to do is basically to first make sure the cache is > scrubbed (only for the pages you're interested in!), then trigger > whatever behavior you are looking for, and then look how that affected > the cache. > > In other words, you want *multiple* residency status check - first to > see what the cache state is (because you're going to want that for > scrubbing), then to see that "yes, it's gone" when doing the > scrubbing, and then to see the *pattern* and timings of how things are > brought in. <snap> In an attempt to gain a better understanding of the guided eviction part resp. the relevance of mincore() & friends to that, I worked on reproducing the results from [1], section 6.1 ("Efficient Page Cache Eviction on Linux"). In case anybody wants to run their own experiments: the sources can be found at [2]. Disclaimer: I don't have access to the sources used by the [1]-paper's authors nor do I know anything about their experimental setup. So it might very well be the case, that my implementation is completely different and/or inefficient. Anyways, quoting from [1], section 6.1: "Eviction Set 1. These are pages already in the page cache, used by other processes. To keep them in the page cache, a thread continuously accesses these pages while also keep- ing the system load low by using sched yield and sleep. Consequently, they are among the most recently accessed pages of the system and eviction of these pages becomes highly unlikely." I had two questions: 1.) Do the actual contents of "Eviction set 1" matter for the guided eviction's performance or can they as well be arbitrary but fixed? Because if the set's actual contents weren't of any importance, then mincore() would not be needed to initialize it with "pages already in the page cache". 2.) How does keeping some fixed set resident + doing some IO compare to simply streaming a huge random file through the page cache? (To make it explicit: I didn't look into the probe part of the attack or the checking of the victim page's residency status as a termination condition for the eviction run.) Obviously, there are two primary factors affecting the victim page eviction performance: the file page cache size and disk read bandwidth. Naively speaking, I would suppose that keeping a certain set resident is a cheap and stable way to constrain IO to the remaining portion of the page cache and thus, reduce the amount of data required to be read from disk until the victim page gets evicted. Results summary (details can be found at the end of this mail): - The baseline benchmark of simply streaming random data through the page cache behaves as expected: avg of "Inactive(file)" / avg of "victim page eviction time" yields ~480MB/s, which approx. matches my disk's read bandwidth (note: the victim page was mapped as !PROT_EXEC). - I didn't do any sophisticated fine-tuning wrt. to parameters, but for the configuration yielding the best result, the average victim page eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the "random but fixed resident set method". That's an improvement by a factor of 2.6 over the baseline "streaming random data method" (with the same amount of anonymous memory, i.e. "Eviction set 3", allocated: 7GB out of a total of 8GB). - In principle, question 1.) can't be answered by experiment without controlling the initial, legitimate system workload. I did some lax tests on my desktop running firefox, libreoffice etc. though and of course, overall responsiveness got a lot better if the "Eviction set 1" had been populated with pages already resident at the time the "attack" started. But the victim page eviction times didn't seem to improve -- at least not by factors such that my biased mind would have been able to recognize any change. In conclusion, keeping an arbitrary, but fixed "Eviction set 1" resident improved the victim page eviction performance by some factor over the "streaming" baseline, where "Eviction set 1" was populated from a single, attacker-controlled file and mincore() was not needed for determining its initial contents. To my surprise though, I needed to rely on mincore() at some other place, namely *my* implementation of keeping the resident set resident. My first naive approach was to have a single thread repeatedly iterating over the pages and reading the first few bytes from each through a mmapped area. That did not work out, because, even for smaller resident sets of 1GB and if run with realtime priority, the accessing thread would at some point in time encounter a reclaimed page and have to wait for the page fault to get served. While waiting for that, even more of the resident pages are likely to get reclaimed, causing additional delays later on. Eventually, the resident page accessor looses the game and will encounter page faults for almost the whole resident set (which isn't resident anymore). I worked around this by making the accessor thread check page residency via mincore(), touch only the resident ones and queue the others to some refault ("rewarm") thread. From briefly looking at iostat, this rewarmer thread actually seemed to saturate the disk and thus, there was no need for additional IO to put pressure on the page cache. For completeness, the amount of pages from the resident set actually found resident made up for ~97% of all file page cache pages (Inactive+Active(file)). Note that the way mincore() is used here is different than for the probe part of the attack: for probing, we'd like to know when a victim page has been faulted in again, while the residency keeper needs to check that some page has not been reclaimed before accessing it. Furthermore, mincore() is run on pages from an attacker-controlled and -owned file here. AFAICS, the patches currently under review (c.f. [3]) don't mitigate against this latter abuse of mincore() & Co. I personally doubt that doing something about it would be worth it though: first of all, until proven otherwise, I'm assuming that the improvement of the "resident set" based eviction method over the "streaming" baseline is not by orders of magnitude and that the victim page eviction times interesting to attackers (let's say better than 500ms) are achievable only under certain conditions: no swap and the ability to block a large fraction of total memory with anonymous mappings. What's more, I can imagine that there are other ways to keep the resident set resident without relying on mincore() at all: I haven't tried it, but simply spawning a larger number of accessor threads, each at a different position within "Eviction set 1", and hoping that most of the time at least one of them finds a resident page to touch, might work, too. Thanks, Nicolai Experiment setup + numbers ========================== My laptop has 8GB of RAM, a SSD and runs OpenSUSE 42.3 kernel package version 4.4.165-81. I stopped almost all services, but didn't setup any CPU isolation. I ran 5 x 2 experiments where I allocated (and filled, of course) 3,4,5,6,7GB of anonymous memory each and compared the baseline "read-in-a-huge-file" (mapped PROT_EXEC) results with the resident set based eviction for each of these. While doing so, I continuously measured eviction times of a !PROT_EXEC victim page and reported the 'Active(file)' + 'Inactive(file)' statistics from /proc/meminfo. Baseline "streaming" benchmark results: anon mem (GB): 7 6 5 4 3 Inactive(file) (MB): 189.3389 709.7063 1221.0910 1735.7510 2247.3340 eviction time (ms): 386.7712 1453.3975 2533.6084 3622.4995 4694.4668 quotient (MB/s): 489.5372 488.3085 481.9573 479.1584 478.7198 and, for comparison with the results from the resident set based eviction below: Inactive+Active(file) (MB): 358 1390 2415 3445 4469 Resident set based eviction: For the results below, I chose to draw the resident set from a single file filled with random data and made it total mem - allocated anon mem in size. The disk bandwidth was saturated all the time. anon mem (GB): 7 6 5 4 3 eviction time (ms): 146.6296 428.9594 620.2084 863.1423 977.8963 improvement over baseline: 2.6 3.4 4.1 4.2 4.8 Inactive+Active(file) (MB): 429 1449 2471 3494 4515 resident from res. set (MB): 417 1428 2426 3424 4429 fraction: 97.3% 98.6% 98.2% 98.0% 98.1% [1] https://arxiv.org/abs/1901.01161 ("Page Cache Attacks") [2] https://github.com/nicstange/pgc [3] https://lkml.kernel.org/r/20190130124420.1834-1-vbabka@suse.cz (*) The distribution of eviction times is not Gaussian.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 187ce4f599a2..afb8635e925e 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm: - min_free_kbytes - min_slab_ratio - min_unmapped_ratio +- mincore_privileged - mmap_min_addr - mmap_rnd_bits - mmap_rnd_compat_bits @@ -485,6 +486,14 @@ files and similar are considered. The default is 1 percent. ============================================================== +mincore_privileged: + +mincore() could be potentially used to mount a side-channel attack against +pagecache metadata. This sysctl provides system administrators means to +make it available only to processess that own CAP_SYS_ADMIN capability. + +The default is 0, which means mincore() can be used without restrictions. +============================================================== mmap_min_addr diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 1825f712e73b..f03cb07c8dd4 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max; #ifndef CONFIG_MMU extern int sysctl_nr_trim_pages; #endif +extern int sysctl_mincore_privileged; /* Constants used for minimum and maximum */ #ifdef CONFIG_LOCKUP_DETECTOR @@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = { .extra2 = (void *)&mmap_rnd_compat_bits_max, }, #endif + { + .procname = "mincore_privileged", + .data = &sysctl_mincore_privileged, + .maxlen = sizeof(sysctl_mincore_privileged), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { } }; diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..77d4928cdfaa 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -21,6 +21,8 @@ #include <linux/uaccess.h> #include <asm/pgtable.h> +int sysctl_mincore_privileged; + static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, unsigned long pages; unsigned char *tmp; + if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* Check the start address: needs to be page-aligned.. */ if (start & ~PAGE_MASK) return -EINVAL;