Message ID | 20160922164359.9035-1-vbabka@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > with the number of fds passed. We had a customer report page allocation > failures of order-4 for this allocation. This is a costly order, so it might > easily fail, as the VM expects such allocation to have a lower-order fallback. > > Such trivial fallback is vmalloc(), as the memory doesn't have to be > physically contiguous. Also the allocation is temporary for the duration of the > syscall, so it's unlikely to stress vmalloc too much. vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. So I guess allowing vmalloc() being called from an innocent application doing a select() might be dangerous, especially if this select() happens thousands of time per second. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/22/2016 06:49 PM, Eric Dumazet wrote: > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> with the number of fds passed. We had a customer report page allocation >> failures of order-4 for this allocation. This is a costly order, so it might >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> physically contiguous. Also the allocation is temporary for the duration of the >> syscall, so it's unlikely to stress vmalloc too much. > > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. > > So I guess allowing vmalloc() being called from an innocent application > doing a select() might be dangerous, especially if this select() happens > thousands of time per second. Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote: > On 09/22/2016 06:49 PM, Eric Dumazet wrote: > > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: > >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > >> with the number of fds passed. We had a customer report page allocation > >> failures of order-4 for this allocation. This is a costly order, so it might > >> easily fail, as the VM expects such allocation to have a lower-order fallback. > >> > >> Such trivial fallback is vmalloc(), as the memory doesn't have to be > >> physically contiguous. Also the allocation is temporary for the duration of the > >> syscall, so it's unlikely to stress vmalloc too much. > > > > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. > > > > So I guess allowing vmalloc() being called from an innocent application > > doing a select() might be dangerous, especially if this select() happens > > thousands of time per second. > > Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()? Possibly. We don't have a library function (attempting kmalloc(), fallback to vmalloc() presumably to avoid abuses, but I guess some patches were accepted without thinking about this. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/22/2016 07:07 PM, Eric Dumazet wrote: > On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote: >> On 09/22/2016 06:49 PM, Eric Dumazet wrote: >> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: >> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> >> with the number of fds passed. We had a customer report page allocation >> >> failures of order-4 for this allocation. This is a costly order, so it might >> >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> >> physically contiguous. Also the allocation is temporary for the duration of the >> >> syscall, so it's unlikely to stress vmalloc too much. >> > >> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. >> > >> > So I guess allowing vmalloc() being called from an innocent application >> > doing a select() might be dangerous, especially if this select() happens >> > thousands of time per second. >> >> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()? > > Possibly. > > We don't have a library function (attempting kmalloc(), fallback to > vmalloc() presumably to avoid abuses, but I guess some patches were > accepted without thinking about this. So in the case of select() it seems like the memory we need 6 bits per file descriptor, multiplied by the highest possible file descriptor (nfds) as passed to the syscall. According to the man page of select: EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)). The code actually seems to silently cap the value instead of returning EINVAL though? (IIUC): /* max_fds can increase, so grab it once to avoid race */ rcu_read_lock(); fdt = files_fdtable(current->files); max_fds = fdt->max_fds; rcu_read_unlock(); if (n > max_fds) n = max_fds; The default for this cap seems to be 1024 where I checked (again, IIUC, it's what ulimit -n returns?). I wasn't able to change it to more than 2048, which makes the bitmaps still below PAGE_SIZE. So if I get that right, the system admin would have to allow really large RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large concern? Vlastimil -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Vlastimil Babka > Sent: 22 September 2016 18:55 ... > So in the case of select() it seems like the memory we need 6 bits per file > descriptor, multiplied by the highest possible file descriptor (nfds) as passed > to the syscall. According to the man page of select: > > EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see > getrlimit(2)). That second clause is relatively recent. > The code actually seems to silently cap the value instead of returning EINVAL > though? (IIUC): > > /* max_fds can increase, so grab it once to avoid race */ > rcu_read_lock(); > fdt = files_fdtable(current->files); > max_fds = fdt->max_fds; > rcu_read_unlock(); > if (n > max_fds) > n = max_fds; > > The default for this cap seems to be 1024 where I checked (again, IIUC, it's > what ulimit -n returns?). I wasn't able to change it to more than 2048, which > makes the bitmaps still below PAGE_SIZE. > > So if I get that right, the system admin would have to allow really large > RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large > concern? 4k open files isn't that many. Especially for programs that are using pipes to emulate windows events. I suspect that fdt->max_fds is an upper bound for the highest fd the process has open - not the RLIMIT_NOFILE value. select() shouldn't be silently ignoring large values of 'n' unless the fd_set bits are zero. Of course, select does scale well for high numbered fds and neither poll nor select scale well for large numbers of fds. David
On 09/23/2016 11:42 AM, David Laight wrote: > From: Vlastimil Babka >> Sent: 22 September 2016 18:55 > ... >> So in the case of select() it seems like the memory we need 6 bits per file >> descriptor, multiplied by the highest possible file descriptor (nfds) as passed >> to the syscall. According to the man page of select: >> >> EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see >> getrlimit(2)). > > That second clause is relatively recent. Interesting... so it was added without actually being true in the kernel code? >> The code actually seems to silently cap the value instead of returning EINVAL >> though? (IIUC): >> >> /* max_fds can increase, so grab it once to avoid race */ >> rcu_read_lock(); >> fdt = files_fdtable(current->files); >> max_fds = fdt->max_fds; >> rcu_read_unlock(); >> if (n > max_fds) >> n = max_fds; >> >> The default for this cap seems to be 1024 where I checked (again, IIUC, it's >> what ulimit -n returns?). I wasn't able to change it to more than 2048, which >> makes the bitmaps still below PAGE_SIZE. >> >> So if I get that right, the system admin would have to allow really large >> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large >> concern? > > 4k open files isn't that many. > Especially for programs that are using pipes to emulate windows events. Sure but IIUC we need 6 bits per file. That means up to almost 42k files, we should fit into order-3 allocation, which effectively cannot fail right now. > I suspect that fdt->max_fds is an upper bound for the highest fd the > process has open - not the RLIMIT_NOFILE value. I gathered that the highest fd effectively limits the number of files, so it's the same. I might be wrong. > select() shouldn't be silently ignoring large values of 'n' unless > the fd_set bits are zero. Yeah that doesn't seem to conform to the manpage. > Of course, select does scale well for high numbered fds > and neither poll nor select scale well for large numbers of fds. True. > David > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Vlastimil Babka > Sent: 23 September 2016 10:59 ... > > I suspect that fdt->max_fds is an upper bound for the highest fd the > > process has open - not the RLIMIT_NOFILE value. > > I gathered that the highest fd effectively limits the number of files, > so it's the same. I might be wrong. An application can reduce RLIMIT_NOFILE below that of an open file. David
On 09/23/2016 03:35 PM, David Laight wrote: > From: Vlastimil Babka >> Sent: 23 September 2016 10:59 > ... >> > I suspect that fdt->max_fds is an upper bound for the highest fd the >> > process has open - not the RLIMIT_NOFILE value. >> >> I gathered that the highest fd effectively limits the number of files, >> so it's the same. I might be wrong. > > An application can reduce RLIMIT_NOFILE below that of an open file. OK, I did some more digging in the code, and my understanding is that: - fdt->max_fds is the current size of the fdtable, which isn't allocated upfront to match the limit, but grows as needed. This means it's OK for core_sys_select() to silently cap nfds, as it knows there are no fd's with higher number in the fdtable, so it's a performance optimization. However, to match what the manpage says, there should be another check against RLIMIT_NOFILE to return -EINVAL, which there isn't, AFAICS. - fdtable is expanded (and fdt->max_fds bumped) by expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which seems to be 1048576 where I checked. - callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to limit the expansion. So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable and fdt->max_fds that is already above the limit. Select syscall would have to check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage. As for the original vmalloc() flood concern, I still think we're safe, as ordinary users are limited by RLIMIT_NOFILE way below sizes that would need vmalloc(), and root has many other options to DOS the system (or worse). Vlastimil > David > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Vlastimil Babka > Sent: 26 September 2016 11:02 > On 09/23/2016 03:35 PM, David Laight wrote: > > From: Vlastimil Babka > >> Sent: 23 September 2016 10:59 > > ... > >> > I suspect that fdt->max_fds is an upper bound for the highest fd the > >> > process has open - not the RLIMIT_NOFILE value. > >> > >> I gathered that the highest fd effectively limits the number of files, > >> so it's the same. I might be wrong. > > > > An application can reduce RLIMIT_NOFILE below that of an open file. > > OK, I did some more digging in the code, and my understanding is that: > > - fdt->max_fds is the current size of the fdtable, which isn't allocated upfront > to match the limit, but grows as needed. This means it's OK for > core_sys_select() to silently cap nfds, as it knows there are no fd's with > higher number in the fdtable, so it's a performance optimization. Not entirely, if any bits are set for fd above fdt->max_fds then select() call should fail - fd not open. > However, to > match what the manpage says, there should be another check against RLIMIT_NOFILE > to return -EINVAL, which there isn't, AFAICS. > > - fdtable is expanded (and fdt->max_fds bumped) by > expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which > seems to be 1048576 where I checked. > > - callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to > limit the expansion. > > So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable > and fdt->max_fds that is already above the limit. Select syscall would have to > check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage. I think the manpage should be fixed (delete that clause). Then add code to the system call to scan the high bit sets (above fdt->max_fds) for any non-zero bytes. This can be done into a small buffer. > As for the original vmalloc() flood concern, I still think we're safe, as > ordinary users are limited by RLIMIT_NOFILE way below sizes that would need > vmalloc(), and root has many other options to DOS the system (or worse). Some processes need very high numbers of fd. Likely they don't use select() on them, but trashing performance if they do is a bit silly. Trying to slit the 3 masks first seems sensible. David
On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka <vbabka@suse.cz> wrote: > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > with the number of fds passed. We had a customer report page allocation > failures of order-4 for this allocation. This is a costly order, so it might > easily fail, as the VM expects such allocation to have a lower-order fallback. > > Such trivial fallback is vmalloc(), as the memory doesn't have to be > physically contiguous. Also the allocation is temporary for the duration of the > syscall, so it's unlikely to stress vmalloc too much. > > Note that the poll(2) syscall seems to use a linked list of order-0 pages, so > it doesn't need this kind of fallback. > > ... > > --- a/fs/select.c > +++ b/fs/select.c > @@ -29,6 +29,7 @@ > #include <linux/sched/rt.h> > #include <linux/freezer.h> > #include <net/busy_poll.h> > +#include <linux/vmalloc.h> > > #include <asm/uaccess.h> > > @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > struct fdtable *fdt; > /* Allocate small arguments on the stack to save memory and be faster */ > long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; > + unsigned long alloc_size; > > ret = -EINVAL; > if (n < 0) > @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > bits = stack_fds; > if (size > sizeof(stack_fds) / 6) { > /* Not enough space in on-stack array; must use kmalloc */ > + alloc_size = 6 * size; Well. `size' is `unsigned'. The multiplication will be done as 32-bit so there was no point in making `alloc_size' unsigned long. So can we tighten up the types in this function? size_t might make sense, but vmalloc() takes a ulong. > ret = -ENOMEM; > - bits = kmalloc(6 * size, GFP_KERNEL); > + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN); > + if (!bits && alloc_size > PAGE_SIZE) > + bits = vmalloc(alloc_size); I don't share Eric's concerns about performance here. If the vmalloc() is called, we're about to write to that quite large amount of memory which we just allocated, and the vmalloc() overhead will be relatively low. > if (!bits) > goto out_nofds; > } > @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > > out: > if (bits != stack_fds) > - kfree(bits); > + kvfree(bits); > out_nofds: > return ret; It otherwise looks OK to me. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote: > I don't share Eric's concerns about performance here. If the vmalloc() > is called, we're about to write to that quite large amount of memory > which we just allocated, and the vmalloc() overhead will be relatively > low. I did not care of the performance of this particular select() system call really, but other cpus because of more TLB invalidations. At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe we do not care. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/27/2016 02:01 AM, Andrew Morton wrote: > On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka <vbabka@suse.cz> wrote: > >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> with the number of fds passed. We had a customer report page allocation >> failures of order-4 for this allocation. This is a costly order, so it might >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> physically contiguous. Also the allocation is temporary for the duration of the >> syscall, so it's unlikely to stress vmalloc too much. >> >> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so >> it doesn't need this kind of fallback. >> >> ... >> >> --- a/fs/select.c >> +++ b/fs/select.c >> @@ -29,6 +29,7 @@ >> #include <linux/sched/rt.h> >> #include <linux/freezer.h> >> #include <net/busy_poll.h> >> +#include <linux/vmalloc.h> >> >> #include <asm/uaccess.h> >> >> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, >> struct fdtable *fdt; >> /* Allocate small arguments on the stack to save memory and be faster */ >> long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; >> + unsigned long alloc_size; >> >> ret = -EINVAL; >> if (n < 0) >> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, >> bits = stack_fds; >> if (size > sizeof(stack_fds) / 6) { >> /* Not enough space in on-stack array; must use kmalloc */ >> + alloc_size = 6 * size; > > Well. `size' is `unsigned'. The multiplication will be done as 32-bit > so there was no point in making `alloc_size' unsigned long. Uh, right. Thanks. > So can we tighten up the types in this function? size_t might make > sense, but vmalloc() takes a ulong. Let's do size_t then, as the conversion to ulong is safe. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/27/2016 03:38 AM, Eric Dumazet wrote: > On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote: > >> I don't share Eric's concerns about performance here. If the vmalloc() >> is called, we're about to write to that quite large amount of memory >> which we just allocated, and the vmalloc() overhead will be relatively >> low. > > I did not care of the performance of this particular select() system > call really, but other cpus because of more TLB invalidations. There are many other ways to cause those, AFAIK. The reclaim/compaction for order-3 allocation has its own impact on system, including TLB flushes. Or a flood of mmap(MAP_POPULATE) and madvise(MADV_DONTNEED) calls... This vmalloc() would however require raising RLIMIT_NOFILE above the defaults. > At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe > we do not care. I doubt anyone runs that in production, especially if performance is of concern. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2016-09-27 at 10:13 +0200, Vlastimil Babka wrote: > I doubt anyone runs that in production, especially if performance is of concern. > I doubt anyone serious runs select() on a large fd set in production. Last time I used it was in last century. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/select.c b/fs/select.c index 8ed9da50896a..b99e98524fde 100644 --- a/fs/select.c +++ b/fs/select.c @@ -29,6 +29,7 @@ #include <linux/sched/rt.h> #include <linux/freezer.h> #include <net/busy_poll.h> +#include <linux/vmalloc.h> #include <asm/uaccess.h> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, struct fdtable *fdt; /* Allocate small arguments on the stack to save memory and be faster */ long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; + unsigned long alloc_size; ret = -EINVAL; if (n < 0) @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, bits = stack_fds; if (size > sizeof(stack_fds) / 6) { /* Not enough space in on-stack array; must use kmalloc */ + alloc_size = 6 * size; ret = -ENOMEM; - bits = kmalloc(6 * size, GFP_KERNEL); + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN); + if (!bits && alloc_size > PAGE_SIZE) + bits = vmalloc(alloc_size); + if (!bits) goto out_nofds; } @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, out: if (bits != stack_fds) - kfree(bits); + kvfree(bits); out_nofds: return ret; }
The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows with the number of fds passed. We had a customer report page allocation failures of order-4 for this allocation. This is a costly order, so it might easily fail, as the VM expects such allocation to have a lower-order fallback. Such trivial fallback is vmalloc(), as the memory doesn't have to be physically contiguous. Also the allocation is temporary for the duration of the syscall, so it's unlikely to stress vmalloc too much. Note that the poll(2) syscall seems to use a linked list of order-0 pages, so it doesn't need this kind of fallback. [eric.dumazet@gmail.com: fix failure path logic] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- fs/select.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)