Message ID | cover.1695679700.git.falcon@tinylab.org (mailing list archive) |
---|---|
Series | DCE/DSE: Add Dead Syscalls Elimination support, part1 |
On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
>
> This series aims to add DCE based DSE support, here is the first
> revision of the RFC patchset [1], the whole series includes three parts,
> here is the Part1.
>
> This Part1 adds basic DCE based DSE support.
>
> Part2 will further eliminate the unused syscalls forcibly kept by the
> exception tables.
>
> Part3 will add DSE test support with nolibc-test.c.

I missed the RFC version, but I think this is a useful thing to have overall, though it will probably need to go through a couple of revisions and rewrites, mostly to ensure we are not adding complexity that gets in the way of other improvements I would like to see to the syscall entry handling.

It would be nice to include some size numbers here for at least one practical use case. If you have a defconfig for a shipping product with a small kernel, what is the 'size -B' output you see comparing with and without DCE, and with DCE+DSE?

There is generally not much work going into micro-optimizing the size of the kernel image any more, for a number of reasons, but if you are able to show that this is a noticeable improvement, we should be able to find a way to do it. Geert is doing statistics about size bloat over time, and anything that undoes a couple of years' worth of bloat would clearly be significant here.

Another alternative would be to resume the work done by Nicolas Pitre, who added Kconfig symbols for controlling groups of system calls. Since we already have a number of those compile-time options, adding more of them should generally be less controversial and more consistent, while bringing most of the same benefits.

Arnd
On Tue, Sep 26, 2023, at 09:14, Arnd Bergmann wrote:
> On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
>
> It would be nice to include some size numbers here for at least
> one practical use case. If you have a defconfig for a shipping
> product with a small kernel, what is the 'size -B' output you
> see comparing with and without DCE, and with DCE+DSE?

To follow up on this myself, for a very rough baseline, I tried a riscv tinyconfig build with and without CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (this is currently not supported on arm, so I did not try it there), and then another build with simply *all* system calls stubbed out by hacking asm/syscall-wrapper.h:

$ size build/tmp/vmlinux-*
    text     data     bss       dec     hex filename
  754772   220016   71841   1046629   ff865 vmlinux-tinyconfig
  717500   223368   71841   1012709   f73e5 vmlinux-tiny+nosyscalls
  567310   176200   71473    814983   c6f87 vmlinux-tiny+gc-sections
  493278   170752   71433    735463   b38e7 vmlinux-tiny+gc-sections+nosyscalls
10120058  3572756  493701  14186515  d87813 vmlinux-defconfig
 9953934  3529004  491525  13974463  d53bbf vmlinux-defconfig+gc
 9709856  3500600  489221  13699677  d10a5d vmlinux-defconfig+gc+nosyscalls

This would put us at an upper bound of 10% size savings (80kb) for tinyconfig, which is clearly significant. For defconfig, it's still 2.0% or 275kb size reduction when all syscalls are dropped.

Arnd
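As a cross-check of the percentages quoted above, the following Python sketch recomputes the savings from the `size` rows. The numbers are copied from the output in this mail; the names `SIZES`, `total` and `saving` are made up for this illustration and are not part of any kernel tooling.

```python
# Rows copied from the `size` output above: filename -> (text, data, bss).
SIZES = {
    "vmlinux-tinyconfig": (754772, 220016, 71841),
    "vmlinux-tiny+nosyscalls": (717500, 223368, 71841),
    "vmlinux-tiny+gc-sections": (567310, 176200, 71473),
    "vmlinux-tiny+gc-sections+nosyscalls": (493278, 170752, 71433),
    "vmlinux-defconfig": (10120058, 3572756, 493701),
    "vmlinux-defconfig+gc": (9953934, 3529004, 491525),
    "vmlinux-defconfig+gc+nosyscalls": (9709856, 3500600, 489221),
}


def total(name):
    """text + data + bss, which is what size(1) prints in the 'dec' column."""
    return sum(SIZES[name])


def saving(before, after):
    """Return (bytes saved, percent saved) when going from `before` to `after`."""
    saved = total(before) - total(after)
    return saved, 100.0 * saved / total(before)


# Compare each gc-sections build with its "all syscalls stubbed out" variant.
for before, after in [
    ("vmlinux-tiny+gc-sections", "vmlinux-tiny+gc-sections+nosyscalls"),
    ("vmlinux-defconfig+gc", "vmlinux-defconfig+gc+nosyscalls"),
]:
    saved, pct = saving(before, after)
    print(f"{before} -> {after}: {saved} bytes ({pct:.1f}%)")
```

Both comparisons take the gc-sections build as the baseline, which matches how the 10% and 2.0% upper bounds are stated in the mail.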
On Tue, Sep 26, 2023, at 13:24, Arnd Bergmann wrote:
> $ size build/tmp/vmlinux-*
>     text     data     bss       dec     hex filename
>   754772   220016   71841   1046629   ff865 vmlinux-tinyconfig
>   717500   223368   71841   1012709   f73e5 vmlinux-tiny+nosyscalls
>   567310   176200   71473    814983   c6f87 vmlinux-tiny+gc-sections
>   493278   170752   71433    735463   b38e7 vmlinux-tiny+gc-sections+nosyscalls
> 10120058  3572756  493701  14186515  d87813 vmlinux-defconfig
>  9953934  3529004  491525  13974463  d53bbf vmlinux-defconfig+gc
>  9709856  3500600  489221  13699677  d10a5d vmlinux-defconfig+gc+nosyscalls
>
> This would put us at an upper bound of 10% size savings (80kb) for
> tinyconfig, which is clearly significant. For defconfig, it's
> still 2.0% or 275kb size reduction when all syscalls are dropped.

I did one more test to see which syscalls actually cause bloat when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is set in order to drop them all. I built the above riscv tinyconfig with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and truncated the syscall table before and after each syscall to see the size difference.

A lot of syscalls are already conditional, so those show up as 0, 4 or 8 bytes (not sure why they are not always 0). Others could probably be made to fit within some category that can be made optional (e.g. xattr or adjtimex). Having a Kconfig option for those would also let users remove even more code that is not useful without the syscalls but might be called from somewhere else in the kernel.

Arnd

syscall  size  name
-------------------------
      0     8  io_setup
      1     4  io_destroy
      2     8  io_submit
      3     4  io_cancel
      4     8  io_getevents
      5  1496  setxattr
      6    28  lsetxattr
      7   148  fsetxattr
      8  1404  getxattr
      9    16  lgetxattr
     10    80  fgetxattr
     11   276  listxattr
     12    16  llistxattr
     13    68  flistxattr
     14   460  removexattr
     15    20  lremovexattr
     16    92  fremovexattr
     17   240  getcwd
     18     4  lookup_dcookie
     19     8  eventfd2
     20     4  epoll_create1
     21     8  epoll_ctl
     22     4  epoll_pwait
     23    64  dup
     24   300  dup3
     25  1684  fcntl
     26     4  inotify_init1
     27     8  inotify_add_watch
     29     0  ioctl
     28     4  inotify_rm_watch
     30     8  ioprio_set
     31     4  ioprio_get
     32     8  flock
     33   456  mknodat
     34   192  mkdirat
     35    64  unlinkat
     36   208  symlinkat
     38     0  renameat
     37   324  linkat
     40     0  mount
     39    64  umount2
     42     0  nfsservctl
     41   708  pivot_root
     43   424  statfs
     44   132  fstatfs
     45   272  truncate
     46   216  ftruncate
     47    88  fallocate
     48   420  faccessat
     49   120  chdir
     50   112  fchdir
     51   120  chroot
     52    68  fchmod
     53   164  fchmodat
     54   184  fchownat
     55   136  fchown
     56   184  openat
     57   204  close
     58     4  vhangup
     59   648  pipe2
     61     0  getdents64
     60     4  quotactl
     62   148  lseek
     63   328  read
     64   356  write
     65   952  readv
     66   252  writev
     67    92  pread64
     68    92  pwrite64
     69   100  preadv
     71     0  sendfile
     72     0  pselect6
     70   100  pwritev
     73   132  ppoll
     74     4  signalfd4
     75  2808  vmsplice
     76  1388  splice
     77   536  tee
     78   424  readlinkat
     79   244  fstatat
     80    64  fstat
     81   296  sync
     82   100  fsync
     83    20  fdatasync
     84   448  sync_file_range
     85     8  timerfd_create
     86     4  timerfd_settime
     87     8  timerfd_gettime
     88   300  utimensat
     89     4  acct
     90     8  capget
     91     4  capset
     92    24  personality
     93    24  exit
     94    24  exit_group
     95    16  waitid
     96    28  set_tid_address
     97   608  unshare
     98     4  futex
     99     8  set_robust_list
    100     4  get_robust_list
    101   276  nanosleep
    103     0  setitimer
    102     8  getitimer
    104     4  kexec_load
    105     8  init_module
    107     0  timer_create
    108     0  timer_gettime
    109     0  timer_getoverrun
    110     0  timer_settime
    111     0  timer_delete
    106     4  delete_module
    112    44  clock_settime
    113    88  clock_gettime
    114    64  clock_getres
    115   160  clock_nanosleep
    116     8  syslog
    117   740  ptrace
    118   140  sched_setparam
    119    36  sched_setscheduler
    120    64  sched_getscheduler
    121    88  sched_getparam
    122   196  sched_setaffinity
    123   180  sched_getaffinity
    124    24  sched_yield
    125    60  sched_get_priority_max
    126    60  sched_get_priority_min
    127   164  sched_rr_get_interval
    128    12  restart_syscall
    129   304  kill
    130   212  tkill
    131    40  tgkill
    132   100  sigaltstack
    133   104  rt_sigsuspend
    134   396  rt_sigaction
    135   180  rt_sigprocmask
    136    76  rt_sigpending
    137   336  rt_sigtimedwait
    139     0  rt_sigreturn
    138   120  rt_sigqueueinfo
    140   396  setpriority
    141   276  getpriority
    142  1256  reboot
    143     4  setregid
    144     8  setgid
    145     4  setreuid
    146     8  setuid
    147     4  setresuid
    148     8  getresuid
    149     4  setresgid
    150     8  getresgid
    151     4  setfsuid
    152     8  setfsgid
    153   152  times
    154   252  setpgid
    155    48  getpgid
    156    48  getsid
    157   140  setsid
    158     8  getgroups
    159     4  setgroups
    160   172  uname
    161   132  sethostname
    162   136  setdomainname
    163   156  getrlimit
    164    52  setrlimit
    165    88  getrusage
    167     0  prctl
    168     0  getcpu
    169     0  gettimeofday
    170     0  settimeofday
    166    24  umask
    171  1514  adjtimex
    172    20  getpid
    173    20  getppid
    174     4  getuid
    175     4  geteuid
    176     4  getgid
    177     4  getegid
    178    20  gettid
    179   276  sysinfo
    180     4  mq_open
    181     8  mq_unlink
    182     4  mq_timedsend
    183     8  mq_timedreceive
    184     4  mq_notify
    185     8  mq_getsetattr
    186     4  msgget
    187     8  msgctl
    188     4  msgrcv
    189     8  msgsnd
    190     4  semget
    191     8  semctl
    192     4  semtimedop
    193     8  semop
    194     4  shmget
    195     8  shmctl
    196     4  shmat
    197     8  shmdt
    198     4  socket
    199     8  socketpair
    200     4  bind
    201     8  listen
    202     4  accept
    203     8  connect
    204     4  getsockname
    205     8  getpeername
    206     4  sendto
    207     8  recvfrom
    208     4  setsockopt
    209     8  getsockopt
    210     4  shutdown
    211     8  sendmsg
    212     4  recvmsg
    213   460  readahead
    214  2872  brk
    215   288  munmap
    216  4268  mremap
    217     4  add_key
    218     8  request_key
    219     4  keyctl
    220   100  clone
    221   724  execve
    222  2504  mmap
    223     8  fadvise64
    224     4  swapon
    225     8  swapoff
    226  2180  mprotect
    227   320  msync
    228  1140  mlock
    229    84  munlock
    230   304  mlockall
    231    52  munlockall
    232   828  mincore
    233     4  madvise
    234   324  remap_file_pages
    235     4  mbind
    236     8  get_mempolicy
    237     4  set_mempolicy
    238     8  migrate_pages
    239     4  move_pages
    240   132  rt_tgsigqueueinfo
    241     8  perf_event_open
    242     4  accept4
    244     0  arch_specific_syscall
    243     8  recvmmsg
    260   100  wait4
    261   252  prlimit64
    262     8  fanotify_init
    263     4  fanotify_mark
    264     8  name_to_handle_at
    266     0  clock_adjtime
    265     4  open_by_handle_at
    267   120  syncfs
    268   624  setns
    269     4  sendmmsg
    270     8  process_vm_readv
    271     4  process_vm_writev
    272     8  kcmp
    274     0  sched_setattr
    273     4  finit_module
    275   208  sched_getattr
    276  2364  renameat2
    277     4  seccomp
    278   124  getrandom
    279     4  memfd_create
    280     8  bpf
    281    52  execveat
    282     4  userfaultfd
    283     8  membarrier
    284    40  mlock2
    285   708  copy_file_range
    286    32  preadv2
    287    32  pwritev2
    288     8  pkey_mprotect
    289     4  pkey_alloc
    290     8  pkey_free
    291   356  statx
    292     4  io_pgetevents
    424   244  pidfd_send_signal
    425     8  io_uring_setup
    426     4  io_uring_enter
    427     8  io_uring_register
    428   368  open_tree
    429   404  move_mount
    430   556  fsopen
    431  1056  fsconfig
    432   484  fsmount
    433   220  fspick
    434   124  pidfd_open
    435   516  clone3
    436   240  close_range
    437   120  openat2
    438   304  pidfd_getfd
    439    12  faccessat2
    440     8  process_madvise
    441     4  epoll_pwait2
    442  1088  mount_setattr
    443     8  quotactl_fd
    444     4  landlock_create_ruleset
    445     8  landlock_add_rule
    446     4  landlock_restrict_self
    447     8  memfd_secret
    448   240  process_mrelease
    449     4  futex_waitv
    450     8  set_mempolicy_home_node
    451     4  cachestat
    452    28  fchmodat2
    454     4  futex_wake
On Tue, 26 Sep 2023, Arnd Bergmann wrote:
> On Tue, Sep 26, 2023, at 09:14, Arnd Bergmann wrote:
> > On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
> >
> > It would be nice to include some size numbers here for at least
> > one practical use case. If you have a defconfig for a shipping
> > product with a small kernel, what is the 'size -B' output you
> > see comparing with and without DCE, and with DCE+DSE?
>
> To follow up on this myself, for a very rough baseline,
> I tried a riscv tinyconfig build with and without
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (this is currently
> not supported on arm, so I did not try it there), and
> then another build with simply *all* system calls stubbed
> out by hacking asm/syscall-wrapper.h:
>
> $ size build/tmp/vmlinux-*
>     text     data     bss       dec     hex filename
>   754772   220016   71841   1046629   ff865 vmlinux-tinyconfig
>   717500   223368   71841   1012709   f73e5 vmlinux-tiny+nosyscalls
>   567310   176200   71473    814983   c6f87 vmlinux-tiny+gc-sections
>   493278   170752   71433    735463   b38e7 vmlinux-tiny+gc-sections+nosyscalls
> 10120058  3572756  493701  14186515  d87813 vmlinux-defconfig
>  9953934  3529004  491525  13974463  d53bbf vmlinux-defconfig+gc
>  9709856  3500600  489221  13699677  d10a5d vmlinux-defconfig+gc+nosyscalls
>
> This would put us at an upper bound of 10% size savings (80kb) for
> tinyconfig, which is clearly significant. For defconfig, it's
> still 2.0% or 275kb size reduction when all syscalls are dropped.

I did something similar a while ago. Results included here:

https://lwn.net/Articles/746780/

In my case, stubbing out all syscalls produced a 7.8% reduction which was somewhat disappointing compared to other techniques. Of course it all depends on what is your actual goal.

Nicolas
On Tue, Sep 26, 2023, at 22:49, Nicolas Pitre wrote:
> On Tue, 26 Sep 2023, Arnd Bergmann wrote:
>
>> $ size build/tmp/vmlinux-*
>>     text     data     bss       dec     hex filename
>>   754772   220016   71841   1046629   ff865 vmlinux-tinyconfig
>>   717500   223368   71841   1012709   f73e5 vmlinux-tiny+nosyscalls
>>   567310   176200   71473    814983   c6f87 vmlinux-tiny+gc-sections
>>   493278   170752   71433    735463   b38e7 vmlinux-tiny+gc-sections+nosyscalls
>> 10120058  3572756  493701  14186515  d87813 vmlinux-defconfig
>>  9953934  3529004  491525  13974463  d53bbf vmlinux-defconfig+gc
>>  9709856  3500600  489221  13699677  d10a5d vmlinux-defconfig+gc+nosyscalls
>>
>> This would put us at an upper bound of 10% size savings (80kb) for
>> tinyconfig, which is clearly significant. For defconfig, it's
>> still 2.0% or 275kb size reduction when all syscalls are dropped.
>
> I did something similar a while ago. Results included here:
>
> https://lwn.net/Articles/746780/
>
> In my case, stubbing out all syscalls produced a 7.8% reduction which
> was somewhat disappointing compared to other techniques. Of course it
> all depends on what is your actual goal.

Thanks for the link, I had forgotten about your article. With all the findings combined, I guess the filtering at the syscall table level is not all that promising any more. Going through the list of saved space, I ended up with 5.7% (47kb) in the best case after keeping only the 40 syscalls from the example in this thread.

Removing entire groups of features using normal Kconfig symbols, based on the remaining syscalls that have the largest size, probably gives better results. I can see possible groups of syscalls that could be disabled under CONFIG_EXPERT, along with making their underlying infrastructure optional:

- xattr
- ptrace
- adjtimex
- splice/vmsplice/tee
- unshare/setns
- sched_*

After those, one would quickly hit diminishing returns.

Arnd
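To get a feel for how much each of the groups listed above accounts for, the per-syscall sizes from the table earlier in this thread can simply be added up. The Python sketch below does that; the sizes are copied from that table, while the grouping itself is an assumption made for illustration (in particular, `sched_*` is taken to include `sched_setattr` and `sched_getattr`, and `GROUPS` / `group_totals` are hypothetical names).

```python
# Per-syscall sizes in bytes, copied from the riscv tinyconfig table
# earlier in this thread. The grouping is an illustrative assumption.
GROUPS = {
    "xattr": {
        "setxattr": 1496, "lsetxattr": 28, "fsetxattr": 148,
        "getxattr": 1404, "lgetxattr": 16, "fgetxattr": 80,
        "listxattr": 276, "llistxattr": 16, "flistxattr": 68,
        "removexattr": 460, "lremovexattr": 20, "fremovexattr": 92,
    },
    "ptrace": {"ptrace": 740},
    "adjtimex": {"adjtimex": 1514},
    "splice/vmsplice/tee": {"vmsplice": 2808, "splice": 1388, "tee": 536},
    "unshare/setns": {"unshare": 608, "setns": 624},
    "sched_*": {
        "sched_setparam": 140, "sched_setscheduler": 36,
        "sched_getscheduler": 64, "sched_getparam": 88,
        "sched_setaffinity": 196, "sched_getaffinity": 180,
        "sched_yield": 24, "sched_get_priority_max": 60,
        "sched_get_priority_min": 60, "sched_rr_get_interval": 164,
        "sched_setattr": 0, "sched_getattr": 208,
    },
}


def group_totals(groups):
    """Sum the per-syscall sizes inside each group."""
    return {name: sum(sizes.values()) for name, sizes in groups.items()}


totals = group_totals(GROUPS)
# Print the groups from largest to smallest, then the grand total.
for name, size in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{size:6d}  {name}")
print(f"{sum(totals.values()):6d}  all groups")
```

These sums only cover what the table attributes to each syscall entry. Making the underlying infrastructure optional as well, as suggested above, may remove additional code that this simple addition cannot show.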
I don't know why linux-kernel@vger.kernel.org rejects my email sent by Thunderbird, so I am resending this mail with git send-email.

Here is a test result about DEAD_CODE_DATA_ELIMINATION (DCE) and dead syscalls elimination (DSE). It's based on config[1] and a simple hello.c initramfs.

In the DSE test, we set CONFIG_SYSCALLS_USED="sys_write sys_exit sys_reboot", which are the syscalls used by hello.c to simply print "Hello", then exit and shut down qemu.

| Build                              | syscalls remaining | vmlinux size     | vmlinux size after strip |
| ---------------------------------- | ------------------ | ---------------- | ------------------------ |
| disable DCE                        | 236                | 2559632          | 1963400                  |
| enable DCE                         | 208                | 2037384 (-20.4%) | 1485776 (-24.3%)         |
| enable DCE and DSE(SHF_GROUP)      | 3                  | 1856640 (-27.6%) | 1354424 (-31.0%)         |
| enable DCE and DSE(SHF_LINK_ORDER) | 3                  | 1856664 (-27.6%) | 1354424 (-31.0%)         |

It shows that dead syscalls elimination saves about another 7 percentage points of the original vmlinux size on top of DCE.

[1]: https://pastebin.com/KG4fd7aT
I didn't test DSE with explicit KEEP() in the previous mail, so I will make up for it now.

This test result is about DEAD_CODE_DATA_ELIMINATION (DCE) and dead syscalls elimination (DSE). It's based on config[1] and a simple hello.c initramfs.

We set CONFIG_SYSCALLS_USED="sys_write sys_exit sys_reboot", which are the syscalls used by hello.c to simply print "Hello", then exit and shut down qemu.

| Build                                                        | syscalls remaining | vmlinux size     | vmlinux size after strip |
| ------------------------------------------------------------ | ------------------ | ---------------- | ------------------------ |
| disable DCE                                                  | 236                | 2559632          | 1963400                  |
| enable DCE                                                   | 208                | 2037384 (-20.4%) | 1485776 (-24.3%)         |
| enable DCE and DSE with explicit KEEP() of exception table   | 17                 | 1899208 (-25.8%) | 1387272 (-29.3%)         |
| enable DCE and DSE without KEEP() (by SHF_GROUP method)      | 3                  | 1856640 (-27.6%) | 1354424 (-31.0%)         |
| enable DCE and DSE without KEEP() (by SHF_LINK_ORDER method) | 3                  | 1856664 (-27.6%) | 1354424 (-31.0%)         |

It shows that dead syscalls elimination saves about another 7 percentage points of the original vmlinux size on top of DCE. Although dropping KEEP() saves only about 2% more space, it reduces the attack surface and eliminates the misuse of KEEP(). It also ensures that no section is left orphaned.

[1]: https://pastebin.com/KG4fd7aT
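The percentages in the table are all relative to the build without DCE, so the gain attributed to DSE reads differently depending on the baseline. The following Python sketch, using the unstripped vmlinux sizes from the table, shows both views. The labels, `VMLINUX` and `reduction` are names chosen for this illustration only.

```python
# Unstripped vmlinux sizes in bytes, copied from the table above.
VMLINUX = {
    "no DCE": 2559632,
    "DCE": 2037384,
    "DCE+DSE with KEEP()": 1899208,
    "DCE+DSE without KEEP() (SHF_GROUP)": 1856640,
}


def reduction(before, after):
    """Size reduction from `before` to `after`, in percent of `before`."""
    return 100.0 * (before - after) / before


base = VMLINUX["no DCE"]
dce = VMLINUX["DCE"]
keep = VMLINUX["DCE+DSE with KEEP()"]
dse = VMLINUX["DCE+DSE without KEEP() (SHF_GROUP)"]

# View 1: everything relative to the build without DCE (as in the table).
points = reduction(base, dse) - reduction(base, dce)
# View 2: what DSE removes from an image that already has DCE enabled.
relative = reduction(dce, dse)
# Extra gain from dropping KEEP(), relative to the KEEP() build.
no_keep = reduction(keep, dse)

print(f"DCE alone vs no DCE:          {reduction(base, dce):.1f}%")
print(f"DSE gain in percentage points: {points:.1f}")
print(f"DSE gain relative to DCE build: {relative:.1f}%")
print(f"Dropping KEEP() vs keeping it:  {no_keep:.1f}%")
```

Which view is more useful depends on the question: the percentage-point figure compares against the original image, while the relative figure describes what a system that already ships with DCE would gain.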