[v1,0/7] DCE/DSE: Add Dead Syscalls Elimination support, part1

Message ID cover.1695679700.git.falcon@tinylab.org (mailing list archive)

Message

Zhangjin Wu Sept. 25, 2023, 10:33 p.m. UTC
Hi, all

This series adds DCE-based DSE (Dead Syscalls Elimination) support. It is
the first revision after the RFC patchset [1]. The whole series consists
of three parts; this is Part1.

Part1 adds basic DCE-based DSE support.

Part2 will further eliminate the unused syscalls forcibly kept by the
exception tables.

Part3 will add DSE test support with nolibc-test.c.

Changes from RFC patchset [1]:

- The DCE support [2] for RISC-V has been merged [3]
- The "nolibc: Record used syscalls in their own sections" [4] will be
  delayed to Part3

- Add debug support for DCE
- Further allows CONFIG_USED_SYSCALLS to accept a file that stores the used syscalls
- Now only symbolic syscall names are accepted; integral numbers are no longer supported
- Works with the newly added riscv syscall prefix: __riscv_
- Further trims the syscall tables by removing the trailing invalid parts

The nolibc-test based initrd runs well on a riscv64 kernel image with dead
syscalls eliminated:

    $ nm build/riscv64/virt/linux/v6.6-rc2/vmlinux | grep "T __riscv_sys" | grep -v sys_ni_syscall | wc -l
    48

These options should be enabled:

    CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
    CONFIG_LD_DEAD_CODE_DATA_ELIMINATION_DEBUG=y
    CONFIG_TRIM_UNUSED_SYSCALLS=y
    CONFIG_USED_SYSCALLS="sys_dup sys_dup3 sys_ioctl sys_mknodat sys_mkdirat sys_unlinkat sys_symlinkat sys_linkat sys_mount sys_chdir sys_chroot sys_fchmodat sys_fchownat sys_openat sys_close sys_pipe2 sys_getdents64 sys_lseek sys_read sys_write sys_pselect6 sys_ppoll sys_exit sys_sched_yield sys_kill sys_reboot sys_getpgid sys_prctl sys_gettimeofday sys_getpid sys_getppid sys_getuid sys_geteuid sys_brk sys_munmap sys_clone sys_execve sys_mmap sys_wait4 sys_statx"

The syscalls actually used:

    $ echo "sys_dup sys_dup3 sys_ioctl sys_mknodat sys_mkdirat sys_unlinkat sys_symlinkat sys_linkat sys_mount sys_chdir sys_chroot sys_fchmodat sys_fchownat sys_openat sys_close sys_pipe2 sys_getdents64 sys_lseek sys_read sys_write sys_pselect6 sys_ppoll sys_exit sys_sched_yield sys_kill sys_reboot sys_getpgid sys_prctl sys_gettimeofday sys_getpid sys_getppid sys_getuid sys_geteuid sys_brk sys_munmap sys_clone sys_execve sys_mmap sys_wait4 sys_statx" | tr ' ' '\n' | wc -l
    40

Thanks to Yuan Tan, who has researched and verified the elimination of
the unused syscalls forcibly kept by the exception tables; both the section
group and the section link order attributes of ld work. Part2 will be sent
out soon to remove another 8 unused syscalls, and eventually we will be
able to run a dead loop application on a kernel image without
syscalls.
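
The counts above can be cross-checked with a few lines of Python. This is
only a sketch: it re-splits the CONFIG_USED_SYSCALLS string from this mail
and compares it with the 48 symbols reported by nm. The variable names are
illustrative and not part of the series.

```python
# Cross-check of the syscall counts quoted in this cover letter.
# The list is copied from CONFIG_USED_SYSCALLS; 48 is the nm | wc -l result.
USED_SYSCALLS = (
    "sys_dup sys_dup3 sys_ioctl sys_mknodat sys_mkdirat sys_unlinkat "
    "sys_symlinkat sys_linkat sys_mount sys_chdir sys_chroot sys_fchmodat "
    "sys_fchownat sys_openat sys_close sys_pipe2 sys_getdents64 sys_lseek "
    "sys_read sys_write sys_pselect6 sys_ppoll sys_exit sys_sched_yield "
    "sys_kill sys_reboot sys_getpgid sys_prctl sys_gettimeofday sys_getpid "
    "sys_getppid sys_getuid sys_geteuid sys_brk sys_munmap sys_clone "
    "sys_execve sys_mmap sys_wait4 sys_statx"
)

used = USED_SYSCALLS.split()
# A duplicate entry would make the count misleading, so check for it.
assert len(used) == len(set(used)), "duplicate entries in the list"

kept_in_vmlinux = 48        # "T __riscv_sys*" symbols left after DCE+DSE
requested = len(used)       # syscalls requested via CONFIG_USED_SYSCALLS
extra = kept_in_vmlinux - requested

print(f"requested={requested} kept={kept_in_vmlinux} extra={extra}")
# prints: requested=40 kept=48 extra=8
```

The difference of 8 matches the number of syscalls that Part2 is expected
to remove.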

Best Regards,
Zhangjin Wu

---
[1]: https://lore.kernel.org/lkml/cover.1676594211.git.falcon@tinylab.org/
[2]: https://lore.kernel.org/lkml/234017be6d06ef84844583230542e31068fa3685.1676594211.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/CAFP8O3+41QFVyNTVJ2iZYkB0tqnvdLTAoGShgGy-qPP1PHjBEw@mail.gmail.com/
[4]: https://lore.kernel.org/lkml/cbcbfbb37cabfd9aed6088c75515e4ea86006cff.1676594211.git.falcon@tinylab.org/

Zhangjin Wu (7):
  DCE: add debug support
  DCE/DSE: add unused syscalls elimination configure support
  DCE/DSE: Add a new scripts/Makefile.syscalls
  DCE/DSE: mips: add HAVE_TRIM_UNUSED_SYSCALLS support
  DCE/DSE: riscv: move syscall tables to syscalls/
  DCE/DSE: riscv: add HAVE_TRIM_UNUSED_SYSCALLS support
  DCE/DSE: riscv: trim syscall tables

 Makefile                                      |  3 +
 arch/mips/Kconfig                             |  1 +
 arch/mips/kernel/syscalls/Makefile            | 23 ++++++-
 arch/riscv/Kconfig                            |  1 +
 arch/riscv/include/asm/unistd.h               |  2 +
 arch/riscv/kernel/Makefile                    |  7 +-
 arch/riscv/kernel/syscalls/Makefile           | 69 +++++++++++++++++++
 .../{ => syscalls}/compat_syscall_table.c     |  4 +-
 .../kernel/{ => syscalls}/syscall_table.c     |  4 +-
 init/Kconfig                                  | 49 +++++++++++++
 scripts/Makefile.syscalls                     | 29 ++++++++
 11 files changed, 182 insertions(+), 10 deletions(-)
 create mode 100644 arch/riscv/kernel/syscalls/Makefile
 rename arch/riscv/kernel/{ => syscalls}/compat_syscall_table.c (82%)
 rename arch/riscv/kernel/{ => syscalls}/syscall_table.c (83%)
 create mode 100644 scripts/Makefile.syscalls

Comments

Arnd Bergmann Sept. 26, 2023, 7:14 a.m. UTC | #1
On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
>
> This series aims to add DCE based DSE support, here is the first
> revision of the RFC patchset [1], the whole series includes three parts,
> here is the Part1.
>
> This Part1 adds basic DCE based DSE support.
>
> Part2 will further eliminate the unused syscalls forcely kept by the
> exception tables.
>
> Part3 will add DSE test support with nolibc-test.c.

I missed the RFC version, but I think this is a useful thing to
have overall, though it will probably need to go through a couple
of revisions and rewrites, mostly to ensure we are not adding
complexity that gets in the way of other improvements I would
like to see to the syscall entry handling.

It would be nice to include some size numbers here for at least
one practical use case. If you have a defconfig for a shipping
product with a small kernel, what is the 'size -B' output you
see with and without DCE, and with DCE+DSE?

There is generally not much work going into micro-optimizing
the size of the kernel image any more, for a number of reasons,
but if you are able to show that this is a noticeable improvement,
we should be able to find a way to do it. Geert is doing statistics
about size bloat over time, and anything that undoes a couple
of years worth of bloat would clearly be significant here.

Another alternative would be to resume the work done by Nicolas
Pitre, who added Kconfig symbols for controlling groups of
system calls. Since we already have a number of those compile
time options, adding more of them should generally be
less controversial and more consistent, while bringing most
of the same benefits.

     Arnd
Arnd Bergmann Sept. 26, 2023, 11:24 a.m. UTC | #2
On Tue, Sep 26, 2023, at 09:14, Arnd Bergmann wrote:
> On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
>
> It would be nice to include some size numbers here for at least
> one practical use case. If you have a defconfig for a shipping
> product with a small kernel, what is the 'size -B' output you
> see comparing with and without DCE and, and with DCE+DSE?

To follow up on this myself, for a very rough baseline,
I tried a riscv tinyconfig build with and without 
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (this is currently
not supported on arm, so I did not try it there), and
then another build with simply *all* system calls stubbed
out by hacking asm/syscall-wrapper.h:

$ size build/tmp/vmlinux-*
   text	   data	    bss	     dec    hex	filename
  754772  220016  71841	 1046629  ff865	vmlinux-tinyconfig
  717500  223368  71841	 1012709  f73e5	vmlinux-tiny+nosyscalls
  567310  176200  71473	  814983  c6f87	vmlinux-tiny+gc-sections
  493278  170752  71433	  735463  b38e7	vmlinux-tiny+gc-sections+nosyscalls
10120058 3572756 493701	14186515 d87813	vmlinux-defconfig
 9953934 3529004 491525	13974463 d53bbf	vmlinux-defconfig+gc
 9709856 3500600 489221	13699677 d10a5d	vmlinux-defconfig+gc+nosyscalls

This would put us at an upper bound of 10% size savings (80kb) for
tinyconfig, which is clearly significant. For defconfig, it's
still 2.0% or 275kb size reduction when all syscalls are dropped.
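
For reference, the percentages can be reproduced from the "dec" column of
the size output. The following is a minimal sketch in Python; the
dictionary keys are shortened file names and the helper name saving() is
made up for this example.

```python
# "dec" totals (text + data + bss, in bytes) copied from the size output.
DEC = {
    "tinyconfig": 1046629,
    "tiny+nosyscalls": 1012709,
    "tiny+gc-sections": 814983,
    "tiny+gc-sections+nosyscalls": 735463,
    "defconfig": 14186515,
    "defconfig+gc": 13974463,
    "defconfig+gc+nosyscalls": 13699677,
}


def saving(before, after):
    """Return (bytes saved, percent saved) going from `before` to `after`."""
    saved = DEC[before] - DEC[after]
    return saved, 100.0 * saved / DEC[before]


for before, after in [
    ("tinyconfig", "tiny+nosyscalls"),
    ("tiny+gc-sections", "tiny+gc-sections+nosyscalls"),
    ("defconfig+gc", "defconfig+gc+nosyscalls"),
]:
    saved, pct = saving(before, after)
    print(f"{before:18s} -> {after:28s} {saved:7d} bytes ({pct:.1f}%)")
```

The first pair shows that stubbing out the syscalls saves noticeably less
when --gc-sections is not in effect.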

     Arnd
Arnd Bergmann Sept. 26, 2023, 2:07 p.m. UTC | #3
On Tue, Sep 26, 2023, at 13:24, Arnd Bergmann wrote:
> $ size build/tmp/vmlinux-*
>    text	   data	    bss	     dec    hex	filename
>   754772  220016  71841	 1046629  ff865	vmlinux-tinyconfig
>   717500  223368  71841	 1012709  f73e5	vmlinux-tiny+nosyscalls
>   567310  176200  71473	  814983  c6f87	vmlinux-tiny+gc-sections
>   493278  170752  71433	  735463  b38e7	vmlinux-tiny+gc-sections+nosyscalls
> 10120058 3572756 493701	14186515 d87813	vmlinux-defconfig
>  9953934 3529004 491525	13974463 d53bbf	vmlinux-defconfig+gc
>  9709856 3500600 489221	13699677 d10a5d	vmlinux-defconfig+gc+nosyscalls
>
> This would put us at an upper bound of 10% size savings (80kb) for
> tinyconfig, which is clearly significant. For defconfig, it's
> still 2.0% or 275kb size reduction when all syscalls are dropped.

I did one more test to see which syscalls actually cause bloat
when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is set in order to drop them
all. I built the above riscv tinyconfig with
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and truncated the syscall
table before and after each syscall to see the size difference.

A lot of syscalls are already conditional, so those show up as
0, 4 or 8 bytes (not sure why they are not always 0). Others
could probably be made to fit within some category that can
be made optional (e.g. xattr or adjtimex). Having a Kconfig
option for those would also let users remove even more code that
is not useful without the syscalls but might be called from
somewhere else in the kernel.

      Arnd

syscall  size   name
-------------------------
0	8	io_setup
1	4	io_destroy
2	8	io_submit
3	4	io_cancel
4	8	io_getevents
5	1496	setxattr
6	28	lsetxattr
7	148	fsetxattr
8	1404	getxattr
9	16	lgetxattr
10	80	fgetxattr
11	276	listxattr
12	16	llistxattr
13	68	flistxattr
14	460	removexattr
15	20	lremovexattr
16	92	fremovexattr
17	240	getcwd
18	4	lookup_dcookie
19	8	eventfd2
20	4	epoll_create1
21	8	epoll_ctl
22	4	epoll_pwait
23	64	dup
24	300	dup3
25	1684	fcntl
26	4	inotify_init1
27	8	inotify_add_watch
29	0	ioctl
28	4	inotify_rm_watch
30	8	ioprio_set
31	4	ioprio_get
32	8	flock
33	456	mknodat
34	192	mkdirat
35	64	unlinkat
36	208	symlinkat
38	0	renameat
37	324	linkat
40	0	mount
39	64	umount2
42	0	nfsservctl
41	708	pivot_root
43	424	statfs
44	132	fstatfs
45	272	truncate
46	216	ftruncate
47	88	fallocate
48	420	faccessat
49	120	chdir
50	112	fchdir
51	120	chroot
52	68	fchmod
53	164	fchmodat
54	184	fchownat
55	136	fchown
56	184	openat
57	204	close
58	4	vhangup
59	648	pipe2
61	0	getdents64
60	4	quotactl
62	148	lseek
63	328	read
64	356	write
65	952	readv
66	252	writev
67	92	pread64
68	92	pwrite64
69	100	preadv
71	0	sendfile
72	0	pselect6
70	100	pwritev
73	132	ppoll
74	4	signalfd4
75	2808	vmsplice
76	1388	splice
77	536	tee
78	424	readlinkat
79	244	fstatat
80	64	fstat
81	296	sync
82	100	fsync
83	20	fdatasync
84	448	sync_file_range
85	8	timerfd_create
86	4	timerfd_settime
87	8	timerfd_gettime
88	300	utimensat
89	4	acct
90	8	capget
91	4	capset
92	24	personality
93	24	exit
94	24	exit_group
95	16	waitid
96	28	set_tid_address
97	608	unshare
98	4	futex
99	8	set_robust_list
100	4	get_robust_list
101	276	nanosleep
103	0	setitimer
102	8	getitimer
104	4	kexec_load
105	8	init_module
107	0	timer_create
108	0	timer_gettime
109	0	timer_getoverrun
110	0	timer_settime
111	0	timer_delete
106	4	delete_module
112	44	clock_settime
113	88	clock_gettime
114	64	clock_getres
115	160	clock_nanosleep
116	8	syslog
117	740	ptrace
118	140	sched_setparam
119	36	sched_setscheduler
120	64	sched_getscheduler
121	88	sched_getparam
122	196	sched_setaffinity
123	180	sched_getaffinity
124	24	sched_yield
125	60	sched_get_priority_max
126	60	sched_get_priority_min
127	164	sched_rr_get_interval
128	12	restart_syscall
129	304	kill
130	212	tkill
131	40	tgkill
132	100	sigaltstack
133	104	rt_sigsuspend
134	396	rt_sigaction
135	180	rt_sigprocmask
136	76	rt_sigpending
137	336	rt_sigtimedwait
139	0	rt_sigreturn
138	120	rt_sigqueueinfo
140	396	setpriority
141	276	getpriority
142	1256	reboot
143	4	setregid
144	8	setgid
145	4	setreuid
146	8	setuid
147	4	setresuid
148	8	getresuid
149	4	setresgid
150	8	getresgid
151	4	setfsuid
152	8	setfsgid
153	152	times
154	252	setpgid
155	48	getpgid
156	48	getsid
157	140	setsid
158	8	getgroups
159	4	setgroups
160	172	uname
161	132	sethostname
162	136	setdomainname
163	156	getrlimit
164	52	setrlimit
165	88	getrusage
167	0	prctl
168	0	getcpu
169	0	gettimeofday
170	0	settimeofday
166	24	umask
171	1514	adjtimex
172	20	getpid
173	20	getppid
174	4	getuid
175	4	geteuid
176	4	getgid
177	4	getegid
178	20	gettid
179	276	sysinfo
180	4	mq_open
181	8	mq_unlink
182	4	mq_timedsend
183	8	mq_timedreceive
184	4	mq_notify
185	8	mq_getsetattr
186	4	msgget
187	8	msgctl
188	4	msgrcv
189	8	msgsnd
190	4	semget
191	8	semctl
192	4	semtimedop
193	8	semop
194	4	shmget
195	8	shmctl
196	4	shmat
197	8	shmdt
198	4	socket
199	8	socketpair
200	4	bind
201	8	listen
202	4	accept
203	8	connect
204	4	getsockname
205	8	getpeername
206	4	sendto
207	8	recvfrom
208	4	setsockopt
209	8	getsockopt
210	4	shutdown
211	8	sendmsg
212	4	recvmsg
213	460	readahead
214	2872	brk
215	288	munmap
216	4268	mremap
217	4	add_key
218	8	request_key
219	4	keyctl
220	100	clone
221	724	execve
222	2504	mmap
223	8	fadvise64
224	4	swapon
225	8	swapoff
226	2180	mprotect
227	320	msync
228	1140	mlock
229	84	munlock
230	304	mlockall
231	52	munlockall
232	828	mincore
233	4	madvise
234	324	remap_file_pages
235	4	mbind
236	8	get_mempolicy
237	4	set_mempolicy
238	8	migrate_pages
239	4	move_pages
240	132	rt_tgsigqueueinfo
241	8	perf_event_open
242	4	accept4
244	0	arch_specific_syscall
243	8	recvmmsg
260	100	wait4
261	252	prlimit64
262	8	fanotify_init
263	4	fanotify_mark
264	8	name_to_handle_at
266	0	clock_adjtime
265	4	open_by_handle_at
267	120	syncfs
268	624	setns
269	4	sendmmsg
270	8	process_vm_readv
271	4	process_vm_writev
272	8	kcmp
274	0	sched_setattr
273	4	finit_module
275	208	sched_getattr
276	2364	renameat2
277	4	seccomp
278	124	getrandom
279	4	memfd_create
280	8	bpf
281	52	execveat
282	4	userfaultfd
283	8	membarrier
284	40	mlock2
285	708	copy_file_range
286	32	preadv2
287	32	pwritev2
288	8	pkey_mprotect
289	4	pkey_alloc
290	8	pkey_free
291	356	statx
292	4	io_pgetevents
424	244	pidfd_send_signal
425	8	io_uring_setup
426	4	io_uring_enter
427	8	io_uring_register
428	368	open_tree
429	404	move_mount
430	556	fsopen
431	1056	fsconfig
432	484	fsmount
433	220	fspick
434	124	pidfd_open
435	516	clone3
436	240	close_range
437	120	openat2
438	304	pidfd_getfd
439	12	faccessat2
440	8	process_madvise
441	4	epoll_pwait2
442	1088	mount_setattr
443	8	quotactl_fd
444	4	landlock_create_ruleset
445	8	landlock_add_rule
446	4	landlock_restrict_self
447	8	memfd_secret
448	240	process_mrelease
449	4	futex_waitv
450	8	set_mempolicy_home_node
451	4	cachestat
452	28	fchmodat2
454	4	futex_wake
Nicolas Pitre Sept. 26, 2023, 8:49 p.m. UTC | #4
On Tue, 26 Sep 2023, Arnd Bergmann wrote:

> On Tue, Sep 26, 2023, at 09:14, Arnd Bergmann wrote:
> > On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
> >
> > It would be nice to include some size numbers here for at least
> > one practical use case. If you have a defconfig for a shipping
> > product with a small kernel, what is the 'size -B' output you
> > see comparing with and without DCE and, and with DCE+DSE?
> 
> To follow up on this myself, for a very rough baseline,
> I tried a riscv tinyconfig build with and without 
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (this is currently
> not supported on arm, so I did not try it there), and
> then another build with simply *all* system calls stubbed
> out by hacking asm/syscall-wrapper.h:
> 
> $ size build/tmp/vmlinux-*
>    text	   data	    bss	     dec    hex	filename
>   754772  220016  71841	 1046629  ff865	vmlinux-tinyconfig
>   717500  223368  71841	 1012709  f73e5	vmlinux-tiny+nosyscalls
>   567310  176200  71473	  814983  c6f87	vmlinux-tiny+gc-sections
>   493278  170752  71433	  735463  b38e7	vmlinux-tiny+gc-sections+nosyscalls
> 10120058 3572756 493701	14186515 d87813	vmlinux-defconfig
>  9953934 3529004 491525	13974463 d53bbf	vmlinux-defconfig+gc
>  9709856 3500600 489221	13699677 d10a5d	vmlinux-defconfig+gc+nosyscalls
> 
> This would put us at an upper bound of 10% size savings (80kb) for
> tinyconfig, which is clearly significant. For defconfig, it's
> still 2.0% or 275kb size reduction when all syscalls are dropped.

I did something similar a while ago. Results included here:

https://lwn.net/Articles/746780/

In my case, stubbing out all syscalls produced a 7.8% reduction which 
was somewhat disappointing compared to other techniques. Of course it 
all depends on what is your actual goal.


Nicolas
Arnd Bergmann Sept. 27, 2023, 10:21 a.m. UTC | #5
On Tue, Sep 26, 2023, at 22:49, Nicolas Pitre wrote:
> On Tue, 26 Sep 2023, Arnd Bergmann wrote:
>
>> $ size build/tmp/vmlinux-*
>>    text	   data	    bss	     dec    hex	filename
>>   754772  220016  71841	 1046629  ff865	vmlinux-tinyconfig
>>   717500  223368  71841	 1012709  f73e5	vmlinux-tiny+nosyscalls
>>   567310  176200  71473	  814983  c6f87	vmlinux-tiny+gc-sections
>>   493278  170752  71433	  735463  b38e7	vmlinux-tiny+gc-sections+nosyscalls
>> 10120058 3572756 493701	14186515 d87813	vmlinux-defconfig
>>  9953934 3529004 491525	13974463 d53bbf	vmlinux-defconfig+gc
>>  9709856 3500600 489221	13699677 d10a5d	vmlinux-defconfig+gc+nosyscalls
>> 
>> This would put us at an upper bound of 10% size savings (80kb) for
>> tinyconfig, which is clearly significant. For defconfig, it's
>> still 2.0% or 275kb size reduction when all syscalls are dropped.
>
> I did something similar a while ago. Results included here:
>
> https://lwn.net/Articles/746780/
>
> In my case, stubbing out all syscalls produced a 7.8% reduction which 
> was somewhat disappointing compared to other techniques. Of course it 
> all depends on what is your actual goal.

Thanks for the link, I had forgotten about your article.

With all the findings combined, I guess the filtering
at the syscall table level is not all that promising
any more. Going through the list of saved space, I ended up
with 5.7% (47kb) in the best case after I left the 40 syscalls
from the example in this thread.

Removing entire groups of features using normal Kconfig symbols
based on the remaining syscalls that have the largest size
probably gives better results. I can see possible groups
of syscalls that could be disabled under CONFIG_EXPERT,
along with making their underlying infrastructure optional:

- xattr
- ptrace
- adjtimex
- splice/vmsplice/tee
- unshare/setns
- sched_*

After those, one would quickly hit diminishing returns.
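
To give a rough idea of what each group is worth, here is a sketch in
Python that sums the per-syscall sizes from the riscv tinyconfig table
earlier in this thread. The grouping by name is my own reading of the list
above. The sums only count the marginal size of each syscall table entry,
so a Kconfig option that also removes the underlying infrastructure could
save more than this.

```python
# Per-syscall sizes in bytes, copied from the tinyconfig table in this thread.
GROUPS = {
    "xattr": {
        "setxattr": 1496, "lsetxattr": 28, "fsetxattr": 148,
        "getxattr": 1404, "lgetxattr": 16, "fgetxattr": 80,
        "listxattr": 276, "llistxattr": 16, "flistxattr": 68,
        "removexattr": 460, "lremovexattr": 20, "fremovexattr": 92,
    },
    "ptrace": {"ptrace": 740},
    "adjtimex": {"adjtimex": 1514},
    "splice/vmsplice/tee": {"splice": 1388, "vmsplice": 2808, "tee": 536},
    "unshare/setns": {"unshare": 608, "setns": 624},
    "sched_*": {
        "sched_setparam": 140, "sched_setscheduler": 36,
        "sched_getscheduler": 64, "sched_getparam": 88,
        "sched_setaffinity": 196, "sched_getaffinity": 180,
        "sched_yield": 24, "sched_get_priority_max": 60,
        "sched_get_priority_min": 60, "sched_rr_get_interval": 164,
        "sched_setattr": 0, "sched_getattr": 208,
    },
}

# Sum each group and list the groups from largest to smallest.
totals = {name: sum(sizes.values()) for name, sizes in GROUPS.items()}
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {total:6d} bytes")
print(f"{'all groups':22s} {sum(totals.values()):6d} bytes")
```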

     Arnd
Yuan Tan Sept. 30, 2023, 9:31 a.m. UTC | #6
I don't know why linux-kernel@vger.kernel.org rejected the email I sent with
Thunderbird, so I am resending it with git send-email.

Here is a test result about DEAD_CODE_DATA_ELIMINATION (DCE) and dead syscalls
elimination (DSE). It's based on config[1] and a simple hello.c initramfs.

In the DSE test, we set CONFIG_SYSCALLS_USED="sys_write sys_exit
sys_reboot", which are the syscalls hello.c uses to print "Hello", exit,
and shut down qemu.


|                                    | syscalls left  | vmlinux size     | vmlinux after strip |
| ---------------------------------- | -------------- | ---------------- | ------------------- |
| disable DCE                        | 236            | 2559632          | 1963400             |
| enable DCE                         | 208            | 2037384 (-20.4%) | 1485776 (-24.3%)    |
| enable DCE and DSE(SHF_GROUP)      | 3              | 1856640 (-27.5%) | 1354424 (-31.0%)    |
| enable DCE and DSE(SHF_LINK_ORDER) | 3              | 1856664 (-27.5%) | 1354424 (-31.0%)    |

It shows that dead syscalls elimination saves about a further 7% of the
original (non-DCE) vmlinux size on top of DCE.
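
The 7% figure depends on which size is taken as the reference. A small
sketch in Python, using only the unstripped vmlinux sizes from the table
(the constant names are made up for this example), shows both ways of
counting:

```python
# Unstripped vmlinux sizes in bytes, copied from the table above.
NO_DCE = 2559632     # DCE disabled
DCE = 2037384        # DCE enabled
DCE_DSE = 1856640    # DCE and DSE enabled (section group variant)

dse_saved = DCE - DCE_DSE                   # bytes removed by DSE alone
vs_original = 100.0 * dse_saved / NO_DCE    # relative to the non-DCE image
vs_dce = 100.0 * dse_saved / DCE            # relative to the DCE-only image

print(f"DSE removes {dse_saved} bytes")
print(f"  {vs_original:.1f}% of the non-DCE image")
print(f"  {vs_dce:.1f}% of the DCE-only image")
```

Measured against the image that already has DCE enabled, the gain from DSE
is therefore somewhat larger than 7%.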

[1]: https://pastebin.com/KG4fd7aT
Yuan Tan Oct. 3, 2023, 4:43 p.m. UTC | #7
I didn't test DSE with explicit KEEP() in the previous mail, so I am making up
for it now.

This test result is about DEAD_CODE_DATA_ELIMINATION (DCE) and dead syscalls
elimination (DSE). It's based on config[1] and a simple hello.c initramfs.

We set CONFIG_SYSCALLS_USED="sys_write sys_exit sys_reboot", which are the
syscalls hello.c uses to print "Hello", exit, and shut down qemu.

|                                                              | syscalls left  | vmlinux size     | vmlinux after strip |
| ------------------------------------------------------------ | -------------- | ---------------- | ------------------- |
| disable DCE                                                  | 236            | 2559632          | 1963400             |
| enable DCE                                                   | 208            | 2037384 (-20.4%) | 1485776 (-24.3%)    |
| enable DCE and DSE with explicit KEEP() of exception table   | 17             | 1899208 (-25.8%) | 1387272 (-29.3%)    |
| enable DCE and DSE without KEEP() (By SHF_GROUP method)      | 3              | 1856640 (-27.5%) | 1354424 (-31.0%)    |
| enable DCE and DSE without KEEP() (By SHF_LINK_ORDER method) | 3              | 1856664 (-27.5%) | 1354424 (-31.0%)    |


It shows that dead syscalls elimination saves about a further 7% of the
original (non-DCE) vmlinux size on top of DCE.

Although dropping KEEP() saves only about 2% more space, it reduces the attack
surface and eliminates the misuse of KEEP(). It also ensures that no section
is left as an orphan section.

[1]: https://pastebin.com/KG4fd7aT