diff mbox series

objtool,x86: Teach decode about LOOP* instructions

Message ID Yxhd4EMKyoFoH9y4@hirez.programming.kicks-ass.net (mailing list archive)
State Not Applicable
Delegated to: BPF
Series objtool,x86: Teach decode about LOOP* instructions

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Peter Zijlstra Sept. 7, 2022, 9:01 a.m. UTC
On Wed, Sep 07, 2022 at 09:06:45AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 07, 2022 at 09:55:21AM +0900, Masami Hiramatsu (Google) wrote:
> 
> > +/* Return the jump target address or 0 */
> > +static inline unsigned long insn_get_branch_addr(struct insn *insn)
> > +{
> > +	switch (insn->opcode.bytes[0]) {
> > +	case 0xe0:	/* loopne */
> > +	case 0xe1:	/* loope */
> > +	case 0xe2:	/* loop */
> 
> Oh cute, objtool doesn't know about those, let me go add them.

---
Subject: objtool,x86: Teach decode about LOOP* instructions

With kprobes also needing to follow control flow, it was found that
objtool is missing the branches from the LOOP* instructions.

Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/objtool/arch/x86/decode.c | 6 ++++++
 1 file changed, 6 insertions(+)
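
For readers unfamiliar with the encoding, here is a minimal sketch of why these opcodes count as conditional jumps. It assumes the plain two-byte form (opcode followed by a signed 8-bit displacement, no prefixes). The helper names is_loop_opcode() and loop_branch_target() are hypothetical and are not part of objtool or the kernel's instruction decoder; the patch itself only sets the instruction type to INSN_JUMP_CONDITIONAL.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only; neither exists in objtool.
 * 0xe0 = loopne, 0xe1 = loope, 0xe2 = loop.
 */
static int is_loop_opcode(uint8_t op)
{
	return op >= 0xe0 && op <= 0xe2;
}

/*
 * Return the branch target of a LOOP* instruction located at 'addr',
 * or 0 if bytes[0] is not one of the LOOP* opcodes.
 *
 * Assumption: unprefixed encoding, so the instruction is exactly two
 * bytes long (opcode + rel8). The displacement is sign-extended and is
 * relative to the address of the next instruction.
 */
static uint64_t loop_branch_target(const uint8_t *bytes, uint64_t addr)
{
	if (!is_loop_opcode(bytes[0]))
		return 0;

	return addr + 2 + (int64_t)(int8_t)bytes[1];
}
```

For example, the byte pair e2 fe at address 0x1000 is a loop that branches to itself, so loop_branch_target() returns 0x1000. The instruction either takes that branch or falls through to 0x1002, which is why a control-flow follower has to treat it as a conditional jump.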

Comments

David Laight Sept. 7, 2022, 9:06 a.m. UTC | #1
From: Peter Zijlstra
> Sent: 07 September 2022 10:01
> 
> On Wed, Sep 07, 2022 at 09:06:45AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 07, 2022 at 09:55:21AM +0900, Masami Hiramatsu (Google) wrote:
> >
> > > +/* Return the jump target address or 0 */
> > > +static inline unsigned long insn_get_branch_addr(struct insn *insn)
> > > +{
> > > +	switch (insn->opcode.bytes[0]) {
> > > +	case 0xe0:	/* loopne */
> > > +	case 0xe1:	/* loope */
> > > +	case 0xe2:	/* loop */
> >
> > Oh cute, objtool doesn't know about those, let me go add them.

Do they ever appear in the kernel?
They are so slow on Intel CPUs that finding one ought to
be deemed a bug!

Have you got jcxz (0xe3) in there?
It is fast on both Intel and AMD CPUs, so it is usable.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Peter Zijlstra Sept. 7, 2022, 9:40 a.m. UTC | #2
On Wed, Sep 07, 2022 at 09:06:12AM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 07 September 2022 10:01
> > 
> > On Wed, Sep 07, 2022 at 09:06:45AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 07, 2022 at 09:55:21AM +0900, Masami Hiramatsu (Google) wrote:
> > >
> > > > +/* Return the jump target address or 0 */
> > > > +static inline unsigned long insn_get_branch_addr(struct insn *insn)
> > > > +{
> > > > +	switch (insn->opcode.bytes[0]) {
> > > > +	case 0xe0:	/* loopne */
> > > > +	case 0xe1:	/* loope */
> > > > +	case 0xe2:	/* loop */
> > >
> > > Oh cute, objtool doesn't know about those, let me go add them.
> 
> Do they ever appear in the kernel?

No; that is, not on any of the random vmlinux.o images I checked this
morning.

Still, best to properly decode them anyway.

David Laight Sept. 7, 2022, 11:13 a.m. UTC | #3
From: Peter Zijlstra
> Sent: 07 September 2022 10:40
> 
> On Wed, Sep 07, 2022 at 09:06:12AM +0000, David Laight wrote:
> > From: Peter Zijlstra
> > > Sent: 07 September 2022 10:01
> > >
> > > On Wed, Sep 07, 2022 at 09:06:45AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 07, 2022 at 09:55:21AM +0900, Masami Hiramatsu (Google) wrote:
> > > >
> > > > > +/* Return the jump target address or 0 */
> > > > > +static inline unsigned long insn_get_branch_addr(struct insn *insn)
> > > > > +{
> > > > > +	switch (insn->opcode.bytes[0]) {
> > > > > +	case 0xe0:	/* loopne */
> > > > > +	case 0xe1:	/* loope */
> > > > > +	case 0xe2:	/* loop */
> > > >
> > > > Oh cute, objtool doesn't know about those, let me go add them.
> >
> > Do they ever appear in the kernel?
> 
> No; that is, not on any of the random vmlinux.o images I checked this
> morning.
> 
> Still, best to properly decode them anyway.

It is annoying that CPUs with adox/adcx have a slow loop instruction.
You really want to be able to do:
	1:	adox ...
		adcx ...
		loop	1b
That would never run with one iteration/clock.
But unrolling once would probably be enough.

What you can do (and this gives the fastest IP checksum loop) is:
	1:	jcxz	2f
		....
		lea	%rcx,...
		jmp	1b
	2:
The extra instructions mean that it needs unrolling 4 times.
I've got over 12 bytes/clock that way.

	David

Patch

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index c260006106be..1c253b4b7ce0 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -635,6 +635,12 @@  int arch_decode_instruction(struct objtool_file *file, const struct section *sec
 		*type = INSN_CONTEXT_SWITCH;
 		break;
 
+	case 0xe0: /* loopne */
+	case 0xe1: /* loope */
+	case 0xe2: /* loop */
+		*type = INSN_JUMP_CONDITIONAL;
+		break;
+
 	case 0xe8:
 		*type = INSN_CALL;
 		/*