[07/13] llvm: fix output OP_ADD mixed with pointers

Message ID	20170305112047.3411-8-luc.vanoostenryck@gmail.com (mailing list archive)
State	Superseded, archived
Headers	show Return-Path: <linux-sparse-owner@kernel.org> From: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> To: linux-sparse@vger.kernel.org Cc: Dibyendu Majumdar <mobile@majumdar.org.uk>, Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Subject: [PATCH 07/13] llvm: fix output OP_ADD mixed with pointers Date: Sun, 5 Mar 2017 12:20:41 +0100 Message-Id: <20170305112047.3411-8-luc.vanoostenryck@gmail.com> In-Reply-To: <20170305112047.3411-1-luc.vanoostenryck@gmail.com> References: <20170305112047.3411-1-luc.vanoostenryck@gmail.com> Sender: linux-sparse-owner@vger.kernel.org Precedence: bulk

Luc Van Oostenryck March 5, 2017, 11:20 a.m. UTC

In sparse, pointer arithmetic and accessing the field
of a structure or an array is simply done via OP_ADD,
the offset being calculated at evaluation time.
On the other hand, LLVM allows addition only on two
integers and pointer arithmetic/member access is done
via 'getelementptr'.

sparse-llvm didn't took this in account which resulted
in type error in 'add' instructions.

Fix this by catching addition involving pointer and issuing
a getelementptr' instruction for these.

Originally-by: Dibyendu Majumdar <mobile@majumdar.org.uk>
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
---
 sparse-llvm.c | 4 ++++
 1 file changed, 4 insertions(+)

Christopher Li March 6, 2017, 3:16 p.m. UTC | #1

On Sun, Mar 5, 2017 at 7:20 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> In sparse, pointer arithmetic and accessing the field
> of a structure or an array is simply done via OP_ADD,
> the offset being calculated at evaluation time.
> On the other hand, LLVM allows addition only on two
> integers and pointer arithmetic/member access is done
> via 'getelementptr'.
>
> sparse-llvm didn't took this in account which resulted
> in type error in 'add' instructions.
>
> Fix this by catching addition involving pointer and issuing
> a getelementptr' instruction for these.

I have one related question. In the case of anonymous
structure or union, how to you figure out which series of
GEP it needs to be? I think sparse already lost the element
pointer index information. You can construct it back by looking
at the bit offset. But if there is union then the element point can
have multiple path to reach to the same bit offset. I don't
know how to deal with that.

>
> Originally-by: Dibyendu Majumdar <mobile@majumdar.org.uk>
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> ---
>  sparse-llvm.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/sparse-llvm.c b/sparse-llvm.c
> index 27cc1b88c..ee374b217 100644
> --- a/sparse-llvm.c
> +++ b/sparse-llvm.c
> @@ -472,6 +472,10 @@ static void output_op_binary(struct function *fn, struct instruction *insn)
>         case OP_ADD:
>                 if (symbol_is_fp_type(insn->type))
>                         target = LLVMBuildFAdd(fn->builder, lhs, rhs, target_name);
> +               else if (LLVMGetTypeKind(LLVMTypeOf(lhs)) == LLVMPointerTypeKind)
> +                       target = LLVMBuildGEP(fn->builder, lhs, &rhs, 1, "");
> +               else if (LLVMGetTypeKind(LLVMTypeOf(rhs)) == LLVMPointerTypeKind)
> +                       target = LLVMBuildGEP(fn->builder, rhs, &lhs, 1, "");

It seems that you only have one level of indices. Also the OP_ADD use the member
offset of the struct. I am not sure how it map into GEP indices.
Correct me if I am
wrong, it seems to me you use the member offset as element indices. I
think it need
to get a mapping between offset to indices.

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dibyendu Majumdar March 6, 2017, 3:32 p.m. UTC | #2

Hi Chris,

On 6 March 2017 at 15:16, Christopher Li <sparse@chrisli.org> wrote:
> On Sun, Mar 5, 2017 at 7:20 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>> In sparse, pointer arithmetic and accessing the field
>> of a structure or an array is simply done via OP_ADD,
>> the offset being calculated at evaluation time.
>> On the other hand, LLVM allows addition only on two
>> integers and pointer arithmetic/member access is done
>> via 'getelementptr'.
>>
>> sparse-llvm didn't took this in account which resulted
>> in type error in 'add' instructions.
>>
>> Fix this by catching addition involving pointer and issuing
>> a getelementptr' instruction for these.
>
> I have one related question. In the case of anonymous
> structure or union, how to you figure out which series of
> GEP it needs to be? I think sparse already lost the element
> pointer index information. You can construct it back by looking
> at the bit offset. But if there is union then the element point can
> have multiple path to reach to the same bit offset. I don't
> know how to deal with that.

Sparse-llvm appears to bypass the normal struct GEP in LLVM. It
basically casts everything to char *, uses GEP to obtain a pointer to
member, and then casts it back to member type. So this should work for
structs and unions.

>
>>
>> Originally-by: Dibyendu Majumdar <mobile@majumdar.org.uk>
>> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
>> ---
>>  sparse-llvm.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/sparse-llvm.c b/sparse-llvm.c
>> index 27cc1b88c..ee374b217 100644
>> --- a/sparse-llvm.c
>> +++ b/sparse-llvm.c
>> @@ -472,6 +472,10 @@ static void output_op_binary(struct function *fn, struct instruction *insn)
>>         case OP_ADD:
>>                 if (symbol_is_fp_type(insn->type))
>>                         target = LLVMBuildFAdd(fn->builder, lhs, rhs, target_name);
>> +               else if (LLVMGetTypeKind(LLVMTypeOf(lhs)) == LLVMPointerTypeKind)
>> +                       target = LLVMBuildGEP(fn->builder, lhs, &rhs, 1, "");
>> +               else if (LLVMGetTypeKind(LLVMTypeOf(rhs)) == LLVMPointerTypeKind)
>> +                       target = LLVMBuildGEP(fn->builder, rhs, &lhs, 1, "");
>
> It seems that you only have one level of indices. Also the OP_ADD use the member
> offset of the struct. I am not sure how it map into GEP indices.
> Correct me if I am
> wrong, it seems to me you use the member offset as element indices. I
> think it need
> to get a mapping between offset to indices.
>

This case is more about handing pointer arithmetic correctly. Here GEP
is being used as array element access and not for struct member
access.

Hope this is helpful.

Regards
Dibyendu
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christopher Li March 6, 2017, 4:22 p.m. UTC | #3

On Mon, Mar 6, 2017 at 11:32 PM, Dibyendu Majumdar
<mobile@majumdar.org.uk> wrote:
>
> Sparse-llvm appears to bypass the normal struct GEP in LLVM. It
> basically casts everything to char *, uses GEP to obtain a pointer to
> member, and then casts it back to member type. So this should work for
> structs and unions.

That is exactly my question. I haven't understand how this by pass
can work in all the situation.

>
> This case is more about handing pointer arithmetic correctly. Here GEP
> is being used as array element access and not for struct member
> access.

Even using array element access, I still have some question regarding
the the mapping process of the array index.
Correct me if I am wrong, the GEP indices are express as array index.
So after convert to (char*) + offset, the offset will express as
n*(sizeof(element type)).
"n" is the array index.

However, the OP_ADD instruction, the offset is byte offset value. The offset
part might not be align to n*(sizeof(array_elelment)). Assume we can use
some pack struct to get some member located not at the natural alignment offset.

Even if the offset is at the type size alignment, shouldn't the indices express
as ( offset / sizeof(element type))? What do I miss?

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luc Van Oostenryck March 6, 2017, 4:43 p.m. UTC | #4

On Tue, Mar 07, 2017 at 12:22:07AM +0800, Christopher Li wrote:
> On Mon, Mar 6, 2017 at 11:32 PM, Dibyendu Majumdar
> <mobile@majumdar.org.uk> wrote:
> >
> > Sparse-llvm appears to bypass the normal struct GEP in LLVM. It
> > basically casts everything to char *, uses GEP to obtain a pointer to
> > member, and then casts it back to member type. So this should work for
> > structs and unions.
> 
> That is exactly my question. I haven't understand how this by pass
> can work in all the situation.
> 
> >
> > This case is more about handing pointer arithmetic correctly. Here GEP
> > is being used as array element access and not for struct member
> > access.
> 
> Even using array element access, I still have some question regarding
> the the mapping process of the array index.
> Correct me if I am wrong, the GEP indices are express as array index.
> So after convert to (char*) + offset, the offset will express as
> n*(sizeof(element type)).
> "n" is the array index.
> 
> However, the OP_ADD instruction, the offset is byte offset value. The offset
> part might not be align to n*(sizeof(array_elelment)). Assume we can use
> some pack struct to get some member located not at the natural alignment offset.
> 
> Even if the offset is at the type size alignment, shouldn't the indices express
> as ( offset / sizeof(element type))? What do I miss?

With an example:
== C code ==
	void *foo(int *p) { return p + 5; }

== linearized code ==
	foo:
	.L0:
	        <entry-point>
	        add.64      %r2 <- %arg1, $20
	        cast.64     %r3 <- (64) %r2
	        ret.64      %r3

== LLVM code from sparse-llvm ==
	; ModuleID = '<stdin>'
	source_filename = "sparse"
	
	define i8* @foo(i32* %ARG1) {
	L0:
	  %0 = getelementptr i32, i32* %ARG1, inttoptr (i64 20 to i32*)
	  %R3 = bitcast i32* %0 to i8*
	  ret i8* %R3
	}


-- Luc
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dibyendu Majumdar March 6, 2017, 5:06 p.m. UTC | #5

Hi Luc,

On 6 March 2017 at 16:43, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> On Tue, Mar 07, 2017 at 12:22:07AM +0800, Christopher Li wrote:
>> On Mon, Mar 6, 2017 at 11:32 PM, Dibyendu Majumdar
>> <mobile@majumdar.org.uk> wrote:
>> >
>> > Sparse-llvm appears to bypass the normal struct GEP in LLVM. It
>> > basically casts everything to char *, uses GEP to obtain a pointer to
>> > member, and then casts it back to member type. So this should work for
>> > structs and unions.
>>
>> That is exactly my question. I haven't understand how this by pass
>> can work in all the situation.
>>
>> >
>> > This case is more about handing pointer arithmetic correctly. Here GEP
>> > is being used as array element access and not for struct member
>> > access.
>>
>> Even using array element access, I still have some question regarding
>> the the mapping process of the array index.
>> Correct me if I am wrong, the GEP indices are express as array index.
>> So after convert to (char*) + offset, the offset will express as
>> n*(sizeof(element type)).
>> "n" is the array index.
>>
>> However, the OP_ADD instruction, the offset is byte offset value. The offset
>> part might not be align to n*(sizeof(array_elelment)). Assume we can use
>> some pack struct to get some member located not at the natural alignment offset.
>>
>> Even if the offset is at the type size alignment, shouldn't the indices express
>> as ( offset / sizeof(element type))? What do I miss?
>
> With an example:
> == C code ==
>         void *foo(int *p) { return p + 5; }
>
> == linearized code ==
>         foo:
>         .L0:
>                 <entry-point>
>                 add.64      %r2 <- %arg1, $20
>                 cast.64     %r3 <- (64) %r2
>                 ret.64      %r3
>
> == LLVM code from sparse-llvm ==
>         ; ModuleID = '<stdin>'
>         source_filename = "sparse"
>
>         define i8* @foo(i32* %ARG1) {
>         L0:
>           %0 = getelementptr i32, i32* %ARG1, inttoptr (i64 20 to i32*)
>           %R3 = bitcast i32* %0 to i8*
>           ret i8* %R3
>         }
>

This may be the issue Chris is highlighting -  above the offset should
be 5 not 20 in the LLVM code as the array type is i32* not i8*. Does
sparse always output array offsets in char*?

Regards
Dibyendu
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christopher Li March 6, 2017, 5:07 p.m. UTC | #6

On Tue, Mar 7, 2017 at 12:43 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> With an example:
> == C code ==
>         void *foo(int *p) { return p + 5; }
>
> == linearized code ==
>         foo:
>         .L0:
>                 <entry-point>
>                 add.64      %r2 <- %arg1, $20
>                 cast.64     %r3 <- (64) %r2
>                 ret.64      %r3
>
> == LLVM code from sparse-llvm ==
>         ; ModuleID = '<stdin>'
>         source_filename = "sparse"
>
>         define i8* @foo(i32* %ARG1) {
>         L0:
>           %0 = getelementptr i32, i32* %ARG1, inttoptr (i64 20 to i32*)
>           %R3 = bitcast i32* %0 to i8*
>           ret i8* %R3
>         }
>


OK, good. Let's use this example.
I am using clang to get the llvm bytecode. t.c is your example.

clang -S -emit-llvm /tmp/t.c
Here is output, compare the line I am pointing at:
The indices should be 5 according clang's output.
However sparse-llvm generate as 20 per your output
if I am reading it correctly.

Chris

============== t.ll =====================
 ModuleID = '/tmp/t.c'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: nounwind uwtable
define i8* @foo(i32* %p) #0 {
  %1 = alloca i32*, align 8
  store i32* %p, i32** %1, align 8
  %2 = load i32*, i32** %1, align 8
  %3 = getelementptr inbounds i32, i32* %2, i64 5 <======== look here
  %4 = bitcast i32* %3 to i8*
  ret i8* %4
}

attributes #0 = { nounwind uwtable "disable-tail-calls"="false"
"less-precise-fpmad"="false" "no-frame-pointer-elim"="true"
"no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false"
"no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
"target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2"
"unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.8.1 (tags/RELEASE_381/final)"}
================================================
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds March 6, 2017, 6:17 p.m. UTC | #7

On Mon, Mar 6, 2017 at 8:43 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> With an example:
> == C code ==
>         void *foo(int *p) { return p + 5; }
>
> == linearized code ==
>         foo:
>         .L0:
>                 <entry-point>
>                 add.64      %r2 <- %arg1, $20
>                 cast.64     %r3 <- (64) %r2
>                 ret.64      %r3

This is correct.

> == LLVM code from sparse-llvm ==
>
>         define i8* @foo(i32* %ARG1) {
>         L0:
>           %0 = getelementptr i32, i32* %ARG1, inttoptr (i64 20 to i32*)
>           %R3 = bitcast i32* %0 to i8*

This is garbage, I'm afraid.

When sparse does the "add 20 to pointer", it adds the *byte offset* 20
to the pointer. The LLVM module should not use "getelementptr" for
this, because it's not element #20, it's the element at offset 20.

I think you're supposed to either use "uglygep" with the base pointer
cast to a simple address-unit pointer (ie unsigned char).

Or not use GEP at all.

                Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luc Van Oostenryck March 6, 2017, 7:50 p.m. UTC | #8

On Mon, Mar 06, 2017 at 05:06:01PM +0000, Dibyendu Majumdar wrote:
> 
> This may be the issue Chris is highlighting -  above the offset should
> be 5 not 20 in the LLVM code as the array type is i32* not i8*. Does
> sparse always output array offsets in char*?

Yes, of course.

I think that the best way to see things is not to consider that 
"sparse output offset in char*". As you have already seen, what
sparse produce after linearization is more low-level that LLVM's IR,
it's closer to what a CPU would see. So I think you should see
sparse's offset simply as an *address offset*, the values involved
being the values that would in the register of some CPU.

-- Luc 
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luc Van Oostenryck March 6, 2017, 7:52 p.m. UTC | #9

On Tue, Mar 07, 2017 at 01:07:27AM +0800, Christopher Li wrote:
> On Tue, Mar 7, 2017 at 12:43 AM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
> >
> > With an example:
> > == C code ==
> >         void *foo(int *p) { return p + 5; }
> >
> > == linearized code ==
> >         foo:
> >         .L0:
> >                 <entry-point>
> >                 add.64      %r2 <- %arg1, $20
> >                 cast.64     %r3 <- (64) %r2
> >                 ret.64      %r3
> >
> > == LLVM code from sparse-llvm ==
> >         ; ModuleID = '<stdin>'
> >         source_filename = "sparse"
> >
> >         define i8* @foo(i32* %ARG1) {
> >         L0:
> >           %0 = getelementptr i32, i32* %ARG1, inttoptr (i64 20 to i32*)
> >           %R3 = bitcast i32* %0 to i8*
> >           ret i8* %R3
> >         }
> >
> 
> 
> OK, good. Let's use this example.
> I am using clang to get the llvm bytecode. t.c is your example.
> 
> clang -S -emit-llvm /tmp/t.c
> Here is output, compare the line I am pointing at:
> The indices should be 5 according clang's output.
> However sparse-llvm generate as 20 per your output
> if I am reading it correctly.

Absolutely.

-- Luc 
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luc Van Oostenryck March 6, 2017, 8:09 p.m. UTC | #10

On Mon, Mar 06, 2017 at 10:17:35AM -0800, Linus Torvalds wrote:
> On Mon, Mar 6, 2017 at 8:43 AM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
> >
> > With an example:
> > == C code ==
> >         void *foo(int *p) { return p + 5; }
> >
> > == linearized code ==
> >         foo:
> >         .L0:
> >                 <entry-point>
> >                 add.64      %r2 <- %arg1, $20
> >                 cast.64     %r3 <- (64) %r2
> >                 ret.64      %r3
> 
> This is correct.

Interestingly, the following code is linearized a bit differently:
== C code ==
	void *foo(int *p) { return p += 5; }
== linearized code ==
	foo:
	.L0:
	        <entry-point>
	        cast.64     %r98 <- (64) %arg1
	        add.64      %r99 <- %r98, $20
	        ptrcast.64  %r100 <- (64) %r99
	        cast.64     %r101 <- (64) %r100
	        ret.64      %r101

Where the type of %r98 & %r99 is 'unsigned long' (or more probably size_t).
This is then correctly LLVMized as:
	define i8* @foo(i32* %ARG1) {
	L18:
	  %R98 = ptrtoint i32* %ARG1 to i64
	  %R99 = add i64 %R98, 20
	  %R100 = inttoptr i64 %R99 to i32*
	  %R101 = bitcast i32* %R100 to i8*
	  ret i8* %R101
	}
 
> > == LLVM code from sparse-llvm ==
> >
> >         define i8* @foo(i32* %ARG1) {
> >         L0:
> >           %0 = getelementptr i32, i32* %ARG1, inttoptr (i64 20 to i32*)
> >           %R3 = bitcast i32* %0 to i8*
> 
> This is garbage, I'm afraid.
> 
> When sparse does the "add 20 to pointer", it adds the *byte offset* 20
> to the pointer. The LLVM module should not use "getelementptr" for
> this, because it's not element #20, it's the element at offset 20.

It's clear that there is a semantic gap between sparse's & LLVM's IR.

> I think you're supposed to either use "uglygep" with the base pointer
> cast to a simple address-unit pointer (ie unsigned char).
> 
> Or not use GEP at all.
> 
>                 Linus

I bet we'll have others problems with GEP.

Luc
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[07/13] llvm: fix output OP_ADD mixed with pointers

Commit Message

Comments

Patch