Linux 2.6.39-rc3

Message ID	4DA60C30.4060606@kernel.org (mailing list archive)
State	New, archived
Headers
Message-ID: <4DA60C30.4060606@kernel.org>
Date: Wed, 13 Apr 2011 13:48:48 -0700
From: Yinghai Lu <yinghai@kernel.org>
To: Joerg Roedel <joro@8bytes.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, dri-devel@lists.freedesktop.org, Tejun Heo <tj@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>, Linus Torvalds <torvalds@linux-foundation.org>, Thomas Gleixner <tglx@linutronix.de>
Subject: Re: Linux 2.6.39-rc3
In-Reply-To: <20110413193459.GL19819@8bytes.org>

Yinghai Lu April 13, 2011, 8:48 p.m. UTC

On 04/13/2011 12:34 PM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 12:14:55PM -0700, Yinghai Lu wrote:
>> thanks for the bisecting...
>>
>> so those two patches uncover some problems.
>>
>> [    0.000000] Checking aperture...
>> [    0.000000] No AGP bridge found
>> [    0.000000] Node 0: aperture @ a0000000 size 32 MB
>> [    0.000000] Aperture pointing to e820 RAM. Ignoring.
>> [    0.000000] Your BIOS doesn't leave a aperture memory hole
>> [    0.000000] Please enable the IOMMU option in the BIOS setup
>> [    0.000000] This costs you 64 MB of RAM
>> [    0.000000]     memblock_x86_reserve_range: [0xa0000000-0xa3ffffff]       aperture64
>> [    0.000000] Mapping aperture over 65536 KB of RAM @ a0000000
>>
>> so the kernel tries to reallocate the aperture, because the one the BIOS allocated points to RAM or is too small.
> 
> It is actually beyond 4GB on that machine, this value read here is from
> the previous kernel-boot. The BIOS does not reset these values on a
> reboot.
> 
>> but your radeon does use [0xa0000000, 0xbfffffff)
> 
> Yes, I suspected that too (and spent a few hours reading radeon code),
> but then I talked to Alex Deucher and he explained that these addresses
> which the driver prints for GTT and VRAM are in the GPU address space
> and do not refer to system ram. So this shouldn't be the problem.


Can you try the following change? It will push the GART to 0x80000000.

Linus Torvalds April 13, 2011, 8:54 p.m. UTC | #1

On Wed, Apr 13, 2011 at 1:48 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> Can you try the following change? It will push the GART to 0x80000000.
>
> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> index 86d1ad4..3b6a9d5 100644
> --- a/arch/x86/kernel/aperture_64.c
> +++ b/arch/x86/kernel/aperture_64.c
> @@ -83,7 +83,7 @@ static u32 __init allocate_aperture(void)
>         * so don't use 512M below as gart iommu, leave the space for kernel
>         * code for safe
>         */
> -       addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
> +       addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);

What are all the magic numbers, and why would 0x80000000 be special?

Why don't we write code that just works?

Or absent a "just works" set of patches, why don't we revert to code
that has years of testing?

This kind of "I broke things, so now I will jiggle things randomly
until they unbreak" is not acceptable.

Either explain why that fixes a real BUG (and why the magic constants
need to be what they are), or just revert the patch that caused the
problem, and go back to the allocation patterns that have years of
experience.

Guys, we've had this discussion before, in PCI allocation. We don't do
this. We tried switching the PCI region allocations to top-down, and
IT WAS A FAILURE. We reverted it to what we had years of testing with.

Don't just make random changes. There really are only two acceptable
models of development: "think and analyze" or "years and years of
testing on thousands of machines". Those two really do work.

                   Linus

Yinghai Lu April 13, 2011, 9:23 p.m. UTC | #2

On 04/13/2011 01:54 PM, Linus Torvalds wrote:
> On Wed, Apr 13, 2011 at 1:48 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>
>> Can you try the following change? It will push the GART to 0x80000000.
>>
>> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
>> index 86d1ad4..3b6a9d5 100644
>> --- a/arch/x86/kernel/aperture_64.c
>> +++ b/arch/x86/kernel/aperture_64.c
>> @@ -83,7 +83,7 @@ static u32 __init allocate_aperture(void)
>>         * so don't use 512M below as gart iommu, leave the space for kernel
>>         * code for safe
>>         */
>> -       addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
>> +       addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
> 
> What are all the magic numbers, and why would 0x80000000 be special?

That is the old value from when the kernel was doing bottom-up bootmem allocation.

> 
> Why don't we write code that just works?
> 
> Or absent a "just works" set of patches, why don't we revert to code
> that has years of testing?
> 
> This kind of "I broke things, so now I will jiggle things randomly
> until they unbreak" is not acceptable.
> 
> Either explain why that fixes a real BUG (and why the magic constants
> need to be what they are), or just revert the patch that caused the
> problem, and go back to the allocation patterns that have years of
> experience.
> 
> Guys, we've had this discussion before, in PCI allocation. We don't do
> this. We tried switching the PCI region allocations to top-down, and
> IT WAS A FAILURE. We reverted it to what we had years of testing with.
> 
> Don't just make random changes. There really are only two acceptable
> models of development: "think and analyze" or "years and years of
> testing on thousands of machines". Those two really do work.

We did do the analyzing, and the only difference seems to be:
the good one is using 0x80000000
and the bad one is using 0xa0000000.

We are trying to figure out whether it needs a low address and only happened to work
because the kernel was doing bottom-up allocation.

Thanks

Yinghai

Joerg Roedel April 13, 2011, 9:50 p.m. UTC | #3

On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);

Btw, while looking at this code I wondered why the 512M goal is enforced
by the alignment. Start could be set to 512M instead and the alignment
can be aper_size as it should. Any reason for such a big alignment?

	Joerg

P.S.: The box is still in the office, I will try this debug-patch
      tomorrow.
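
A minimal sketch of why the alignment argument alone can move the result from 0xa0000000 to 0x80000000 under a top-down search. This is a toy model, not the kernel's memblock_find_in_range(): the helper names and the free-region end of 0xb7ff0000 used in the note below are assumptions chosen only for illustration, and the 64 MB size matches the aperture size in the dmesg above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NOT_FOUND UINT64_MAX	/* hypothetical failure sentinel */

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t toy_round_down(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/*
 * Toy top-down search in a single free region [start, end): return the
 * highest aligned base at which size bytes still fit, or TOY_NOT_FOUND.
 */
static uint64_t toy_find_top_down(uint64_t start, uint64_t end,
				  uint64_t size, uint64_t align)
{
	uint64_t base;

	if (end < start || end - start < size)
		return TOY_NOT_FOUND;

	base = toy_round_down(end - size, align);
	return base >= start ? base : TOY_NOT_FOUND;
}
```

With the assumed free region [0x100000, 0xb7ff0000) and a 64 MB request, an alignment of 512ULL<<20 yields 0xa0000000, 512ULL<<21 yields 0x80000000, and an alignment equal to the request size yields 0xb0000000. The sketch only shows the rounding effect; it says nothing about why one of those addresses fails on real hardware.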

Yinghai Lu April 13, 2011, 9:59 p.m. UTC | #4

On 04/13/2011 02:50 PM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
>> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
>> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
> 
> Btw, while looking at this code I wondered why the 512M goal is enforced
> by the alignment. Start could be set to 512M instead and the alignment
> can be aper_size as it should. Any reason for such a big alignment?
> 

When using bootmem, we try to use a big alignment (512M) so that we avoid taking the RAM range below 512M.

commit 7677b2ef6c0c4fddc84f6473f3863f40eb71821b
Author: Yinghai Lu <yhlu.kernel.send@gmail.com>
Date:   Mon Apr 14 20:40:37 2008 -0700

    x86_64: allocate gart aperture from 512M
    
    because we try to reserve dma32 early, so we have chance to get aperture
    from 64M.
    
    with some sequence aperture allocated from RAM, could become E820_RESERVED.
    
    and then if doing a kexec with a big kernel that uncompressed size is above
    64M we could have a range conflict with still using gart.
    
    So allocate gart aperture from 512M instead.
    
    Also change the fallback_aper_order to 5, because we don't have chance to get
    2G or 4G aperture.

We can change it back to 32M or make it equal to size.

> 
> P.S.: The box is still in the office, I will try this debug-patch
>       tomorrow.

Alexandre's system is working at 0xa4000000 with 2.6.38.2

So it is not a low-address problem. There could be another reason, for example
some other code could need a lower address.

Thanks

Yinghai

H. Peter Anvin April 13, 2011, 10:01 p.m. UTC | #5

On 04/13/2011 02:50 PM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
>> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
>> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
> 
> Btw, while looking at this code I wondered why the 512M goal is enforced
> by the alignment. Start could be set to 512M instead and the alignment
> can be aper_size as it should. Any reason for such a big alignment?
> 
> 	Joerg
> 
> P.S.: The box is still in the office, I will try this debug-patch
>       tomorrow.

The only reason that I can think of is that the aperture itself can be
huge, and perhaps 512 MiB is the biggest such known.  512ULL<<21 is of
course a particularly moronic way to write 1 GiB, but it was a debug patch.

The value 512 MiB apparently comes from
7677b2ef6c0c4fddc84f6473f3863f40eb71821b, which is apparently totally ad
hoc; effectively it tries to prevent a collision with kexec by
hardcoding the kdump allocation as it sat at that point in time in the
GART assignment rules.

Yeah.  Brilliant.

	-hpa
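
The arithmetic behind the two alignment constants can be checked directly. The sketch below only spells out the values; the constant and helper names are made up for illustration and do not appear in aperture_64.c.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIB (1ULL << 20)
#define TOY_GIB (1ULL << 30)

/* The alignment used before the debug patch: 512 * 2^20 = 2^29 bytes. */
static const uint64_t toy_align_old = 512ULL << 20;

/* The alignment in the debug patch: 512 * 2^21 = 2^30 bytes. */
static const uint64_t toy_align_debug = 512ULL << 21;

/* Return 1 if addr is a multiple of align (align must be a power of two). */
static int toy_is_aligned(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}
```

So 512ULL<<20 is 512 MiB (0x20000000) and 512ULL<<21 is 1 GiB (0x40000000). 0xa0000000 is a multiple of the former but not of the latter, while 0x80000000 is a multiple of both, which is consistent with the debug patch moving the allocation.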

H. Peter Anvin April 13, 2011, 10:11 p.m. UTC | #6

On 04/13/2011 02:59 PM, Yinghai Lu wrote:
> On 04/13/2011 02:50 PM, Joerg Roedel wrote:
>> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
>>> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
>>> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
>>
>> Btw, while looking at this code I wondered why the 512M goal is enforced
>> by the alignment. Start could be set to 512M instead and the alignment
>> can be aper_size as it should. Any reason for such a big alignment?
>>
> 
> When using bootmem, we try to use a big alignment (512M) so that we avoid taking the RAM range below 512M.
> 

Yes, his question was why on Earth are you using 0 as start if that is
the purpose.

On top of that, where the hell does the magic 512 MiB come from?  It
looks like it is either completely ad hoc, or it has something to do with
where the kexec kernel was allocated once upon a time.

	-hpa

Joerg Roedel April 13, 2011, 10:22 p.m. UTC | #7

On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote:
> On 04/13/2011 02:50 PM, Joerg Roedel wrote:
> > On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
> >> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
> >> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
> > 
> > Btw, while looking at this code I wondered why the 512M goal is enforced
> > by the alignment. Start could be set to 512M instead and the alignment
> > can be aper_size as it should. Any reason for such a big alignment?
> > 
> > 	Joerg
> > 
> > P.S.: The box is still in the office, I will try this debug-patch
> >       tomorrow.
> 
> The only reason that I can think of is that the aperture itself can be
> huge, and perhaps 512 MiB is the biggest such known. 

Well, that would work as well by just using aper_size as alignment, the
aperture needs to be aligned on its size anyway. This code only runs
when Linux allocates the aperture itself and, if I am not mistaken, it always
uses 64MB when doing this.

	Joerg

H. Peter Anvin April 13, 2011, 10:31 p.m. UTC | #8

On 04/13/2011 03:22 PM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote:
>> On 04/13/2011 02:50 PM, Joerg Roedel wrote:
>>> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
>>>> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
>>>> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
>>>
>>> Btw, while looking at this code I wondered why the 512M goal is enforced
>>> by the alignment. Start could be set to 512M instead and the alignment
>>> can be aper_size as it should. Any reason for such a big alignment?
>>>
>>> 	Joerg
>>>
>>> P.S.: The box is still in the office, I will try this debug-patch
>>>       tomorrow.
>>
>> The only reason that I can think of is that the aperture itself can be
>> huge, and perhaps 512 MiB is the biggest such known. 
> 
> Well, that would work as well by just using aper_size as alignment, the
> aperture needs to be aligned on its size anyway. This code only runs
> when Linux allocates the aperture itself and, if I am not mistaken, it always
> uses 64MB when doing this.

Yes, I would agree with that.  The sane thing would be to set the base
to whatever address needs to be guarded against (WHICH SHOULD BE
MOTIVATED), and use aper_size as alignment, *unless* we are only using
the initial portion of a much larger hardware structure that needs
natural alignment (which isn't clear to me, I do know we sometimes use
only a fraction of the GART, but that doesn't mean we need to
naturally-align the entire thing, nor that 512 MiB is sufficient to do so.)

	-hpa
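
A minimal sketch of the alternative discussed above: make the lower bound explicit as the search start and align only to the aperture size. It uses a toy bottom-up search over one free region, not the kernel allocator. All names are hypothetical, and the addresses in the note below are assumed values, not taken from any of the machines in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NOT_FOUND UINT64_MAX	/* hypothetical failure sentinel */

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t toy_round_up(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Toy bottom-up search in a single free region [start, end): return the
 * lowest aligned base at which size bytes fit, or TOY_NOT_FOUND.
 */
static uint64_t toy_find_bottom_up(uint64_t start, uint64_t end,
				   uint64_t size, uint64_t align)
{
	uint64_t base = toy_round_up(start, align);

	/* base < start means the rounding wrapped around. */
	if (base < start || base > end || end - base < size)
		return TOY_NOT_FOUND;
	return base;
}
```

If free memory is assumed to start at 1 MB, a 512M alignment with start 0 and an explicit 512M start with 64M alignment both give 0x20000000. If free memory is assumed to start at 0x21000000 instead, the 512M alignment skips ahead to 0x40000000 while the 64M alignment gives 0x24000000. The explicit start states the intent; the large alignment only achieves it as a side effect.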

Linus Torvalds April 13, 2011, 11:39 p.m. UTC | #9

On Wed, Apr 13, 2011 at 2:23 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>
>> What are all the magic numbers, and why would 0x80000000 be special?
>
> That is the old value from when the kernel was doing bottom-up bootmem allocation.

I understand, BUT THAT IS STILL A TOTALLY MAGIC NUMBER!

It makes it come out the same ON THAT ONE MACHINE.  So no, it's not
"the old value". It's a random value that gets the old value in one
specific case.

>> Why don't we write code that just works?
>>
>> Or absent a "just works" set of patches, why don't we revert to code
>> that has years of testing?
>>
>> This kind of "I broke things, so now I will jiggle things randomly
>> until they unbreak" is not acceptable.
>>
>> Either explain why that fixes a real BUG (and why the magic constants
>> need to be what they are), or just revert the patch that caused the
>> problem, and go back to the allocation patterns that have years of
>> experience.
>>
>> Guys, we've had this discussion before, in PCI allocation. We don't do
>> this. We tried switching the PCI region allocations to top-down, and
>> IT WAS A FAILURE. We reverted it to what we had years of testing with.
>>
>> Don't just make random changes. There really are only two acceptable
>> models of development: "think and analyze" or "years and years of
>> testing on thousands of machines". Those two really do work.
>
> We did do the analyzing, and the only difference seems to be:

No.

Yinghai, we have had this discussion before, and dammit, you need to
understand the difference between "understanding the problem" and "put
in random values until it works on one machine".

There was absolutely _zero_ analysis done. You do not actually
understand WHY the numbers matter. You just look at two random
numbers, and one works, the other does not. That's not "analyzing".
That's just "random number games".

If you cannot see and understand the difference between an actual
analytical solution where you _understand_ what the code is doing and
why, and "random numbers that happen to work on one machine", I don't
know what to tell you.

> the good one is using 0x80000000
> and the bad one is using 0xa0000000.
>
> We are trying to figure out whether it needs a low address and only happened to work
> because the kernel was doing bottom-up allocation.

No.

Let me repeat my point one more time.

You have TWO choices. Not more, not less:

 - choice #1: go back to the old allocation model. It's tested. It
doesn't regress. Admittedly we may not know exactly _why_ it works,
and it might not work on all machines, but it doesn't cause
regressions (ie the machines it doesn't work on it _never_ worked on).

   And this doesn't mean "old value for that _one_ machine". It means
"old value for _every_ machine". So it means we revert the whole
top-down thing entirely. Not just "change one random number so that
the totally different allocation pattern happens to give the same
result on one particular machine".

   Quite frankly, I don't see the point of doing top-to-bottom anyway,
so I think we should do this regardless. Just revert the whole
"allocate from top". It didn't work for PCI, it's not working for this
case either. Stop doing it.

 - Choice #2: understand exactly _what_ goes wrong, and fix it
analytically (ie by _understanding_ the problem, and being able to
solve it exactly, and in a way you can argue about without having to
resort to "magic happens").

Now, the whole analytic approach (aka "computer sciency" approach),
where you can actually think about the problem without having any
pesky "reality" impact the solution is obviously the one we tend to
prefer. Sadly, it's seldom the one we can use in reality when it comes
to things like resource allocation, since we end up starting off with
often buggy approximations of what the actual hardware is all about
(ie broken firmware tables).

So I'd love to know exactly why one random number works, and why
another one doesn't. But as long as we do _not_ know the "Why" of it,
we will have to revert.

It really is that simple. It's _always_ that simple.

So the numbers shouldn't be "magic", they should have real
explanations. And in the absence of real explanation, the model that
works is "this is what we've always done". Including, very much, the
whole allocation order. Not just one random number on one random
machine.

                        Linus

Yinghai Lu April 14, 2011, 12:10 a.m. UTC | #10

On 04/13/2011 04:39 PM, Linus Torvalds wrote:
> On Wed, Apr 13, 2011 at 2:23 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>>
>>> What are all the magic numbers, and why would 0x80000000 be special?
>>
>> That is the old value from when the kernel was doing bottom-up bootmem allocation.
> 
> I understand, BUT THAT IS STILL A TOTALLY MAGIC NUMBER!
> 
> It makes it come out the same ON THAT ONE MACHINE.  So no, it's not
> "the old value". It's a random value that gets the old value in one
> specific case.

Alexandre's system works with 2.6.38.2, where the kernel allocates from 0xa4000000.
Joerg's system works with 2.6.39-rc3 when the top-down bootmem patch
	1a4a678b12c84db9ae5dce424e0e97f0559bb57c
is reverted, and the kernel then allocates at 0x80000000.
Alexandre's system also works when the alignment is increased to 1G, which makes the kernel
allocate 0x80000000 for the GART.

They do not work if the kernel allocates from 0xa0000000.

0xa0000000 looks like the same value as the radeon GTT base.


[    4.250159] radeon 0000:01:05.0: VRAM: 320M 0x00000000C0000000 - 0x00000000D3FFFFFF (320M used)
[    4.258830] radeon 0000:01:05.0: GTT: 512M 0x00000000A0000000 - 0x00000000BFFFFFFF
[    4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
[    4.271549] [drm] RAM width 32bits DDR
[    4.275435] [TTM] Zone  kernel: Available graphics memory: 1896526 kiB.
[    4.282066] [TTM] Initializing pool allocator.
[    4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
[    4.293076] [drm] radeon: 320M of VRAM memory ready
[    4.298277] [drm] radeon: 512M of GTT memory ready.
[    4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    4.309854] [drm] Driver supports precise vblank timestamp query.
[    4.315970] [drm] radeon: irq initialized.
[    4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072

Alex said that 0xa0000000 is OK and comes from the GPU address space:
---
The VRAM and GTT addresses in the dmesg are internal GPU addresses not
system addresses.  The GPU has its own internal address space for
on-chip memory clients (texture samplers, render buffers, display
controllers, etc.).  The GPU sets up two apertures in its internal
address space and on-chip client requests are forwarded to the
appropriate place by the GPU's memory controller.  Addresses in the
GPU's VRAM aperture go to local vram on discrete cards, or to the
stolen memory at the top of system memory for IGP cards.  Addresses in
the GPU's GTT aperture hit a page table and get forwarded to the
appropriate dma pages.
---

> 
>>> Why don't we write code that just works?
>>>
>>> Or absent a "just works" set of patches, why don't we revert to code
>>> that has years of testing?
>>>
>>> This kind of "I broke things, so now I will jiggle things randomly
>>> until they unbreak" is not acceptable.
>>>
>>> Either explain why that fixes a real BUG (and why the magic constants
>>> need to be what they are), or just revert the patch that caused the
>>> problem, and go back to the allocation patterns that have years of
>>> experience.
>>>
>>> Guys, we've had this discussion before, in PCI allocation. We don't do
>>> this. We tried switching the PCI region allocations to top-down, and
>>> IT WAS A FAILURE. We reverted it to what we had years of testing with.
>>>
>>> Don't just make random changes. There really are only two acceptable
>>> models of development: "think and analyze" or "years and years of
>>> testing on thousands of machines". Those two really do work.
>>
>> We did do the analyzing, and the only difference seems to be:
> 
> No.
> 
> Yinghai, we have had this discussion before, and dammit, you need to
> understand the difference between "understanding the problem" and "put
> in random values until it works on one machine".
> 
> There was absolutely _zero_ analysis done. You do not actually
> understand WHY the numbers matter. You just look at two random
> numbers, and one works, the other does not. That's not "analyzing".
> That's just "random number games".
> 
> If you cannot see and understand the difference between an actual
> analytical solution where you _understand_ what the code is doing and
> why, and "random numbers that happen to work on one machine", I don't
> know what to tell you.
> 
>> the good one is using 0x80000000
>> and the bad one is using 0xa0000000.
>>
>> We are trying to figure out whether it needs a low address and only happened to work
>> because the kernel was doing bottom-up allocation.
> 
> No.
> 
> Let me repeat my point one more time.
> 
> You have TWO choices. Not more, not less:
> 
>  - choice #1: go back to the old allocation model. It's tested. It
> doesn't regress. Admittedly we may not know exactly _why_ it works,
> and it might not work on all machines, but it doesn't cause
> regressions (ie the machines it doesn't work on it _never_ worked on).
> 
>    And this doesn't mean "old value for that _one_ machine". It means
> "old value for _every_ machine". So it means we revert the whole
> top-down thing entirely. Not just "change one random number so that
> the totally different allocation pattern happens to give the same
> result on one particular machine".
> 
>    Quite frankly, I don't see the point of doing top-to-bottom anyway,
> so I think we should do this regardless. Just revert the whole
> "allocate from top". It didn't work for PCI, it's not working for this
> case either. Stop doing it.

We added some code to prevent bootmem from using the low range.

> 
>  - Choice #2: understand exactly _what_ goes wrong, and fix it
> analytically (ie by _understanding_ the problem, and being able to
> solve it exactly, and in a way you can argue about without having to
> resort to "magic happens").
> 
> Now, the whole analytic approach (aka "computer sciency" approach),
> where you can actually think about the problem without having any
> pesky "reality" impact the solution is obviously the one we tend to
> prefer. Sadly, it's seldom the one we can use in reality when it comes
> to things like resource allocation, since we end up starting off with
> often buggy approximations of what the actual hardware is all about
> (ie broken firmware tables).
> 
> So I'd love to know exactly why one random number works, and why
> another one doesn't. But as long as we do _not_ know the "Why" of it,
> we will have to revert.
> 
> It really is that simple. It's _always_ that simple.
> 
> So the numbers shouldn't be "magic", they should have real
> explanations. And in the absence of real explanation, the model that
> works is "this is what we've always done". Including, very much, the
> whole allocation order. Not just one random number on one random
> machine.

OK, let's try to figure out why 0xa0000000 cannot be used.

If we cannot figure it out, we can revert

1a4a678b12c84db9ae5dce424e0e97f0559bb57c

thanks

Yinghai
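
The numbers in the dmesg excerpt above can be cross-checked with a small sketch. The struct and helper names are made up for illustration, and the 4 KiB page size used in the note below is an assumption; the ranges are the ones printed in the logs quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* An address range with inclusive bounds, as printed in the logs. */
struct toy_range {
	uint64_t start;
	uint64_t last;
};

/* Size in bytes of an inclusive range. */
static uint64_t toy_range_size(struct toy_range r)
{
	assert(r.last >= r.start);
	return r.last - r.start + 1;
}

/* Return 1 if the two inclusive ranges share at least one address. */
static int toy_ranges_overlap(struct toy_range a, struct toy_range b)
{
	return a.start <= b.last && b.start <= a.last;
}
```

The VRAM range 0xC0000000-0xD3FFFFFF comes out at 320 MiB and the GTT range 0xA0000000-0xBFFFFFFF at 512 MiB, which matches 131072 pages of an assumed 4 KiB each. The reserved aperture 0xa0000000-0xa3ffffff is 64 MiB and overlaps the GTT range numerically, whereas an aperture at 0x80000000 would not. As Alex's explanation above points out, the GTT addresses are GPU-internal, so the numeric overlap by itself does not prove a conflict in system address space.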

H. Peter Anvin April 14, 2011, 2:03 a.m. UTC | #11

On 04/13/2011 04:39 PM, Linus Torvalds wrote:
> 
>  - Choice #2: understand exactly _what_ goes wrong, and fix it
> analytically (ie by _understanding_ the problem, and being able to
> solve it exactly, and in a way you can argue about without having to
> resort to "magic happens").
> 
> Now, the whole analytic approach (aka "computer sciency" approach),
> where you can actually think about the problem without having any
> pesky "reality" impact the solution is obviously the one we tend to
> prefer. Sadly, it's seldom the one we can use in reality when it comes
> to things like resource allocation, since we end up starting off with
> often buggy approximations of what the actual hardware is all about
> (ie broken firmware tables).
> 
> So I'd love to know exactly why one random number works, and why
> another one doesn't. But as long as we do _not_ know the "Why" of it,
> we will have to revert.
> 

Yes.  However, even if we *do* revert (and the time is running short on
not reverting) I would like to understand this particular one, simply
because I think it may very well be a problem that is manifesting itself
in other ways on other systems.

The other thing that this has uncovered is that we already have a bunch
of complete b*llsh*t magic numbers in this path, some of which are
trivially shown to be wrong or at least completely arbitrary, so there
are more issues here :(

	-hpa

Linus Torvalds April 14, 2011, 2:27 a.m. UTC | #12

On Wednesday, April 13, 2011, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Yes.  However, even if we *do* revert (and the time is running short on
> not reverting) I would like to understand this particular one, simply
> because I think it may very well be a problem that is manifesting itself
> in other ways on other systems.
>
> The other thing that this has uncovered is that we already have a bunch
> of complete b*llsh*t magic numbers in this

Linus Torvalds April 14, 2011, 2:33 a.m. UTC | #13

On Wednesday, April 13, 2011, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wednesday, April 13, 2011, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Yes.  However, even if we *do* revert (and the time is running short on
>> not reverting) I would like to understand this particular one, simply
>> because I think it may very well be a problem that is manifesting itself
>> in other ways on other systems.

 sorry, fingerfart. Anyway, I agree 100%.

 we definitely want to also understand the reason for things not
working, even if we do revert..

        Linus
>> of complete b*llsh*t magic numbers in this
>

Tejun Heo April 14, 2011, 4:03 a.m. UTC | #14

Hello,

On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> On Wednesday, April 13, 2011, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wednesday, April 13, 2011, H. Peter Anvin <hpa@zytor.com> wrote:
> >>
> >> Yes.  However, even if we *do* revert (and the time is running short on
> >> not reverting) I would like to understand this particular one, simply
> >> because I think it may very well be a problem that is manifesting itself
> >> in other ways on other systems.
> 
>  sorry, fingerfart. Anyway, I agree 100%.
> 
>  we definitely want to also understand the reason for things not
> working, even if we do revert..

There were (and still are) places where memblock callers implemented
ad-hoc top-down allocation by stepping down start limit until
allocation succeeds.  Several of them have been removed since top-down
became the default behavior, so simply reverting the commit is likely
to cause subtle issues.  Maybe the best approach is introducing
@topdown parameter and use it selectively for pure memory allocations.

Thanks.
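
A minimal sketch of the caller-side pattern described above: emulating top-down allocation on top of a bottom-up search by lowering the start limit until the search succeeds. Everything here is a toy model with hypothetical names and a single free region; it is not the memblock API and does not reproduce any specific caller in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NOT_FOUND UINT64_MAX	/* hypothetical failure sentinel */

/* One free region [start, end), standing in for the free memory map. */
struct toy_region {
	uint64_t start;
	uint64_t end;
};

/* Lowest aligned base in [start, end) within ram where size bytes fit. */
static uint64_t toy_find_bottom_up(const struct toy_region *ram,
				   uint64_t start, uint64_t end,
				   uint64_t size, uint64_t align)
{
	uint64_t lo = start > ram->start ? start : ram->start;
	uint64_t hi = end < ram->end ? end : ram->end;
	uint64_t base;

	assert(align != 0 && (align & (align - 1)) == 0);
	base = (lo + align - 1) & ~(align - 1);
	if (base < lo || base > hi || hi - base < size)
		return TOY_NOT_FOUND;
	return base;
}

/* Ad-hoc top-down: step the start limit down until the search succeeds. */
static uint64_t toy_adhoc_top_down(const struct toy_region *ram,
				   uint64_t limit, uint64_t size,
				   uint64_t align, uint64_t step)
{
	uint64_t start;

	assert(step != 0);
	if (limit < size)
		return TOY_NOT_FOUND;

	start = limit - size;
	for (;;) {
		uint64_t base = toy_find_bottom_up(ram, start, limit,
						   size, align);
		if (base != TOY_NOT_FOUND)
			return base;
		if (start == 0)
			return TOY_NOT_FOUND;
		start = start > step ? start - step : 0;
	}
}
```

With an assumed free region [0x100000, 0xb7ff0000), a 4 GB limit, and 64 MB for size, alignment and step, the loop settles on 0xb0000000, whereas a plain bottom-up search from 0 returns 0x04000000. Callers written this way depend on the direction of the underlying search, which is why changing the default direction in either direction can affect them.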

Alan Cox April 14, 2011, 8:09 a.m. UTC | #15

On Wed, 13 Apr 2011 19:33:40 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wednesday, April 13, 2011, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wednesday, April 13, 2011, H. Peter Anvin <hpa@zytor.com> wrote:
> >>
> >> Yes.  However, even if we *do* revert (and the time is running short on
> >> not reverting) I would like to understand this particular one, simply
> >> because I think it may very well be a problem that is manifesting itself
> >> in other ways on other systems.
> 
>  sorry, fingerfart. Anyway, I agree 100%.
> 
>  we definitely want to also understand the reason for things not
> working, even if we do revert..

Definitely because if it fails when the "magic" involves the GART base it
starts to sound like something may be hitting the wrong address space or
not flushing properly.

Joerg Roedel April 14, 2011, 8:59 a.m. UTC | #16

On Wed, Apr 13, 2011 at 03:31:09PM -0700, H. Peter Anvin wrote:
> On 04/13/2011 03:22 PM, Joerg Roedel wrote:
> > On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote:
> >> On 04/13/2011 02:50 PM, Joerg Roedel wrote:
> >>> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
> >>>> -	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
> >>>> +	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
> >>>
> >>> Btw, while looking at this code I wondered why the 512M goal is enforced
> >>> by the alignment. Start could be set to 512M instead and the alignment
> >>> can be aper_size as it should. Any reason for such a big alignment?
> >>>
> >>> 	Joerg
> >>>
> >>> P.S.: The box is still in the office, I will try this debug-patch
> >>>       tomorrow.
> >>
> >> The only reason that I can think of is that the aperture itself can be
> >> huge, and perhaps 512 MiB is the biggest such known. 
> > 
> > Well, that would work as well by just using aper_size as alignment, the
> > aperture needs to be aligned on its size anyway. This code only runs
> > > when Linux allocates the aperture itself and, if I am not mistaken, it always
> > > uses 64MB when doing this.
> 
> Yes, I would agree with that.  The sane thing would be to set the base
> to whatever address needs to be guarded against (WHICH SHOULD BE
> MOTIVATED), and use aper_size as alignment, *unless* we are only using
> the initial portion of a much larger hardware structure that needs
> natural alignment (which isn't clear to me, I do know we sometimes use
> only a fraction of the GART, but that doesn't mean we need to
> naturally-align the entire thing, nor that 512 MiB is sufficient to do so.)

What's allocated here is the address space for the aperture. The code
actually allocates the memory, but all it needs is the physical address
range. This range is later programmed into hardware as the GART aperture
(the area the GART remaps).
The Linux code can split the aperture if necessary for DMA-API usage and
AGP usage. In that case each user gets half of the aperture and manages
it itself.

	Joerg

Joerg Roedel April 14, 2011, 9:36 a.m. UTC | #17

On Thu, Apr 14, 2011 at 01:03:37PM +0900, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> > On Wednesday, April 13, 2011, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Wednesday, April 13, 2011, H. Peter Anvin <hpa@zytor.com> wrote:
> > >>
> > >> Yes.  However, even if we *do* revert (and the time is running short on
> > >> not reverting) I would like to understand this particular one, simply
> > >> because I think it may very well be a problem that is manifesting itself
> > >> in other ways on other systems.
> > 
> >  sorry, fingerfart. Anyway, I agree 100%.
> > 
> >  we definitely want to also understand the reason for things not
> > working, even if we do revert..
> 
> There were (and still are) places where memblock callers implemented
> ad-hoc top-down allocation by stepping down start limit until
> allocation succeeds.  Several of them have been removed since top-down
> became the default behavior, so simply reverting the commit is likely
> to cause subtle issues.  Maybe the best approach is introducing
> @topdown parameter and use it selectively for pure memory allocations.

Wouldn't it be better to provide a separate memblock allocation
function which operates top-down and use this one in the places that
need it? This way it wouldn't break code that relies on bottom-up.

	Joerg

Linus Torvalds May 6, 2011, 9:17 p.m. UTC | #18

On Wednesday, April 13, 2011, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wednesday, April 13, 2011, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> Yes.  However, even if we *do* revert (and the time is running short on
>> not reverting) I would like to understand this particular one, simply
>> because I think it may very well be a problem that is manifesting itself
>> in other ways on other systems.

 sorry, fingerfart. Anyway, I agree 100%.

 we definitely want to also understand the reason for things not
working, even if we do revert..

        Linus
>> of complete b*llsh*t magic numbers in this
>
