diff mbox

[1/3] drm/i915: Check if the stolen memory "reserved" area is enabled or not

Message ID 20171102151737.23336-1-ville.syrjala@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ville Syrjälä Nov. 2, 2017, 3:17 p.m. UTC
From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Apparently there are some machines that put semi-sensible looking values
into the stolen "reserved" base and size, except those values are actually
outside the stolen memory. There is a bit in the register which
supposedly could tell us whether the reserved area is even enabled or
not. Let's check for that before we go trusting the base and size.

Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem_stolen.c | 30 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h        |  2 ++
 2 files changed, 32 insertions(+)

Comments

Ville Syrjälä Nov. 2, 2017, 5:08 p.m. UTC | #1
On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:
> == Series Details ==
> 
> Series: series starting with [1/3] drm/i915: Check if the stolen memory "reserved" area is enabled or not
> URL   : https://patchwork.freedesktop.org/series/33060/
> State : warning
> 
> == Summary ==
> 
> Test kms_busy:
>         Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
>                 dmesg-warn -> PASS       (shard-hsw)
>         Subgroup extended-modeset-hang-newfb-with-reset-render-B:
>                 pass       -> DMESG-WARN (shard-hsw)

Hmm. The warn was there already AFAICS. I wonder why this is claiming
things were passing?

Also shard-glkb didn't seem to get any results from this run. No idea
why, nor why this summary fails to mention that fact.

Oh and BTW the boot/dmesg links from the shard results don't seem to
work very well. Sometimes it just gets you an empty log and you have
to manually find a file that has some actual content in it.

> 
> shard-hsw        total:2539 pass:1432 dwarn:2   dfail:0   fail:8   skip:1097 time:9313s
> 
> == Logs ==
> 
> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards.html

Looking at the results we go from
<7>[    2.822784] [drm:i915_ggtt_probe_hw [i915]] GTT stolen size = 128M
<7>[    2.826115] [drm:i915_gem_init_stolen [i915]] Stolen reserved area
[0x0000000001100000 - 0x0000000001200000] outside stolen memory
[0x00000000c7a00000 - 0x00000000cfa00000]

to
<7>[    2.909693] [drm:i915_ggtt_probe_hw [i915]] GTT stolen size = 128M
<7>[    2.912226] [drm:i915_gem_init_stolen [i915]] Memory reserved for
graphics device: 131072K, usable: 131072K

on shard-snb6 at least.

After going through the dmesgs for all the other machines we have in ci,
it doesn't look like there were any other changes in the amount of stolen
memory we detect (well, couldn't check shard-glkb due to lack fo results).
Saarinen, Jani Nov. 3, 2017, 7:19 a.m. UTC | #2
Hi, 
> -----Original Message-----

> From: Intel-gfx [mailto:intel-gfx-bounces@lists.freedesktop.org] On Behalf Of

> Ville Syrjälä

> Sent: torstai 2. marraskuuta 2017 19.08

> To: intel-gfx@lists.freedesktop.org

> Cc: Sarvela, Tomi P <tomi.p.sarvela@intel.com>

> Subject: Re: [Intel-gfx] ✗ Fi.CI.IGT: warning for series starting with [1/3]

> drm/i915: Check if the stolen memory "reserved" area is enabled or not

> 

> On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:

> > == Series Details ==

> >

> > Series: series starting with [1/3] drm/i915: Check if the stolen memory

> "reserved" area is enabled or not

> > URL   : https://patchwork.freedesktop.org/series/33060/

> > State : warning

> >

> > == Summary ==

> >

> > Test kms_busy:

> >         Subgroup extended-modeset-hang-oldfb-with-reset-render-A:

> >                 dmesg-warn -> PASS       (shard-hsw)

> >         Subgroup extended-modeset-hang-newfb-with-reset-render-B:

> >                 pass       -> DMESG-WARN (shard-hsw)

> 

> Hmm. The warn was there already AFAICS. I wonder why this is claiming

> things were passing?

Tomi, Arek? 
> 

> Also shard-glkb didn't seem to get any results from this run. No idea why, nor

> why this summary fails to mention that fact.

We are not getting runs from GLK for pre-merge, right Tomi
> 

> Oh and BTW the boot/dmesg links from the shard results don't seem to work

> very well. Sometimes it just gets you an empty log and you have to manually

> find a file that has some actual content in it.

> 

Tomi again.
> >

> > shard-hsw        total:2539 pass:1432 dwarn:2   dfail:0   fail:8   skip:1097

> time:9313s

> >



Jani Saarinen
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
Sarvela, Tomi P Nov. 3, 2017, 8:18 a.m. UTC | #3
On 02/11/17 19:08, Ville Syrjälä wrote:
> On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:
>> == Series Details ==
>>
>> Series: series starting with [1/3] drm/i915: Check if the stolen memory "reserved" area is enabled or not
>> URL   : https://patchwork.freedesktop.org/series/33060/
>> State : warning
>>
>> == Summary ==
>>
>> Test kms_busy:
>>          Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
>>                  dmesg-warn -> PASS       (shard-hsw)
>>          Subgroup extended-modeset-hang-newfb-with-reset-render-B:
>>                  pass       -> DMESG-WARN (shard-hsw)
> 
> Hmm. The warn was there already AFAICS. I wonder why this is claiming
> things were passing?

The sharded result for the run is 'warning'. What do you mean?

> Also shard-glkb didn't seem to get any results from this run. No idea
> why, nor why this summary fails to mention that fact.

One GLK died, and 4 GLKs take nearly 90 minutes to run shards. It's not 
reasonable to wait for them on every run and queue everything else, so 
Patchwork/Trybot runs are skipped on GLK. Anything can be run and added 
to results manually, if needed.

As a comparison, SNB/HSW/APL all run shards in ~35 minutes. KBL is the 
only one that gets rebooted between shards (due to leaking context), and 
takes around 50 minutes.

> Oh and BTW the boot/dmesg links from the shard results don't seem to
> work very well. Sometimes it just gets you an empty log and you have
> to manually find a file that has some actual content in it.

If there is no bootlog for a run, the host has not booted between runs. 
It's maybe not intuitively clear. I've tried to communicate that the 
shard runs are shifting to conditional reboots (booted only if hung).

Tomi

>> shard-hsw        total:2539 pass:1432 dwarn:2   dfail:0   fail:8   skip:1097 time:9313s
>>
>> == Logs ==
>>
>> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards.html
> 
> Looking at the results we go from
> <7>[    2.822784] [drm:i915_ggtt_probe_hw [i915]] GTT stolen size = 128M
> <7>[    2.826115] [drm:i915_gem_init_stolen [i915]] Stolen reserved area
> [0x0000000001100000 - 0x0000000001200000] outside stolen memory
> [0x00000000c7a00000 - 0x00000000cfa00000]
> 
> to
> <7>[    2.909693] [drm:i915_ggtt_probe_hw [i915]] GTT stolen size = 128M
> <7>[    2.912226] [drm:i915_gem_init_stolen [i915]] Memory reserved for
> graphics device: 131072K, usable: 131072K
> 
> on shard-snb6 at least.
> 
> After going through the dmesgs for all the other machines we have in ci,
> it doesn't look like there were any other changes in the amount of stolen
> memory we detect (well, couldn't check shard-glkb due to lack fo results).

There is GLK in Farm1, so you can check the bootlogs from fast-feedback 
run if that part is what you're interested in. All the other gens too.

Tomi
Ville Syrjälä Nov. 3, 2017, 10:41 a.m. UTC | #4
On Fri, Nov 03, 2017 at 10:18:32AM +0200, Tomi Sarvela wrote:
> On 02/11/17 19:08, Ville Syrjälä wrote:
> > On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:
> >> == Series Details ==
> >>
> >> Series: series starting with [1/3] drm/i915: Check if the stolen memory "reserved" area is enabled or not
> >> URL   : https://patchwork.freedesktop.org/series/33060/
> >> State : warning
> >>
> >> == Summary ==
> >>
> >> Test kms_busy:
> >>          Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
> >>                  dmesg-warn -> PASS       (shard-hsw)
> >>          Subgroup extended-modeset-hang-newfb-with-reset-render-B:
> >>                  pass       -> DMESG-WARN (shard-hsw)
> > 
> > Hmm. The warn was there already AFAICS. I wonder why this is claiming
> > things were passing?
> 
> The sharded result for the run is 'warning'. What do you mean?

This should read 'DMESG-WARN -> DMESG-WARN' rather than 'pass ->
DMESG-WARN'. If I click the link to open up the results in the browser
this test isn't shown on the shards.html, but I can see it as
orange->orange in shards-all.html.

> 
> > Also shard-glkb didn't seem to get any results from this run. No idea
> > why, nor why this summary fails to mention that fact.
> 
> One GLK died, and 4 GLKs take nearly 90 minutes to run shards. It's not 
> reasonable to wait for them on every run and queue everything else, so 
> Patchwork/Trybot runs are skipped on GLK. Anything can be run and added 
> to results manually, if needed.

If they aren't being run, then maybe they shouldn't be part of the
shards.html for pw/trybot runs? Would save me having to wonder why I'm
getting empty results. Or we should have some kind of indication why we
didn't get any results for a particular machine (ie. whether it was
expected or not).

> 
> As a comparison, SNB/HSW/APL all run shards in ~35 minutes. KBL is the 
> only one that gets rebooted between shards (due to leaking context), and 
> takes around 50 minutes.
> 
> > Oh and BTW the boot/dmesg links from the shard results don't seem to
> > work very well. Sometimes it just gets you an empty log and you have
> > to manually find a file that has some actual content in it.
> 
> If there is no bootlog for a run, the host has not booted between runs. 
> It's maybe not intuitively clear. I've tried to communicate that the 
> shard runs are shifting to conditional reboots (booted only if hung).

I don't know how that relates to the several dmesg/boot logs I see
in the directory. And why the web links often seems to point to the
empty files instead of ones with actual content in them.

Also the files are numbered in some way. Whether there's any significance
to those numbers I can't really tell.

> 
> Tomi
> 
> >> shard-hsw        total:2539 pass:1432 dwarn:2   dfail:0   fail:8   skip:1097 time:9313s
> >>
> >> == Logs ==
> >>
> >> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards.html
> > 
> > Looking at the results we go from
> > <7>[    2.822784] [drm:i915_ggtt_probe_hw [i915]] GTT stolen size = 128M
> > <7>[    2.826115] [drm:i915_gem_init_stolen [i915]] Stolen reserved area
> > [0x0000000001100000 - 0x0000000001200000] outside stolen memory
> > [0x00000000c7a00000 - 0x00000000cfa00000]
> > 
> > to
> > <7>[    2.909693] [drm:i915_ggtt_probe_hw [i915]] GTT stolen size = 128M
> > <7>[    2.912226] [drm:i915_gem_init_stolen [i915]] Memory reserved for
> > graphics device: 131072K, usable: 131072K
> > 
> > on shard-snb6 at least.
> > 
> > After going through the dmesgs for all the other machines we have in ci,
> > it doesn't look like there were any other changes in the amount of stolen
> > memory we detect (well, couldn't check shard-glkb due to lack fo results).
> 
> There is GLK in Farm1, so you can check the bootlogs from fast-feedback 
> run if that part is what you're interested in. All the other gens too.

I did check all the BAT runs. But I have no idea if the glks there are
the same board as what we have as gklb in the shard. So not sure if I
actually checked all the N machine types we have, or just N-1.
Sarvela, Tomi P Nov. 3, 2017, 11:14 a.m. UTC | #5
On 03/11/17 12:41, Ville Syrjälä wrote:
> On Fri, Nov 03, 2017 at 10:18:32AM +0200, Tomi Sarvela wrote:
>> On 02/11/17 19:08, Ville Syrjälä wrote:
>>> On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:
>>>>
>>>> Test kms_busy:
>>>>           Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
>>>>                   dmesg-warn -> PASS       (shard-hsw)
>>>>           Subgroup extended-modeset-hang-newfb-with-reset-render-B:
>>>>                   pass       -> DMESG-WARN (shard-hsw)
>>>
>>> Hmm. The warn was there already AFAICS. I wonder why this is claiming
>>> things were passing?
>>
>> The sharded result for the run is 'warning'. What do you mean?
> 
> This should read 'DMESG-WARN -> DMESG-WARN' rather than 'pass ->
> DMESG-WARN'. If I click the link to open up the results in the browser
> this test isn't shown on the shards.html, but I can see it as
> orange->orange in shards-all.html.

I'll see what has happened there. Clearly a bug.

>>> Also shard-glkb didn't seem to get any results from this run. No idea
>>> why, nor why this summary fails to mention that fact.
>>
>> One GLK died, and 4 GLKs take nearly 90 minutes to run shards. It's not
>> reasonable to wait for them on every run and queue everything else, so
>> Patchwork/Trybot runs are skipped on GLK. Anything can be run and added
>> to results manually, if needed.
> 
> If they aren't being run, then maybe they shouldn't be part of the
> shards.html for pw/trybot runs? Would save me having to wonder why I'm
> getting empty results. Or we should have some kind of indication why we
> didn't get any results for a particular machine (ie. whether it was
> expected or not).

Because of people, we can't always be sure if missing machine is 
expected or not. There is no hard reservation system, and machines are 
taken out of the farm for debugging, bios upgrades, that kind of stuff.

Dropping shard from shards.html if there's no comparable results 
(missing either baseline, or patchset run) would probably be a good thing.

>> As a comparison, SNB/HSW/APL all run shards in ~35 minutes. KBL is the
>> only one that gets rebooted between shards (due to leaking context), and
>> takes around 50 minutes.
>>
>>> Oh and BTW the boot/dmesg links from the shard results don't seem to
>>> work very well. Sometimes it just gets you an empty log and you have
>>> to manually find a file that has some actual content in it.
>>
>> If there is no bootlog for a run, the host has not booted between runs.
>> It's maybe not intuitively clear. I've tried to communicate that the
>> shard runs are shifting to conditional reboots (booted only if hung).
> 
> I don't know how that relates to the several dmesg/boot logs I see
> in the directory. And why the web links often seems to point to the
> empty files instead of ones with actual content in them.

I could link to the original bootlog, but that doesn't answer question 
"what happened just before this shard was run"

Sometimes there's leftover dmesg between shardruns, which gets recorded 
in boot.log. Earlier name was "dmesg_before_run.log" or something like that.

> Also the files are numbered in some way. Whether there's any significance
> to those numbers I can't really tell.

Shards are fed to testhosts from 0 to 34, linear order. Filenames have 
that run number. On one host smaller number has been run before larger 
number. On shardruns the number is used to choose shard testlist in our 
IGT package (./shards/x0000)

We could also do several runs with same kernel and testlist, but 
cibuglog can't yet handle the statistics for these kind of results.

>>> After going through the dmesgs for all the other machines we have in ci,
>>> it doesn't look like there were any other changes in the amount of stolen
>>> memory we detect (well, couldn't check shard-glkb due to lack fo results).
>>
>> There is GLK in Farm1, so you can check the bootlogs from fast-feedback
>> run if that part is what you're interested in. All the other gens too.
> 
> I did check all the BAT runs. But I have no idea if the glks there are
> the same board as what we have as gklb in the shard. So not sure if I
> actually checked all the N machine types we have, or just N-1.

Shards have two different boards, and one of those is in Farm1 too. 
Farm1 has different eDP panel, though.

Tomi
Sarvela, Tomi P Nov. 3, 2017, 11:33 a.m. UTC | #6
On 03/11/17 12:43, Petri Latvala wrote:
> On Fri, Nov 03, 2017 at 10:18:32AM +0200, Tomi Sarvela wrote:
>> On 02/11/17 19:08, Ville Syrjälä wrote:
>>> On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:
>>>> == Series Details ==
>>>>
>>>> Series: series starting with [1/3] drm/i915: Check if the stolen memory "reserved" area is enabled or not
>>>> URL   : https://patchwork.freedesktop.org/series/33060/
>>>> State : warning
>>>>
>>>> == Summary ==
>>>>
>>>> Test kms_busy:
>>>>           Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
>>>>                   dmesg-warn -> PASS       (shard-hsw)
>>>>           Subgroup extended-modeset-hang-newfb-with-reset-render-B:
>>>>                   pass       -> DMESG-WARN (shard-hsw)
>>>
>>> Hmm. The warn was there already AFAICS. I wonder why this is claiming
>>> things were passing?
>>
>> The sharded result for the run is 'warning'. What do you mean?
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards.html
> 
> The comparison is done to CI_DRM_3309 according to this, and
> kms_busy@extended-modeset-hang-newfb-with-reset-render-B was
> dmesg-warn there. Why is the above diff showing pass -> DMESG-WARN?

Found the reason: some of the shards have been run twice for CI_DRM_3309 
(my mistake, probably). This has the unintended effect on visualization 
/ comparison which create independently: results depends on loading 
order if there's flip-flops, and differences might show up.

I'll remove duplicate shards and re-run the html.

Tomi
Sarvela, Tomi P Nov. 3, 2017, 12:53 p.m. UTC | #7
On 03/11/17 13:33, Tomi Sarvela wrote:
> On 03/11/17 12:43, Petri Latvala wrote:
>> On Fri, Nov 03, 2017 at 10:18:32AM +0200, Tomi Sarvela wrote:
>>> On 02/11/17 19:08, Ville Syrjälä wrote:
>>>> On Thu, Nov 02, 2017 at 04:34:26PM -0000, Patchwork wrote:
>>>>> == Series Details ==
>>>>>
>>>>> Series: series starting with [1/3] drm/i915: Check if the stolen 
>>>>> memory "reserved" area is enabled or not
>>>>> URL   : https://patchwork.freedesktop.org/series/33060/
>>>>> State : warning
>>>>>
>>>>> == Summary ==
>>>>>
>>>>> Test kms_busy:
>>>>>           Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
>>>>>                   dmesg-warn -> PASS       (shard-hsw)
>>>>>           Subgroup extended-modeset-hang-newfb-with-reset-render-B:
>>>>>                   pass       -> DMESG-WARN (shard-hsw)
>>>>
>>>> Hmm. The warn was there already AFAICS. I wonder why this is claiming
>>>> things were passing?
>>>
>>> The sharded result for the run is 'warning'. What do you mean?
>>
>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards.html
>>
>> The comparison is done to CI_DRM_3309 according to this, and
>> kms_busy@extended-modeset-hang-newfb-with-reset-render-B was
>> dmesg-warn there. Why is the above diff showing pass -> DMESG-WARN?
> 
> Found the reason: some of the shards have been run twice for CI_DRM_3309 
> (my mistake, probably). This has the unintended effect on visualization 
> / comparison which create independently: results depends on loading 
> order if there's flip-flops, and differences might show up.
> 
> I'll remove duplicate shards and re-run the html.

Cleaned up CI_DRM_3309 run. Sorry for the confusion.

Made the improvement of dropping hosts without needed runs when asking 
for changes: there is no comparable changes if one of two runs is 
missing. They can be seen in shards-all.html though.

Re-ran Patchwork_6930 as an example: (SUCCESS)

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards.html
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6930/shards-all.html

Test kms_flip:
         Subgroup plain-flip-fb-recreate-interruptible:
                 pass       -> FAIL       (shard-hsw) fdo#100368
Test kms_busy:
         Subgroup extended-modeset-hang-oldfb-with-reset-render-A:
                 dmesg-warn -> PASS       (shard-hsw)

fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368

shard-hsw        total:2539 pass:1431 dwarn:2   dfail:0   fail:9 
skip:1097 time:9313s


Tomi
Zanoni, Paulo R Nov. 14, 2017, 8:12 p.m. UTC | #8
Em Qui, 2017-11-02 às 17:17 +0200, Ville Syrjala escreveu:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Apparently there are some machines that put semi-sensible looking
> values
> into the stolen "reserved" base and size, except those values are
> actually
> outside the stolen memory. There is a bit in the register which
> supposedly could tell us whether the reserved area is even enabled or
> not. Let's check for that before we go trusting the base and size.

If this is such a problem since g4x, why didn't we spot it earlier? It
would be nice if you could explain in the commit message (or at least
in this email) what are the consequences you're seeing that made you
realize about this problem. Did something actually explode? I'm
genuinely curious.


Now talking about the patch itself:

If we're going to assume random bits instead of a full-zero in
(possibly) uninitialized stuff, shouldn't we first check bit 1, which
is supposed to tell us if the whole reg is locked or not?

if ((reg_val & 0x3) != 0x3)
	ignore stolen reserved stuff;

Anyway, this patch without my suggestions is probably better than the
current situation so:
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
but please feel investigate the bit 1 thing and CC me on v2 or a
possible follow-up patch with conclusions.


> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_stolen.c | 30
> ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_reg.h        |  2 ++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c
> b/drivers/gpu/drm/i915/i915_gem_stolen.c
> index 03e7abc7e043..44a5dbc8c23b 100644
> --- a/drivers/gpu/drm/i915/i915_gem_stolen.c
> +++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
> @@ -294,6 +294,12 @@ static void g4x_get_stolen_reserved(struct
> drm_i915_private *dev_priv,
>  				     ELK_STOLEN_RESERVED);
>  	dma_addr_t stolen_top = dev_priv->mm.stolen_base + ggtt-
> >stolen_size;
>  
> +	if ((reg_val & G4X_STOLEN_RESERVED_ENABLE) == 0) {
> +		*base = 0;
> +		*size = 0;
> +		return;
> +	}
> +
>  	*base = (reg_val & G4X_STOLEN_RESERVED_ADDR2_MASK) << 16;
>  
>  	WARN_ON((reg_val & G4X_STOLEN_RESERVED_ADDR1_MASK) < *base);
> @@ -313,6 +319,12 @@ static void gen6_get_stolen_reserved(struct
> drm_i915_private *dev_priv,
>  {
>  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
>  
> +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> +		*base = 0;
> +		*size = 0;
> +		return;
> +	}
> +
>  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
>  
>  	switch (reg_val & GEN6_STOLEN_RESERVED_SIZE_MASK) {
> @@ -339,6 +351,12 @@ static void gen7_get_stolen_reserved(struct
> drm_i915_private *dev_priv,
>  {
>  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
>  
> +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> +		*base = 0;
> +		*size = 0;
> +		return;
> +	}
> +
>  	*base = reg_val & GEN7_STOLEN_RESERVED_ADDR_MASK;
>  
>  	switch (reg_val & GEN7_STOLEN_RESERVED_SIZE_MASK) {
> @@ -359,6 +377,12 @@ static void chv_get_stolen_reserved(struct
> drm_i915_private *dev_priv,
>  {
>  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
>  
> +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> +		*base = 0;
> +		*size = 0;
> +		return;
> +	}
> +
>  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
>  
>  	switch (reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK) {
> @@ -387,6 +411,12 @@ static void bdw_get_stolen_reserved(struct
> drm_i915_private *dev_priv,
>  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
>  	dma_addr_t stolen_top;
>  
> +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> +		*base = 0;
> +		*size = 0;
> +		return;
> +	}
> +
>  	stolen_top = dev_priv->mm.stolen_base + ggtt->stolen_size;
>  
>  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> diff --git a/drivers/gpu/drm/i915/i915_reg.h
> b/drivers/gpu/drm/i915/i915_reg.h
> index f0f8f6059652..dc2cbc428d1b 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -382,6 +382,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t
> reg)
>  #define GEN8_STOLEN_RESERVED_2M		(1 << 7)
>  #define GEN8_STOLEN_RESERVED_4M		(2 << 7)
>  #define GEN8_STOLEN_RESERVED_8M		(3 << 7)
> +#define GEN6_STOLEN_RESERVED_ENABLE	(1 << 0)
>  
>  /* VGA stuff */
>  
> @@ -3398,6 +3399,7 @@ enum i915_power_well_id {
>  #define ELK_STOLEN_RESERVED		_MMIO(MCHBAR_MIRROR_BASE
> + 0x48)
>  #define G4X_STOLEN_RESERVED_ADDR1_MASK	(0xFFFF << 16)
>  #define G4X_STOLEN_RESERVED_ADDR2_MASK	(0xFFF << 4)
> +#define G4X_STOLEN_RESERVED_ENABLE	(1 << 0)
>  
>  /* Memory controller frequency in MCHBAR for Haswell (possible SNB+)
> */
>  #define DCLK _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x5e04)
Chris Wilson Nov. 14, 2017, 8:19 p.m. UTC | #9
Quoting Paulo Zanoni (2017-11-14 20:12:41)
> Em Qui, 2017-11-02 às 17:17 +0200, Ville Syrjala escreveu:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > Apparently there are some machines that put semi-sensible looking
> > values
> > into the stolen "reserved" base and size, except those values are
> > actually
> > outside the stolen memory. There is a bit in the register which
> > supposedly could tell us whether the reserved area is even enabled or
> > not. Let's check for that before we go trusting the base and size.
> 
> If this is such a problem since g4x, why didn't we spot it earlier? It
> would be nice if you could explain in the commit message (or at least
> in this email) what are the consequences you're seeing that made you
> realize about this problem. Did something actually explode? I'm
> genuinely curious.

The consequence is that we disable stolen; the machine keeps on working
quite happily. Only fbc actually depends on stolen allocation to
function, and no one complains if fbc is disabled. (There's a sketch out
there that we could use a contiguous allocation for fbc if we run out of
stolen.) Internal hw functions are oblivious to our qualms about the
location of stolen and whether some other device is using the same
physical address for its trampoline.
-Chris
Zanoni, Paulo R Nov. 14, 2017, 8:29 p.m. UTC | #10
Em Ter, 2017-11-14 às 20:19 +0000, Chris Wilson escreveu:
> Quoting Paulo Zanoni (2017-11-14 20:12:41)
> > Em Qui, 2017-11-02 às 17:17 +0200, Ville Syrjala escreveu:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > Apparently there are some machines that put semi-sensible looking
> > > values
> > > into the stolen "reserved" base and size, except those values are
> > > actually
> > > outside the stolen memory. There is a bit in the register which
> > > supposedly could tell us whether the reserved area is even
> > > enabled or
> > > not. Let's check for that before we go trusting the base and
> > > size.
> > 
> > If this is such a problem since g4x, why didn't we spot it earlier?
> > It
> > would be nice if you could explain in the commit message (or at
> > least
> > in this email) what are the consequences you're seeing that made
> > you
> > realize about this problem. Did something actually explode? I'm
> > genuinely curious.
> 
> The consequence is that we disable stolen; the machine keeps on
> working
> quite happily.

Thanks!

>  Only fbc actually depends on stolen allocation to
> function, and no one complains if fbc is disabled. (There's a sketch
> out
> there that we could use a contiguous allocation for fbc if we run out
> of
> stolen.)

ILK_DPFC_CB_BASE (aka FBC_CFB_BASE these days) needs to be programmed
as an offset of the base of stolen memory, so you'll need to allocate
this memory in the region that comes right after stolen, or you'll run
out of bits to write to the register.

Also, things such as the "last 8mb bug" of BDW/SKL suggest that maybe
this wouldn't work.

Unless, of course, you have some plan to work around this.

>  Internal hw functions are oblivious to our qualms about the
> location of stolen and whether some other device is using the same
> physical address for its trampoline.
> -Chris
Ville Syrjälä Nov. 14, 2017, 8:34 p.m. UTC | #11
On Tue, Nov 14, 2017 at 06:12:41PM -0200, Paulo Zanoni wrote:
> Em Qui, 2017-11-02 às 17:17 +0200, Ville Syrjala escreveu:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > Apparently there are some machines that put semi-sensible looking
> > values
> > into the stolen "reserved" base and size, except those values are
> > actually
> > outside the stolen memory. There is a bit in the register which
> > supposedly could tell us whether the reserved area is even enabled or
> > not. Let's check for that before we go trusting the base and size.
> 
> If this is such a problem since g4x, why didn't we spot it earlier? It
> would be nice if you could explain in the commit message (or at least
> in this email) what are the consequences you're seeing that made you
> realize about this problem. Did something actually explode? I'm
> genuinely curious.

CI is getting ping-ping on the SNB shards. Apparently the BIOS leaves
some garbage in the the base+size fields of the register, and sometimes
that results in the reserved area landing inside stolen (at which point
we just reduce the size of stolen and continue), and sometimes it lands
somewhere outside (at which point we just disable stolen entirely).
Hence the FBC tests sometimes find enough stolen, sometimes not.

> 
> 
> Now talking about the patch itself:
> 
> If we're going to assume random bits instead of a full-zero in
> (possibly) uninitialized stuff, shouldn't we first check bit 1, which
> is supposed to tell us if the whole reg is locked or not?

Not sure we can trust the BIOS to always set the lock bit. I suppose it
really should, but I don't recall seeing anything in the docs saying
that not setting the lock bit would somehow disable the reserved area.
So my understanding is that as long as the enable bit is set the
hardware will prevent us from accessing the are, regardless of whether
the settings have been locked down or not.

This wouldn't be the first lock bit some BIOS versions forget to set BTW.

> 
> if ((reg_val & 0x3) != 0x3)
> 	ignore stolen reserved stuff;
> 
> Anyway, this patch without my suggestions is probably better than the
> current situation so:
> Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> but please feel investigate the bit 1 thing and CC me on v2 or a
> possible follow-up patch with conclusions.
> 
> 
> > Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_stolen.c | 30
> > ++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/i915/i915_reg.h        |  2 ++
> >  2 files changed, 32 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c
> > b/drivers/gpu/drm/i915/i915_gem_stolen.c
> > index 03e7abc7e043..44a5dbc8c23b 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_stolen.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
> > @@ -294,6 +294,12 @@ static void g4x_get_stolen_reserved(struct
> > drm_i915_private *dev_priv,
> >  				     ELK_STOLEN_RESERVED);
> >  	dma_addr_t stolen_top = dev_priv->mm.stolen_base + ggtt-
> > >stolen_size;
> >  
> > +	if ((reg_val & G4X_STOLEN_RESERVED_ENABLE) == 0) {
> > +		*base = 0;
> > +		*size = 0;
> > +		return;
> > +	}
> > +
> >  	*base = (reg_val & G4X_STOLEN_RESERVED_ADDR2_MASK) << 16;
> >  
> >  	WARN_ON((reg_val & G4X_STOLEN_RESERVED_ADDR1_MASK) < *base);
> > @@ -313,6 +319,12 @@ static void gen6_get_stolen_reserved(struct
> > drm_i915_private *dev_priv,
> >  {
> >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> >  
> > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > +		*base = 0;
> > +		*size = 0;
> > +		return;
> > +	}
> > +
> >  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> >  
> >  	switch (reg_val & GEN6_STOLEN_RESERVED_SIZE_MASK) {
> > @@ -339,6 +351,12 @@ static void gen7_get_stolen_reserved(struct
> > drm_i915_private *dev_priv,
> >  {
> >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> >  
> > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > +		*base = 0;
> > +		*size = 0;
> > +		return;
> > +	}
> > +
> >  	*base = reg_val & GEN7_STOLEN_RESERVED_ADDR_MASK;
> >  
> >  	switch (reg_val & GEN7_STOLEN_RESERVED_SIZE_MASK) {
> > @@ -359,6 +377,12 @@ static void chv_get_stolen_reserved(struct
> > drm_i915_private *dev_priv,
> >  {
> >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> >  
> > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > +		*base = 0;
> > +		*size = 0;
> > +		return;
> > +	}
> > +
> >  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> >  
> >  	switch (reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK) {
> > @@ -387,6 +411,12 @@ static void bdw_get_stolen_reserved(struct
> > drm_i915_private *dev_priv,
> >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> >  	dma_addr_t stolen_top;
> >  
> > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > +		*base = 0;
> > +		*size = 0;
> > +		return;
> > +	}
> > +
> >  	stolen_top = dev_priv->mm.stolen_base + ggtt->stolen_size;
> >  
> >  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h
> > b/drivers/gpu/drm/i915/i915_reg.h
> > index f0f8f6059652..dc2cbc428d1b 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -382,6 +382,7 @@ static inline bool i915_mmio_reg_valid(i915_reg_t
> > reg)
> >  #define GEN8_STOLEN_RESERVED_2M		(1 << 7)
> >  #define GEN8_STOLEN_RESERVED_4M		(2 << 7)
> >  #define GEN8_STOLEN_RESERVED_8M		(3 << 7)
> > +#define GEN6_STOLEN_RESERVED_ENABLE	(1 << 0)
> >  
> >  /* VGA stuff */
> >  
> > @@ -3398,6 +3399,7 @@ enum i915_power_well_id {
> >  #define ELK_STOLEN_RESERVED		_MMIO(MCHBAR_MIRROR_BASE
> > + 0x48)
> >  #define G4X_STOLEN_RESERVED_ADDR1_MASK	(0xFFFF << 16)
> >  #define G4X_STOLEN_RESERVED_ADDR2_MASK	(0xFFF << 4)
> > +#define G4X_STOLEN_RESERVED_ENABLE	(1 << 0)
> >  
> >  /* Memory controller frequency in MCHBAR for Haswell (possible SNB+)
> > */
> >  #define DCLK _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x5e04)
Chris Wilson Nov. 14, 2017, 8:41 p.m. UTC | #12
Quoting Paulo Zanoni (2017-11-14 20:29:49)
> Em Ter, 2017-11-14 às 20:19 +0000, Chris Wilson escreveu:
> >  Only fbc actually depends on stolen allocation to
> > function, and no one complains if fbc is disabled. (There's a sketch
> > out
> > there that we could use a contiguous allocation for fbc if we run out
> > of
> > stolen.)
> 
> ILK_DPFC_CB_BASE (aka FBC_CFB_BASE these days) needs to be programmed
> as an offset of the base of stolen memory, so you'll need to allocate
> this memory in the region that comes right after stolen, or you'll run
> out of bits to write to the register.
> 
> Also, things such as the "last 8mb bug" of BDW/SKL suggest that maybe
> this wouldn't work.
> 
> Unless, of course, you have some plan to work around this.

Nah, just the stuff I was last looking at (powerctx) were physical.

Being able to disable stolen and recover the memory is still on the
wishlist.
-Chris
Zanoni, Paulo R Nov. 14, 2017, 8:44 p.m. UTC | #13
Em Ter, 2017-11-14 às 22:34 +0200, Ville Syrjälä escreveu:
> On Tue, Nov 14, 2017 at 06:12:41PM -0200, Paulo Zanoni wrote:
> > Em Qui, 2017-11-02 às 17:17 +0200, Ville Syrjala escreveu:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > Apparently there are some machines that put semi-sensible looking
> > > values
> > > into the stolen "reserved" base and size, except those values are
> > > actually
> > > outside the stolen memory. There is a bit in the register which
> > > supposedly could tell us whether the reserved area is even
> > > enabled or
> > > not. Let's check for that before we go trusting the base and
> > > size.
> > 
> > If this is such a problem since g4x, why didn't we spot it earlier?
> > It
> > would be nice if you could explain in the commit message (or at
> > least
> > in this email) what are the consequences you're seeing that made
> > you
> > realize about this problem. Did something actually explode? I'm
> > genuinely curious.
> 
> CI is getting ping-ping on the SNB shards. Apparently the BIOS leaves
> some garbage in the the base+size fields of the register, and
> sometimes
> that results in the reserved area landing inside stolen (at which
> point
> we just reduce the size of stolen and continue), and sometimes it
> lands
> somewhere outside (at which point we just disable stolen entirely).
> Hence the FBC tests sometimes find enough stolen, sometimes not.

Couldn't it be that we're failing to read the "size of stolen"
registers instead of the "stolen reserved stuff" regs? All of the
stolen registers have a history of controversy...


> 
> > 
> > 
> > Now talking about the patch itself:
> > 
> > If we're going to assume random bits instead of a full-zero in
> > (possibly) uninitialized stuff, shouldn't we first check bit 1,
> > which
> > is supposed to tell us if the whole reg is locked or not?
> 
> Not sure we can trust the BIOS to always set the lock bit. I suppose
> it
> really should, but I don't recall seeing anything in the docs saying
> that not setting the lock bit would somehow disable the reserved
> area.

The way things are worded, it suggests that if the lock bit is zero,
everything else could be garbage. Bit 0 is only locked after bit 1 is
locked. It doesn't make sense to me to trust bit 0 without trusting bit
1.

So my understanding is that as long as the enable bit is set the
> hardware will prevent us from accessing the are, regardless of
> whether
> the settings have been locked down or not.
> 
> This wouldn't be the first lock bit some BIOS versions forget to set
> BTW.

But it could be worth checking this now that you have access to the
machines that have the random stuff...


> 
> > 
> > if ((reg_val & 0x3) != 0x3)
> > 	ignore stolen reserved stuff;
> > 
> > Anyway, this patch without my suggestions is probably better than
> > the
> > current situation so:
> > Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
> > but please feel investigate the bit 1 thing and CC me on v2 or a
> > possible follow-up patch with conclusions.
> > 
> > 
> > > Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
> > > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem_stolen.c | 30
> > > ++++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/i915/i915_reg.h        |  2 ++
> > >  2 files changed, 32 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c
> > > b/drivers/gpu/drm/i915/i915_gem_stolen.c
> > > index 03e7abc7e043..44a5dbc8c23b 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_stolen.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
> > > @@ -294,6 +294,12 @@ static void g4x_get_stolen_reserved(struct
> > > drm_i915_private *dev_priv,
> > >  				     ELK_STOLEN_RESERVED);
> > >  	dma_addr_t stolen_top = dev_priv->mm.stolen_base + ggtt-
> > > > stolen_size;
> > > 
> > >  
> > > +	if ((reg_val & G4X_STOLEN_RESERVED_ENABLE) == 0) {
> > > +		*base = 0;
> > > +		*size = 0;
> > > +		return;
> > > +	}
> > > +
> > >  	*base = (reg_val & G4X_STOLEN_RESERVED_ADDR2_MASK) <<
> > > 16;
> > >  
> > >  	WARN_ON((reg_val & G4X_STOLEN_RESERVED_ADDR1_MASK) <
> > > *base);
> > > @@ -313,6 +319,12 @@ static void gen6_get_stolen_reserved(struct
> > > drm_i915_private *dev_priv,
> > >  {
> > >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> > >  
> > > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > > +		*base = 0;
> > > +		*size = 0;
> > > +		return;
> > > +	}
> > > +
> > >  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> > >  
> > >  	switch (reg_val & GEN6_STOLEN_RESERVED_SIZE_MASK) {
> > > @@ -339,6 +351,12 @@ static void gen7_get_stolen_reserved(struct
> > > drm_i915_private *dev_priv,
> > >  {
> > >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> > >  
> > > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > > +		*base = 0;
> > > +		*size = 0;
> > > +		return;
> > > +	}
> > > +
> > >  	*base = reg_val & GEN7_STOLEN_RESERVED_ADDR_MASK;
> > >  
> > >  	switch (reg_val & GEN7_STOLEN_RESERVED_SIZE_MASK) {
> > > @@ -359,6 +377,12 @@ static void chv_get_stolen_reserved(struct
> > > drm_i915_private *dev_priv,
> > >  {
> > >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> > >  
> > > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > > +		*base = 0;
> > > +		*size = 0;
> > > +		return;
> > > +	}
> > > +
> > >  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> > >  
> > >  	switch (reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK) {
> > > @@ -387,6 +411,12 @@ static void bdw_get_stolen_reserved(struct
> > > drm_i915_private *dev_priv,
> > >  	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
> > >  	dma_addr_t stolen_top;
> > >  
> > > +	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
> > > +		*base = 0;
> > > +		*size = 0;
> > > +		return;
> > > +	}
> > > +
> > >  	stolen_top = dev_priv->mm.stolen_base + ggtt-
> > > >stolen_size;
> > >  
> > >  	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
> > > diff --git a/drivers/gpu/drm/i915/i915_reg.h
> > > b/drivers/gpu/drm/i915/i915_reg.h
> > > index f0f8f6059652..dc2cbc428d1b 100644
> > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > @@ -382,6 +382,7 @@ static inline bool
> > > i915_mmio_reg_valid(i915_reg_t
> > > reg)
> > >  #define GEN8_STOLEN_RESERVED_2M		(1 << 7)
> > >  #define GEN8_STOLEN_RESERVED_4M		(2 << 7)
> > >  #define GEN8_STOLEN_RESERVED_8M		(3 << 7)
> > > +#define GEN6_STOLEN_RESERVED_ENABLE	(1 << 0)
> > >  
> > >  /* VGA stuff */
> > >  
> > > @@ -3398,6 +3399,7 @@ enum i915_power_well_id {
> > >  #define ELK_STOLEN_RESERVED		_MMIO(MCHBAR_MIRROR_B
> > > ASE
> > > + 0x48)
> > >  #define G4X_STOLEN_RESERVED_ADDR1_MASK	(0xFFFF << 16)
> > >  #define G4X_STOLEN_RESERVED_ADDR2_MASK	(0xFFF << 4)
> > > +#define G4X_STOLEN_RESERVED_ENABLE	(1 << 0)
> > >  
> > >  /* Memory controller frequency in MCHBAR for Haswell (possible
> > > SNB+)
> > > */
> > >  #define DCLK _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x5e04)
> 
>
Ville Syrjälä Nov. 14, 2017, 9:38 p.m. UTC | #14
On Tue, Nov 14, 2017 at 06:44:51PM -0200, Paulo Zanoni wrote:
> Em Ter, 2017-11-14 às 22:34 +0200, Ville Syrjälä escreveu:
> > On Tue, Nov 14, 2017 at 06:12:41PM -0200, Paulo Zanoni wrote:
> > > Em Qui, 2017-11-02 às 17:17 +0200, Ville Syrjala escreveu:
> > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > 
> > > > Apparently there are some machines that put semi-sensible looking
> > > > values
> > > > into the stolen "reserved" base and size, except those values are
> > > > actually
> > > > outside the stolen memory. There is a bit in the register which
> > > > supposedly could tell us whether the reserved area is even
> > > > enabled or
> > > > not. Let's check for that before we go trusting the base and
> > > > size.
> > > 
> > > If this is such a problem since g4x, why didn't we spot it earlier?
> > > It
> > > would be nice if you could explain in the commit message (or at
> > > least
> > > in this email) what are the consequences you're seeing that made
> > > you
> > > realize about this problem. Did something actually explode? I'm
> > > genuinely curious.
> > 
> > CI is getting ping-ping on the SNB shards. Apparently the BIOS leaves
> > some garbage in the the base+size fields of the register, and
> > sometimes
> > that results in the reserved area landing inside stolen (at which
> > point
> > we just reduce the size of stolen and continue), and sometimes it
> > lands
> > somewhere outside (at which point we just disable stolen entirely).
> > Hence the FBC tests sometimes find enough stolen, sometimes not.
> 
> Couldn't it be that we're failing to read the "size of stolen"
> registers instead of the "stolen reserved stuff" regs? All of the
> stolen registers have a history of controversy...

I think stolen itself looks correct. IIRC it was somewhere above
the 3 GiB mark and neatly inside a e820 reserved region.

The reserved region (at least in the logs I looked at) was somewhere
around the 17 MiB mark, right on top of normal RAM according to e820.
No sane BIOS would put it there.

> 
> 
> > 
> > > 
> > > 
> > > Now talking about the patch itself:
> > > 
> > > If we're going to assume random bits instead of a full-zero in
> > > (possibly) uninitialized stuff, shouldn't we first check bit 1,
> > > which
> > > is supposed to tell us if the whole reg is locked or not?
> > 
> > Not sure we can trust the BIOS to always set the lock bit. I suppose
> > it
> > really should, but I don't recall seeing anything in the docs saying
> > that not setting the lock bit would somehow disable the reserved
> > area.
> 
> The way things are worded, it suggests that if the lock bit is zero,
> everything else could be garbage. Bit 0 is only locked after bit 1 is
> locked. It doesn't make sense to me to trust bit 0 without trusting bit
> 1.

The hardware itself will trust bit 0 without bit 1. I think that's all
that really matters.

> 
> So my understanding is that as long as the enable bit is set the
> > hardware will prevent us from accessing the are, regardless of
> > whether
> > the settings have been locked down or not.
> > 
> > This wouldn't be the first lock bit some BIOS versions forget to set
> > BTW.
> 
> But it could be worth checking this now that you have access to the
> machines that have the random stuff...

Yeah. I guess it might be interesting to see the state of the lock bit
as well. I'll see if I can get someone to grab it for us (/me no longer
has any idea how to access those machines...).
Ville Syrjälä Nov. 15, 2017, 5:11 p.m. UTC | #15
I pushed the series to dinq. Thanks for the review.

I think we can re-evaluate the case for the lock bit if and when someone
reports seeing the DRM_ERROR() in the wild.
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index 03e7abc7e043..44a5dbc8c23b 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -294,6 +294,12 @@  static void g4x_get_stolen_reserved(struct drm_i915_private *dev_priv,
 				     ELK_STOLEN_RESERVED);
 	dma_addr_t stolen_top = dev_priv->mm.stolen_base + ggtt->stolen_size;
 
+	if ((reg_val & G4X_STOLEN_RESERVED_ENABLE) == 0) {
+		*base = 0;
+		*size = 0;
+		return;
+	}
+
 	*base = (reg_val & G4X_STOLEN_RESERVED_ADDR2_MASK) << 16;
 
 	WARN_ON((reg_val & G4X_STOLEN_RESERVED_ADDR1_MASK) < *base);
@@ -313,6 +319,12 @@  static void gen6_get_stolen_reserved(struct drm_i915_private *dev_priv,
 {
 	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
 
+	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
+		*base = 0;
+		*size = 0;
+		return;
+	}
+
 	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
 
 	switch (reg_val & GEN6_STOLEN_RESERVED_SIZE_MASK) {
@@ -339,6 +351,12 @@  static void gen7_get_stolen_reserved(struct drm_i915_private *dev_priv,
 {
 	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
 
+	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
+		*base = 0;
+		*size = 0;
+		return;
+	}
+
 	*base = reg_val & GEN7_STOLEN_RESERVED_ADDR_MASK;
 
 	switch (reg_val & GEN7_STOLEN_RESERVED_SIZE_MASK) {
@@ -359,6 +377,12 @@  static void chv_get_stolen_reserved(struct drm_i915_private *dev_priv,
 {
 	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
 
+	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
+		*base = 0;
+		*size = 0;
+		return;
+	}
+
 	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
 
 	switch (reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK) {
@@ -387,6 +411,12 @@  static void bdw_get_stolen_reserved(struct drm_i915_private *dev_priv,
 	uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
 	dma_addr_t stolen_top;
 
+	if ((reg_val & GEN6_STOLEN_RESERVED_ENABLE) == 0) {
+		*base = 0;
+		*size = 0;
+		return;
+	}
+
 	stolen_top = dev_priv->mm.stolen_base + ggtt->stolen_size;
 
 	*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index f0f8f6059652..dc2cbc428d1b 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -382,6 +382,7 @@  static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define GEN8_STOLEN_RESERVED_2M		(1 << 7)
 #define GEN8_STOLEN_RESERVED_4M		(2 << 7)
 #define GEN8_STOLEN_RESERVED_8M		(3 << 7)
+#define GEN6_STOLEN_RESERVED_ENABLE	(1 << 0)
 
 /* VGA stuff */
 
@@ -3398,6 +3399,7 @@  enum i915_power_well_id {
 #define ELK_STOLEN_RESERVED		_MMIO(MCHBAR_MIRROR_BASE + 0x48)
 #define G4X_STOLEN_RESERVED_ADDR1_MASK	(0xFFFF << 16)
 #define G4X_STOLEN_RESERVED_ADDR2_MASK	(0xFFF << 4)
+#define G4X_STOLEN_RESERVED_ENABLE	(1 << 0)
 
 /* Memory controller frequency in MCHBAR for Haswell (possible SNB+) */
 #define DCLK _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x5e04)