diff mbox series

[RFC,2/2] drm/ttm: downgrade cached to write_combined when snooping not available

Message ID 20240629052247.2653363-3-uwu@icenowy.me (mailing list archive)
State New, archived
Headers show
Series drm/ttm: support device w/o coherency | expand

Commit Message

Icenowy Zheng June 29, 2024, 5:22 a.m. UTC
As we can now acquire the presence of the full DMA coherency (snooping
capability) from ttm_device, we can now map the CPU side memory as
write-combined when cached is requested and snooping is not avilable.

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 4 ++++
 drivers/gpu/drm/ttm/ttm_tt.c      | 4 ++++
 include/drm/ttm/ttm_caching.h     | 3 ++-
 3 files changed, 10 insertions(+), 1 deletion(-)

Comments

Jiaxun Yang June 29, 2024, 7:57 p.m. UTC | #1
在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
[...]
> @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, 
> struct ttm_resource *res,
>  		caching = res->bus.caching;
>  	}
> 
> +	/* Downgrade cached mapping for non-snooping devices */
> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
> +		caching = ttm_write_combined;
Hi Icenowy,

Thanks for your patch! You saved many non-coh PCIe host implementations a day!.

Unfortunately I don't think we can safely ttm_cached to ttm_write_comnined, we've
had enough drama with write combine behaviour on all different platforms.

See drm_arch_can_wc_memory in drm_cache.h.

Thanks 

> +
>  	return ttm_prot_from_caching(caching, tmp);
>  }
>  EXPORT_SYMBOL(ttm_io_prot);
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 7b00ddf0ce49f..3335df45fba5e 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>  			       enum ttm_caching caching,
>  			       unsigned long extra_pages)
>  {
> +	/* Downgrade cached mapping for non-snooping devices */
> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
> +		caching = ttm_write_combined;
> +
>  	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + extra_pages;
>  	ttm->page_flags = page_flags;
>  	ttm->dma_address = NULL;
> diff --git a/include/drm/ttm/ttm_caching.h b/include/drm/ttm/ttm_caching.h
> index a18f43e93abab..f92d7911f50e4 100644
> --- a/include/drm/ttm/ttm_caching.h
> +++ b/include/drm/ttm/ttm_caching.h
> @@ -47,7 +47,8 @@ enum ttm_caching {
> 
>  	/**
>  	 * @ttm_cached: Fully cached like normal system memory, requires that
> -	 * devices snoop the CPU cache on accesses.
> +	 * devices snoop the CPU cache on accesses. Downgraded to
> +	 * ttm_write_combined when the snooping capaiblity is missing.
>  	 */
>  	ttm_cached
>  };
> -- 
> 2.45.2
Icenowy Zheng June 29, 2024, 8:51 p.m. UTC | #2
于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang <jiaxun.yang@flygoat.com> 写道:
>
>
>在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
>[...]
>> @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, 
>> struct ttm_resource *res,
>>  		caching = res->bus.caching;
>>  	}
>> 
>> +	/* Downgrade cached mapping for non-snooping devices */
>> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
>> +		caching = ttm_write_combined;
>Hi Icenowy,
>
>Thanks for your patch! You saved many non-coh PCIe host implementations a day!.
>
>Unfortunately I don't think we can safely ttm_cached to ttm_write_comnined, we've
>had enough drama with write combine behaviour on all different platforms.
>
>See drm_arch_can_wc_memory in drm_cache.h.
>

Yes this really sounds like an issue.

Maybe the behavior of ttm_write_combined should furtherly be decided
by drm_arch_can_wc_memory() in case of quirks?

>Thanks 
>
>> +
>>  	return ttm_prot_from_caching(caching, tmp);
>>  }
>>  EXPORT_SYMBOL(ttm_io_prot);
>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
>> index 7b00ddf0ce49f..3335df45fba5e 100644
>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>>  			       enum ttm_caching caching,
>>  			       unsigned long extra_pages)
>>  {
>> +	/* Downgrade cached mapping for non-snooping devices */
>> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
>> +		caching = ttm_write_combined;
>> +
>>  	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + extra_pages;
>>  	ttm->page_flags = page_flags;
>>  	ttm->dma_address = NULL;
>> diff --git a/include/drm/ttm/ttm_caching.h b/include/drm/ttm/ttm_caching.h
>> index a18f43e93abab..f92d7911f50e4 100644
>> --- a/include/drm/ttm/ttm_caching.h
>> +++ b/include/drm/ttm/ttm_caching.h
>> @@ -47,7 +47,8 @@ enum ttm_caching {
>> 
>>  	/**
>>  	 * @ttm_cached: Fully cached like normal system memory, requires that
>> -	 * devices snoop the CPU cache on accesses.
>> +	 * devices snoop the CPU cache on accesses. Downgraded to
>> +	 * ttm_write_combined when the snooping capaiblity is missing.
>>  	 */
>>  	ttm_cached
>>  };
>> -- 
>> 2.45.2
>
Icenowy Zheng June 30, 2024, 6:52 a.m. UTC | #3
在 2024-06-29星期六的 20:57 +0100,Jiaxun Yang写道:
> 
> 
> 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
> [...]
> > @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object
> > *bo, 
> > struct ttm_resource *res,
> >                 caching = res->bus.caching;
> >         }
> > 
> > +       /* Downgrade cached mapping for non-snooping devices */
> > +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
> > +               caching = ttm_write_combined;
> Hi Icenowy,
> 
> Thanks for your patch! You saved many non-coh PCIe host
> implementations a day!.
> 
> Unfortunately I don't think we can safely ttm_cached to
> ttm_write_comnined, we've
> had enough drama with write combine behaviour on all different
> platforms.

I think on these platforms, ttm_write_combined should be prevented to
be mapped to pgprot_writecombine instead, but downgrade ttm_cached to
ttm_write_combined (and then being treated same as ttm_uncached) is
acceptable.

> 
> See drm_arch_can_wc_memory in drm_cache.h.
> 
> Thanks 
> 
> > +
> >         return ttm_prot_from_caching(caching, tmp);
> >  }
> >  EXPORT_SYMBOL(ttm_io_prot);
> > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
> > b/drivers/gpu/drm/ttm/ttm_tt.c
> > index 7b00ddf0ce49f..3335df45fba5e 100644
> > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct ttm_tt
> > *ttm,
> >                                enum ttm_caching caching,
> >                                unsigned long extra_pages)
> >  {
> > +       /* Downgrade cached mapping for non-snooping devices */
> > +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
> > +               caching = ttm_write_combined;
> > +
> >         ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT)
> > + extra_pages;
> >         ttm->page_flags = page_flags;
> >         ttm->dma_address = NULL;
> > diff --git a/include/drm/ttm/ttm_caching.h
> > b/include/drm/ttm/ttm_caching.h
> > index a18f43e93abab..f92d7911f50e4 100644
> > --- a/include/drm/ttm/ttm_caching.h
> > +++ b/include/drm/ttm/ttm_caching.h
> > @@ -47,7 +47,8 @@ enum ttm_caching {
> > 
> >         /**
> >          * @ttm_cached: Fully cached like normal system memory,
> > requires that
> > -        * devices snoop the CPU cache on accesses.
> > +        * devices snoop the CPU cache on accesses. Downgraded to
> > +        * ttm_write_combined when the snooping capaiblity is
> > missing.
> >          */
> >         ttm_cached
> >  };
> > -- 
> > 2.45.2
>
Christian König July 1, 2024, 11:40 a.m. UTC | #4
Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
>
> 于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang <jiaxun.yang@flygoat.com> 写道:
>>
>> 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
>> [...]
>>> @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo,
>>> struct ttm_resource *res,
>>>   		caching = res->bus.caching;
>>>   	}
>>>
>>> +	/* Downgrade cached mapping for non-snooping devices */
>>> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>> +		caching = ttm_write_combined;
>> Hi Icenowy,
>>
>> Thanks for your patch! You saved many non-coh PCIe host implementations a day!.

Ah, wait a second.

Such a thing as non-coherent PCIe implementation doesn't exist. The PCIe 
specification makes it mandatory for memory access to be cache coherent.

There are a bunch of non-compliant PCIe implementations which have 
broken cache coherency, but those explicitly violate the specification 
and because of that are not supported.

Regards,
Christian.

>>
>> Unfortunately I don't think we can safely ttm_cached to ttm_write_comnined, we've
>> had enough drama with write combine behaviour on all different platforms.
>>
>> See drm_arch_can_wc_memory in drm_cache.h.
>>
> Yes this really sounds like an issue.
>
> Maybe the behavior of ttm_write_combined should furtherly be decided
> by drm_arch_can_wc_memory() in case of quirks?
>
>> Thanks
>>
>>> +
>>>   	return ttm_prot_from_caching(caching, tmp);
>>>   }
>>>   EXPORT_SYMBOL(ttm_io_prot);
>>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
>>> index 7b00ddf0ce49f..3335df45fba5e 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>>>   			       enum ttm_caching caching,
>>>   			       unsigned long extra_pages)
>>>   {
>>> +	/* Downgrade cached mapping for non-snooping devices */
>>> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>> +		caching = ttm_write_combined;
>>> +
>>>   	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + extra_pages;
>>>   	ttm->page_flags = page_flags;
>>>   	ttm->dma_address = NULL;
>>> diff --git a/include/drm/ttm/ttm_caching.h b/include/drm/ttm/ttm_caching.h
>>> index a18f43e93abab..f92d7911f50e4 100644
>>> --- a/include/drm/ttm/ttm_caching.h
>>> +++ b/include/drm/ttm/ttm_caching.h
>>> @@ -47,7 +47,8 @@ enum ttm_caching {
>>>
>>>   	/**
>>>   	 * @ttm_cached: Fully cached like normal system memory, requires that
>>> -	 * devices snoop the CPU cache on accesses.
>>> +	 * devices snoop the CPU cache on accesses. Downgraded to
>>> +	 * ttm_write_combined when the snooping capaiblity is missing.
>>>   	 */
>>>   	ttm_cached
>>>   };
>>> -- 
>>> 2.45.2
Jiaxun Yang July 1, 2024, 11:52 a.m. UTC | #5
在2024年7月1日七月 下午12:40,Christian König写道:
[...]
>
> Ah, wait a second.
>
> Such a thing as non-coherent PCIe implementation doesn't exist. The PCIe 
> specification makes it mandatory for memory access to be cache coherent.

There are some non-PCIe TTM GPU being hit by this pitfall, we have non-coherent
Vivante GPU on some devices.

Handling it at TTM core makes more sense on reducing per-driver effort on dealing
platform issues.

>
> There are a bunch of non-compliant PCIe implementations which have 
> broken cache coherency, but those explicitly violate the specification 
> and because of that are not supported.

I don't really understand, "doesn't exist" and "bunch of" seems to be contradicting
with each other.

>
> Regards,
> Christian.
>
>>>
>>> Unfortunately I don't think we can safely ttm_cached to ttm_write_comnined, we've
>>> had enough drama with write combine behaviour on all different platforms.
>>>
>>> See drm_arch_can_wc_memory in drm_cache.h.
>>>
>> Yes this really sounds like an issue.
>>
>> Maybe the behavior of ttm_write_combined should furtherly be decided
>> by drm_arch_can_wc_memory() in case of quirks?

IMO for DMA mappings, use dma_pgprot at mapping makes more sense :-)

Thanks 
- Jiaxun
>>
>>> Thanks
>>>
>>>> +
>>>>   	return ttm_prot_from_caching(caching, tmp);
>>>>   }
>>>>   EXPORT_SYMBOL(ttm_io_prot);
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
>>>> index 7b00ddf0ce49f..3335df45fba5e 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>>> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>>>>   			       enum ttm_caching caching,
>>>>   			       unsigned long extra_pages)
>>>>   {
>>>> +	/* Downgrade cached mapping for non-snooping devices */
>>>> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>>> +		caching = ttm_write_combined;
>>>> +
>>>>   	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + extra_pages;
>>>>   	ttm->page_flags = page_flags;
>>>>   	ttm->dma_address = NULL;
>>>> diff --git a/include/drm/ttm/ttm_caching.h b/include/drm/ttm/ttm_caching.h
>>>> index a18f43e93abab..f92d7911f50e4 100644
>>>> --- a/include/drm/ttm/ttm_caching.h
>>>> +++ b/include/drm/ttm/ttm_caching.h
>>>> @@ -47,7 +47,8 @@ enum ttm_caching {
>>>>
>>>>   	/**
>>>>   	 * @ttm_cached: Fully cached like normal system memory, requires that
>>>> -	 * devices snoop the CPU cache on accesses.
>>>> +	 * devices snoop the CPU cache on accesses. Downgraded to
>>>> +	 * ttm_write_combined when the snooping capaiblity is missing.
>>>>   	 */
>>>>   	ttm_cached
>>>>   };
>>>> -- 
>>>> 2.45.2
Christian König July 1, 2024, 12:10 p.m. UTC | #6
Am 01.07.24 um 13:52 schrieb Jiaxun Yang:
> 在2024年7月1日七月 下午12:40,Christian König写道:
> [...]
>> Ah, wait a second.
>>
>> Such a thing as non-coherent PCIe implementation doesn't exist. The PCIe
>> specification makes it mandatory for memory access to be cache coherent.
> There are some non-PCIe TTM GPU being hit by this pitfall, we have non-coherent
> Vivante GPU on some devices.

Yeah, but those are perfectly supported.

If you have a non PCIe device which needs uncached or write combined 
system memory allocations you can just specify this at memory allocation 
time.

> Handling it at TTM core makes more sense on reducing per-driver effort on dealing
> platform issues.
>
>> There are a bunch of non-compliant PCIe implementations which have
>> broken cache coherency, but those explicitly violate the specification
>> and because of that are not supported.
> I don't really understand, "doesn't exist" and "bunch of" seems to be contradicting
> with each other.

A compliant non-coherent PCIe implementation doesn't exists, that's made 
mandatory by the PCIe standard.

What does exists are some non compliant non coherent PCIe 
implementations, but as far as I know those are then not supported at 
all by Linux.

We already had tons of problems with platform chips which intentionally 
doesn't correctly implement the PCIe specification because it is cheaper.

At least for AMDs GPU driver we reverted to rejecting patches for 
platform bugs cause by incorrectly wired up PCIe root complexes.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>>> Unfortunately I don't think we can safely ttm_cached to ttm_write_comnined, we've
>>>> had enough drama with write combine behaviour on all different platforms.
>>>>
>>>> See drm_arch_can_wc_memory in drm_cache.h.
>>>>
>>> Yes this really sounds like an issue.
>>>
>>> Maybe the behavior of ttm_write_combined should furtherly be decided
>>> by drm_arch_can_wc_memory() in case of quirks?
> IMO for DMA mappings, use dma_pgprot at mapping makes more sense :-)
>
> Thanks
> - Jiaxun
>>>> Thanks
>>>>
>>>>> +
>>>>>    	return ttm_prot_from_caching(caching, tmp);
>>>>>    }
>>>>>    EXPORT_SYMBOL(ttm_io_prot);
>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> index 7b00ddf0ce49f..3335df45fba5e 100644
>>>>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>>>>>    			       enum ttm_caching caching,
>>>>>    			       unsigned long extra_pages)
>>>>>    {
>>>>> +	/* Downgrade cached mapping for non-snooping devices */
>>>>> +	if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>>>> +		caching = ttm_write_combined;
>>>>> +
>>>>>    	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + extra_pages;
>>>>>    	ttm->page_flags = page_flags;
>>>>>    	ttm->dma_address = NULL;
>>>>> diff --git a/include/drm/ttm/ttm_caching.h b/include/drm/ttm/ttm_caching.h
>>>>> index a18f43e93abab..f92d7911f50e4 100644
>>>>> --- a/include/drm/ttm/ttm_caching.h
>>>>> +++ b/include/drm/ttm/ttm_caching.h
>>>>> @@ -47,7 +47,8 @@ enum ttm_caching {
>>>>>
>>>>>    	/**
>>>>>    	 * @ttm_cached: Fully cached like normal system memory, requires that
>>>>> -	 * devices snoop the CPU cache on accesses.
>>>>> +	 * devices snoop the CPU cache on accesses. Downgraded to
>>>>> +	 * ttm_write_combined when the snooping capaiblity is missing.
>>>>>    	 */
>>>>>    	ttm_cached
>>>>>    };
>>>>> -- 
>>>>> 2.45.2
Icenowy Zheng July 2, 2024, 1:46 a.m. UTC | #7
在 2024-07-01星期一的 13:40 +0200,Christian König写道:
> Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
> > 
> > 于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang
> > <jiaxun.yang@flygoat.com> 写道:
> > > 
> > > 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
> > > [...]
> > > > @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct
> > > > ttm_buffer_object *bo,
> > > > struct ttm_resource *res,
> > > >                 caching = res->bus.caching;
> > > >         }
> > > > 
> > > > +       /* Downgrade cached mapping for non-snooping devices */
> > > > +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
> > > > +               caching = ttm_write_combined;
> > > Hi Icenowy,
> > > 
> > > Thanks for your patch! You saved many non-coh PCIe host
> > > implementations a day!.
> 
> Ah, wait a second.
> 
> Such a thing as non-coherent PCIe implementation doesn't exist. The
> PCIe 
> specification makes it mandatory for memory access to be cache
> coherent.

Really? I tried to get PCIe spec 2.0, PCI spec 3.0 and PCI-X addendum
1.0, none of this explicitly requires the PCIe controller and the CPU
being fully coherent. The PCI-X spec even says "Note that PCI-X, like
conventional PCI, does not require systems to support coherent caches
for addresses accessed by PCI-X requesters".

In addition, in the perspective of Linux, I think bypassing CPU cache
of shared memory is considered as coherent access too, see
dma_alloc_coherent() function's naming.

> 
> There are a bunch of non-compliant PCIe implementations which have 
> broken cache coherency, but those explicitly violate the
> specification 
> and because of that are not supported.

Regardless of it violating the spec or not, these devices work with
Linux subsystems that use dma_alloc_coherent to allocate DMA buffers
(which is the most common case), and GPU drivers just give out cryptic
error messages like "ring gfx test failed" without any mention of
coherency issues at all, which makes the fact that Linux DRM/TTM
subsystem currently requires PCIe snooping CPU cache more obscure.

> 
> Regards,
> Christian.
> 
> > > 
> > > Unfortunately I don't think we can safely ttm_cached to
> > > ttm_write_comnined, we've
> > > had enough drama with write combine behaviour on all different
> > > platforms.
> > > 
> > > See drm_arch_can_wc_memory in drm_cache.h.
> > > 
> > Yes this really sounds like an issue.
> > 
> > Maybe the behavior of ttm_write_combined should furtherly be
> > decided
> > by drm_arch_can_wc_memory() in case of quirks?
> > 
> > > Thanks
> > > 
> > > > +
> > > >         return ttm_prot_from_caching(caching, tmp);
> > > >   }
> > > >   EXPORT_SYMBOL(ttm_io_prot);
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > index 7b00ddf0ce49f..3335df45fba5e 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct
> > > > ttm_tt *ttm,
> > > >                                enum ttm_caching caching,
> > > >                                unsigned long extra_pages)
> > > >   {
> > > > +       /* Downgrade cached mapping for non-snooping devices */
> > > > +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
> > > > +               caching = ttm_write_combined;
> > > > +
> > > >         ttm->num_pages = (PAGE_ALIGN(bo->base.size) >>
> > > > PAGE_SHIFT) + extra_pages;
> > > >         ttm->page_flags = page_flags;
> > > >         ttm->dma_address = NULL;
> > > > diff --git a/include/drm/ttm/ttm_caching.h
> > > > b/include/drm/ttm/ttm_caching.h
> > > > index a18f43e93abab..f92d7911f50e4 100644
> > > > --- a/include/drm/ttm/ttm_caching.h
> > > > +++ b/include/drm/ttm/ttm_caching.h
> > > > @@ -47,7 +47,8 @@ enum ttm_caching {
> > > > 
> > > >         /**
> > > >          * @ttm_cached: Fully cached like normal system memory,
> > > > requires that
> > > > -        * devices snoop the CPU cache on accesses.
> > > > +        * devices snoop the CPU cache on accesses. Downgraded
> > > > to
> > > > +        * ttm_write_combined when the snooping capaiblity is
> > > > missing.
> > > >          */
> > > >         ttm_cached
> > > >   };
> > > > -- 
> > > > 2.45.2
>
Christian König July 2, 2024, 8:08 a.m. UTC | #8
Am 02.07.24 um 03:46 schrieb Icenowy Zheng:
> 在 2024-07-01星期一的 13:40 +0200,Christian König写道:
>> Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
>>> 于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang
>>> <jiaxun.yang@flygoat.com>  写道:
>>>> 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
>>>> [...]
>>>>> @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct
>>>>> ttm_buffer_object *bo,
>>>>> struct ttm_resource *res,
>>>>>                  caching = res->bus.caching;
>>>>>          }
>>>>>
>>>>> +       /* Downgrade cached mapping for non-snooping devices */
>>>>> +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>>>> +               caching = ttm_write_combined;
>>>> Hi Icenowy,
>>>>
>>>> Thanks for your patch! You saved many non-coh PCIe host
>>>> implementations a day!.
>> Ah, wait a second.
>>
>> Such a thing as non-coherent PCIe implementation doesn't exist. The
>> PCIe
>> specification makes it mandatory for memory access to be cache
>> coherent.
> Really? I tried to get PCIe spec 2.0, PCI spec 3.0 and PCI-X addendum
> 1.0, none of this explicitly requires the PCIe controller and the CPU
> being fully coherent. The PCI-X spec even says "Note that PCI-X, like
> conventional PCI, does not require systems to support coherent caches
> for addresses accessed by PCI-X requesters".

See the very first PCI specification, AGP 2.0 and the PCIe extension for 
non-snooped access.

Originally it wasn't well defined what the PCI 1.0 spec meant with 
coherency (e.g. snooping vs uncached).

AGP was the first specification which explicitly defined that all AGP 
memory accesses must be non-snooped and all PCI accesses must snoop the 
CPU caches.

PCIe then had an extension which defined the "No Snooping Attribute" to 
allow emulating the AGP behavior.

For the current PCIe 6.1 specification the non-snoop extension was 
merged into the base specification.

Here see section "2.2.6.5 No Snoop Attribute", e.g. "Hardware enforced 
cache coherency expected"

As well as the notes under section 7.5.3.4 Device Control Register:

Enable No Snoop - If this bit is Set, the Function is permitted to Set 
the No Snoop bit in the Requester
Attributes of transactions it initiates that do not require hardware 
enforced cache coherency (see Section 2.2.6.5 ).

To summarize it: Not snooping caches is an extension, snooping caches is 
mandatory.

> In addition, in the perspective of Linux, I think bypassing CPU cache
> of shared memory is considered as coherent access too, see
> dma_alloc_coherent() function's naming.

Yes that's correct, but this is for platform devices. E.g. other I/O 
from drivers who doesn't need to work with malloced system memory for 
example.

We have quite a bunch of V4L, sound and I also think network devices 
which work like that. But those are non-PCI devices.

>> There are a bunch of non-compliant PCIe implementations which have
>> broken cache coherency, but those explicitly violate the
>> specification
>> and because of that are not supported.
> Regardless of it violating the spec or not, these devices work with
> Linux subsystems that use dma_alloc_coherent to allocate DMA buffers
> (which is the most common case), and GPU drivers just give out cryptic
> error messages like "ring gfx test failed" without any mention of
> coherency issues at all, which makes the fact that Linux DRM/TTM
> subsystem currently requires PCIe snooping CPU cache more obscure.

No, they don't even remotely work. You just got very basic tests working.

Both the Vulkan as well as the OpenGL specification require that you can 
import "normal" malloced() system memory into the GPU driver.

This is not possible without a cache coherent platform architecture. So 
you can't fully support those APIs.

We exercised this quite extensively already and even have a confirmation 
from ARM engineers that the approach of attaching just any PCIe root to 
an ARM IP core is not supported from their side.

And if I'm not completely mistaken the RISC-V specification was also 
updated to disallow stuff like this.

So yes you can have boards which implement non-snooped PCIe, but you get 
exactly zero support from hardware vendors to run software on it.

Regards,
Christian.

>> Regards,
>> Christian.
>>
>>>> Unfortunately I don't think we can safely ttm_cached to
>>>> ttm_write_comnined, we've
>>>> had enough drama with write combine behaviour on all different
>>>> platforms.
>>>>
>>>> See drm_arch_can_wc_memory in drm_cache.h.
>>>>
>>> Yes this really sounds like an issue.
>>>
>>> Maybe the behavior of ttm_write_combined should furtherly be
>>> decided
>>> by drm_arch_can_wc_memory() in case of quirks?
>>>
>>>> Thanks
>>>>
>>>>> +
>>>>>          return ttm_prot_from_caching(caching, tmp);
>>>>>    }
>>>>>    EXPORT_SYMBOL(ttm_io_prot);
>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> index 7b00ddf0ce49f..3335df45fba5e 100644
>>>>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct
>>>>> ttm_tt *ttm,
>>>>>                                 enum ttm_caching caching,
>>>>>                                 unsigned long extra_pages)
>>>>>    {
>>>>> +       /* Downgrade cached mapping for non-snooping devices */
>>>>> +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>>>> +               caching = ttm_write_combined;
>>>>> +
>>>>>          ttm->num_pages = (PAGE_ALIGN(bo->base.size) >>
>>>>> PAGE_SHIFT) + extra_pages;
>>>>>          ttm->page_flags = page_flags;
>>>>>          ttm->dma_address = NULL;
>>>>> diff --git a/include/drm/ttm/ttm_caching.h
>>>>> b/include/drm/ttm/ttm_caching.h
>>>>> index a18f43e93abab..f92d7911f50e4 100644
>>>>> --- a/include/drm/ttm/ttm_caching.h
>>>>> +++ b/include/drm/ttm/ttm_caching.h
>>>>> @@ -47,7 +47,8 @@ enum ttm_caching {
>>>>>
>>>>>          /**
>>>>>           * @ttm_cached: Fully cached like normal system memory,
>>>>> requires that
>>>>> -        * devices snoop the CPU cache on accesses.
>>>>> +        * devices snoop the CPU cache on accesses. Downgraded
>>>>> to
>>>>> +        * ttm_write_combined when the snooping capaiblity is
>>>>> missing.
>>>>>           */
>>>>>          ttm_cached
>>>>>    };
>>>>> -- 
>>>>> 2.45.2
Christian König July 2, 2024, 8:10 a.m. UTC | #9
Am 02.07.24 um 10:08 schrieb Christian König:
> Am 02.07.24 um 03:46 schrieb Icenowy Zheng:
>> 在 2024-07-01星期一的 13:40 +0200,Christian König写道:
>>> Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
>>>> 于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang
>>>> <jiaxun.yang@flygoat.com>  写道:
>>>>> 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
>>>>> [...]
>>>>>> @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct
>>>>>> ttm_buffer_object *bo,
>>>>>> struct ttm_resource *res,
>>>>>>                  caching = res->bus.caching;
>>>>>>          }
>>>>>>
>>>>>> +       /* Downgrade cached mapping for non-snooping devices */
>>>>>> +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>>>>> +               caching = ttm_write_combined;
>>>>> Hi Icenowy,
>>>>>
>>>>> Thanks for your patch! You saved many non-coh PCIe host
>>>>> implementations a day!.
>>> Ah, wait a second.
>>>
>>> Such a thing as non-coherent PCIe implementation doesn't exist. The
>>> PCIe
>>> specification makes it mandatory for memory access to be cache
>>> coherent.
>> Really? I tried to get PCIe spec 2.0, PCI spec 3.0 and PCI-X addendum
>> 1.0, none of this explicitly requires the PCIe controller and the CPU
>> being fully coherent. The PCI-X spec even says "Note that PCI-X, like
>> conventional PCI, does not require systems to support coherent caches
>> for addresses accessed by PCI-X requesters".
>
> See the very first PCI specification, AGP 2.0 and the PCIe extension 
> for non-snooped access.
>
> Originally it wasn't well defined what the PCI 1.0 spec meant with 
> coherency (e.g. snooping vs uncached).
>
> AGP was the first specification which explicitly defined that all AGP 
> memory accesses must be non-snooped and all PCI accesses must snoop 
> the CPU caches.
>
> PCIe then had an extension which defined the "No Snooping Attribute" 
> to allow emulating the AGP behavior.
>
> For the current PCIe 6.1 specification the non-snoop extension was 
> merged into the base specification.
>
> Here see section "2.2.6.5 No Snoop Attribute", e.g. "Hardware enforced 
> cache coherency expected"
>
> As well as the notes under section 7.5.3.4 Device Control Register:
>
> Enable No Snoop - If this bit is Set, the Function is permitted to Set 
> the No Snoop bit in the Requester
> Attributes of transactions it initiates that do not require hardware 
> enforced cache coherency (see Section 2.2.6.5 ).
>
> To summarize it: Not snooping caches is an extension, snooping caches 
> is mandatory.
>
>> In addition, in the perspective of Linux, I think bypassing CPU cache
>> of shared memory is considered as coherent access too, see
>> dma_alloc_coherent() function's naming.
>
> Yes that's correct, but this is for platform devices. E.g. other I/O 
> from drivers who doesn't need to work with malloced system memory for 
> example.
>
> We have quite a bunch of V4L, sound and I also think network devices 
> which work like that. But those are non-PCI devices.
>
>>> There are a bunch of non-compliant PCIe implementations which have
>>> broken cache coherency, but those explicitly violate the
>>> specification
>>> and because of that are not supported.
>> Regardless of it violating the spec or not, these devices work with
>> Linux subsystems that use dma_alloc_coherent to allocate DMA buffers
>> (which is the most common case), and GPU drivers just give out cryptic
>> error messages like "ring gfx test failed" without any mention of
>> coherency issues at all, which makes the fact that Linux DRM/TTM
>> subsystem currently requires PCIe snooping CPU cache more obscure.
>
> No, they don't even remotely work. You just got very basic tests working.
>
> Both the Vulkan as well as the OpenGL specification require that you 
> can import "normal" malloced() system memory into the GPU driver.
>
> This is not possible without a cache coherent platform architecture. 
> So you can't fully support those APIs.
>
> We exercised this quite extensively already and even have a 
> confirmation from ARM engineers that the approach of attaching just 
> any PCIe root to an ARM IP core is not supported from their side.
>
> And if I'm not completely mistaken the RISC-V specification was also 
> updated to disallow stuff like this.

Googling helps: https://github.com/riscv/riscv-platform-specs/issues/83

Regards,
Christian.

>
> So yes you can have boards which implement non-snooped PCIe, but you 
> get exactly zero support from hardware vendors to run software on it.
>
> Regards,
> Christian.
>
>>> Regards,
>>> Christian.
>>>
>>>>> Unfortunately I don't think we can safely ttm_cached to
>>>>> ttm_write_comnined, we've
>>>>> had enough drama with write combine behaviour on all different
>>>>> platforms.
>>>>>
>>>>> See drm_arch_can_wc_memory in drm_cache.h.
>>>>>
>>>> Yes this really sounds like an issue.
>>>>
>>>> Maybe the behavior of ttm_write_combined should furtherly be
>>>> decided
>>>> by drm_arch_can_wc_memory() in case of quirks?
>>>>
>>>>> Thanks
>>>>>
>>>>>> +
>>>>>>          return ttm_prot_from_caching(caching, tmp);
>>>>>>    }
>>>>>>    EXPORT_SYMBOL(ttm_io_prot);
>>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
>>>>>> b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>>> index 7b00ddf0ce49f..3335df45fba5e 100644
>>>>>> --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>>>>> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>>> @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct
>>>>>> ttm_tt *ttm,
>>>>>>                                 enum ttm_caching caching,
>>>>>>                                 unsigned long extra_pages)
>>>>>>    {
>>>>>> +       /* Downgrade cached mapping for non-snooping devices */
>>>>>> +       if (!bo->bdev->dma_coherent && caching == ttm_cached)
>>>>>> +               caching = ttm_write_combined;
>>>>>> +
>>>>>>          ttm->num_pages = (PAGE_ALIGN(bo->base.size) >>
>>>>>> PAGE_SHIFT) + extra_pages;
>>>>>>          ttm->page_flags = page_flags;
>>>>>>          ttm->dma_address = NULL;
>>>>>> diff --git a/include/drm/ttm/ttm_caching.h
>>>>>> b/include/drm/ttm/ttm_caching.h
>>>>>> index a18f43e93abab..f92d7911f50e4 100644
>>>>>> --- a/include/drm/ttm/ttm_caching.h
>>>>>> +++ b/include/drm/ttm/ttm_caching.h
>>>>>> @@ -47,7 +47,8 @@ enum ttm_caching {
>>>>>>
>>>>>>          /**
>>>>>>           * @ttm_cached: Fully cached like normal system memory,
>>>>>> requires that
>>>>>> -        * devices snoop the CPU cache on accesses.
>>>>>> +        * devices snoop the CPU cache on accesses. Downgraded
>>>>>> to
>>>>>> +        * ttm_write_combined when the snooping capaiblity is
>>>>>> missing.
>>>>>>           */
>>>>>>          ttm_cached
>>>>>>    };
>>>>>> -- 
>>>>>> 2.45.2
>
Icenowy Zheng July 2, 2024, 9:06 a.m. UTC | #10
在 2024-07-02星期二的 10:08 +0200,Christian König写道:
> Am 02.07.24 um 03:46 schrieb Icenowy Zheng:
> > 在 2024-07-01星期一的 13:40 +0200,Christian König写道:
> > > Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
> > > > 于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang
> > > > <jiaxun.yang@flygoat.com>  写道:
> > > > > 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
> > > > > [...]
> > > > > > @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct
> > > > > > ttm_buffer_object *bo,
> > > > > > struct ttm_resource *res,
> > > > > >                  caching = res->bus.caching;
> > > > > >          }
> > > > > > 
> > > > > > +       /* Downgrade cached mapping for non-snooping
> > > > > > devices */
> > > > > > +       if (!bo->bdev->dma_coherent && caching ==
> > > > > > ttm_cached)
> > > > > > +               caching = ttm_write_combined;
> > > > > Hi Icenowy,
> > > > > 
> > > > > Thanks for your patch! You saved many non-coh PCIe host
> > > > > implementations a day!.
> > > Ah, wait a second.
> > > 
> > > Such a thing as non-coherent PCIe implementation doesn't exist.
> > > The
> > > PCIe
> > > specification makes it mandatory for memory access to be cache
> > > coherent.
> > Really? I tried to get PCIe spec 2.0, PCI spec 3.0 and PCI-X
> > addendum
> > 1.0, none of this explicitly requires the PCIe controller and the
> > CPU
> > being fully coherent. The PCI-X spec even says "Note that PCI-X,
> > like
> > conventional PCI, does not require systems to support coherent
> > caches
> > for addresses accessed by PCI-X requesters".
> 
> See the very first PCI specification, AGP 2.0 and the PCIe extension
> for 
> non-snooped access.
> 
> Originally it wasn't well defined what the PCI 1.0 spec meant with 
> coherency (e.g. snooping vs uncached).
> 

I think the word in PCI-X addendum could be understood as kind of
clarification, which makes at least PCI-X (and maybe retrospectively
PCI) coherency optional.

> AGP was the first specification which explicitly defined that all AGP
> memory accesses must be non-snooped and all PCI accesses must snoop
> the 
> CPU caches.

However I don't think the definition of the AGP spec could apply on all
PCI(e) implementations. The AGP spec itself don't apply on
implementations that do not implement AGP (which is the most PCI(e)
implementations today), and it's not in the reference list of the PCIe
spec, so it does no help on this context.

> 
> PCIe then had an extension which defined the "No Snooping Attribute"
> to 
> allow emulating the AGP behavior.
> 
> For the current PCIe 6.1 specification the non-snoop extension was 
> merged into the base specification.
> 
> Here see section "2.2.6.5 No Snoop Attribute", e.g. "Hardware
> enforced 
> cache coherency expected"
> 
> As well as the notes under section 7.5.3.4 Device Control Register:
> 
> Enable No Snoop - If this bit is Set, the Function is permitted to
> Set 
> the No Snoop bit in the Requester
> Attributes of transactions it initiates that do not require hardware 
> enforced cache coherency (see Section 2.2.6.5 ).
> 
> To summarize it: Not snooping caches is an extension, snooping caches
> is 
> mandatory.

I don't get any such new PCIe specifications, but in Section 2.2.6.3 of
PCIe specification 2.0, it declares these attribute bits as "hints"
"Level of support is dependent on target applications", and suggests to
refer to PCI-X 2.0 (which is the document that I referenced the
sentence originally from).

This makes it reasonable that No-Snoop bit is only meaningful in a
coherent implementation, and it becomes just no-op for non-coherent
implementations. If PCIe specification really requires coherent access
by default, this should be mentioned in other parts of the spec,
instead of only get referred by the No-Snoop bit definition.

> 
> > In addition, in the perspective of Linux, I think bypassing CPU
> > cache
> > of shared memory is considered as coherent access too, see
> > dma_alloc_coherent() function's naming.
> 
> Yes that's correct, but this is for platform devices. E.g. other I/O 
> from drivers who doesn't need to work with malloced system memory for
> example.
> 
> We have quite a bunch of V4L, sound and I also think network devices 
> which work like that. But those are non-PCI devices.

I think in the Linux kernel most drivers (of course including PCI ones)
use DMA buffer in this way, which makes them natually compatible with
non-coherent PCIe implementations. TTM is one of the few exceptions
here.

> 
> > > There are a bunch of non-compliant PCIe implementations which
> > > have
> > > broken cache coherency, but those explicitly violate the
> > > specification
> > > and because of that are not supported.
> > Regardless of it violating the spec or not, these devices work with
> > Linux subsystems that use dma_alloc_coherent to allocate DMA
> > buffers
> > (which is the most common case), and GPU drivers just give out
> > cryptic
> > error messages like "ring gfx test failed" without any mention of
> > coherency issues at all, which makes the fact that Linux DRM/TTM
> > subsystem currently requires PCIe snooping CPU cache more obscure.
> 
> No, they don't even remotely work. You just got very basic tests
> working.
> 
> Both the Vulkan as well as the OpenGL specification require that you
> can 
> import "normal" malloced() system memory into the GPU driver.

I am not familiar with Vulkan, however in the OpenGL context I don't
think just importing a memory buffer into the GPU is supported. Most
buffers of OpenGL is declared and allocated on the GPU part, and have
data copied to by either using specific GL functions (e.g. glTexImage*)
or have the GPU side buffer mapped to CPU (with functions like
glMapBuffer). In this case the memory is surely not "normal malloc()ed
memory", but memory originated from the GPU driver.

To map arbitary CPU-allocated buffers, not only the GPU should be able
to fully snoop the CPU cache, but the GPU MMU should have a equal or
smaller page size than the CPU MMU (although in real world cases the
GPU page size is usually 4K and the CPU page size is usually >=4K,
which makes this not a big problem, but still not well considered).

In fact, in my daily jobs, I met the situation that the library of some
peripheral (a HW decoder) just passed an arbitary user-space pointer to
the application without any hardware address info (its internal
implementation seems to use dma-buf, but I get no interface for
acquiring the dma-buf information), and I failed to do a zero-copy
optimization of this just because I am not able to import this user-
space pointer to the GPU driver because of no known API importing this
arbitary pointer without further device memory info.

> 
> This is not possible without a cache coherent platform architecture.
> So 
> you can't fully support those APIs.

In this case I think the coherency status could be exported to the
userspace, and the userspace side driver could disable certain APIs (I
don't think anything in the core GL 2.1 spec should be affected).

In addition, in the case of AMD GPUs, disabling KFD for non-coherent
devices sounds viable.

> 
> We exercised this quite extensively already and even have a
> confirmation 
> from ARM engineers that the approach of attaching just any PCIe root
> to 
> an ARM IP core is not supported from their side.

This kind of problem should be considered as a kind of platform
implementation, not getting support from just the CPU IP core vendor
isn't a surprise; this do not prevent ARM SoC vendors from implementing
PCIe, from low-end ones such as Allwinner to high-end ones such as you
AMD.

> And if I'm not completely mistaken the RISC-V specification was also 
> updated to disallow stuff like this.
> 
> So yes you can have boards which implement non-snooped PCIe, but you
> get 
> exactly zero support from hardware vendors to run software on it.
> 

It's a quite usual case for free softwares to get no support from
hardware vendors, and some of them are even developed by reverse
engineering. I don't think it as a blocker for the Linux kernel to
merge as many hardwares' support as possible.

> Regards,
> Christian.
> 
> > > Regards,
> > > Christian.
> > > 
> > > > > Unfortunately I don't think we can safely ttm_cached to
> > > > > ttm_write_comnined, we've
> > > > > had enough drama with write combine behaviour on all
> > > > > different
> > > > > platforms.
> > > > > 
> > > > > See drm_arch_can_wc_memory in drm_cache.h.
> > > > > 
> > > > Yes this really sounds like an issue.
> > > > 
> > > > Maybe the behavior of ttm_write_combined should furtherly be
> > > > decided
> > > > by drm_arch_can_wc_memory() in case of quirks?
> > > > 
> > > > > Thanks
> > > > > 
> > > > > > +
> > > > > >          return ttm_prot_from_caching(caching, tmp);
> > > > > >    }
> > > > > >    EXPORT_SYMBOL(ttm_io_prot);
> > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > index 7b00ddf0ce49f..3335df45fba5e 100644
> > > > > > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct
> > > > > > ttm_tt *ttm,
> > > > > >                                 enum ttm_caching caching,
> > > > > >                                 unsigned long extra_pages)
> > > > > >    {
> > > > > > +       /* Downgrade cached mapping for non-snooping
> > > > > > devices */
> > > > > > +       if (!bo->bdev->dma_coherent && caching ==
> > > > > > ttm_cached)
> > > > > > +               caching = ttm_write_combined;
> > > > > > +
> > > > > >          ttm->num_pages = (PAGE_ALIGN(bo->base.size) >>
> > > > > > PAGE_SHIFT) + extra_pages;
> > > > > >          ttm->page_flags = page_flags;
> > > > > >          ttm->dma_address = NULL;
> > > > > > diff --git a/include/drm/ttm/ttm_caching.h
> > > > > > b/include/drm/ttm/ttm_caching.h
> > > > > > index a18f43e93abab..f92d7911f50e4 100644
> > > > > > --- a/include/drm/ttm/ttm_caching.h
> > > > > > +++ b/include/drm/ttm/ttm_caching.h
> > > > > > @@ -47,7 +47,8 @@ enum ttm_caching {
> > > > > > 
> > > > > >          /**
> > > > > >           * @ttm_cached: Fully cached like normal system
> > > > > > memory,
> > > > > > requires that
> > > > > > -        * devices snoop the CPU cache on accesses.
> > > > > > +        * devices snoop the CPU cache on accesses.
> > > > > > Downgraded
> > > > > > to
> > > > > > +        * ttm_write_combined when the snooping capaiblity
> > > > > > is
> > > > > > missing.
> > > > > >           */
> > > > > >          ttm_cached
> > > > > >    };
> > > > > > -- 
> > > > > > 2.45.2
Christian König July 2, 2024, 9:27 a.m. UTC | #11
Am 02.07.24 um 11:06 schrieb Icenowy Zheng:
> [SNIP]
> However I don't think the definition of the AGP spec could apply on all
> PCI(e) implementations. The AGP spec itself don't apply on
> implementations that do not implement AGP (which is the most PCI(e)
> implementations today), and it's not in the reference list of the PCIe
> spec, so it does no help on this context.

No, exactly that is not correct.

See as I explained the No-Snoop extension to PCIe was created to help 
with AGP support and later merged into the base PCIe specification.

So the AGP spec is now part of the PCIe spec.
[SNIP]
>> We have quite a bunch of V4L, sound and I also think network devices
>> which work like that. But those are non-PCI devices.
> I think in the Linux kernel most drivers (of course including PCI ones)
> use DMA buffer in this way, which makes them natually compatible with
> non-coherent PCIe implementations. TTM is one of the few exceptions
> here.

Yes and that is absolutely intentional.

See we don't want to support any non-coherent PCIe implementation.

[SNIP]
>> And if I'm not completely mistaken the RISC-V specification was also
>> updated to disallow stuff like this.
>>
>> So yes you can have boards which implement non-snooped PCIe, but you
>> get
>> exactly zero support from hardware vendors to run software on it.
>>
> It's a quite usual case for free softwares to get no support from
> hardware vendors, and some of them are even developed by reverse
> engineering. I don't think it as a blocker for the Linux kernel to
> merge as many hardwares' support as possible.

We seem to have a misunderstanding here, this is not a software issue. 
The hardware platform is considered broken by the hardware vendor!

In other words people have stitched together hardware in a way which is 
not supported by the creator of that hardware.

So as long as you can't convince anybody from ARM or the RISC-V team or 
whoever created that hardware to confirm that the hardware actually 
works you won't get any support for that.

Regards,
Christian.
Icenowy Zheng July 2, 2024, 9:36 a.m. UTC | #12
在 2024-07-02星期二的 10:10 +0200,Christian König写道:
> Am 02.07.24 um 10:08 schrieb Christian König:
> > Am 02.07.24 um 03:46 schrieb Icenowy Zheng:
> > > 在 2024-07-01星期一的 13:40 +0200,Christian König写道:
> > > > Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
> > > > > 于 2024年6月30日 GMT+08:00 03:57:47,Jiaxun Yang
> > > > > <jiaxun.yang@flygoat.com>  写道:
> > > > > > 在2024年6月29日六月 上午6:22,Icenowy Zheng写道:
> > > > > > [...]
> > > > > > > @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct
> > > > > > > ttm_buffer_object *bo,
> > > > > > > struct ttm_resource *res,
> > > > > > >                  caching = res->bus.caching;
> > > > > > >          }
> > > > > > > 
> > > > > > > +       /* Downgrade cached mapping for non-snooping
> > > > > > > devices */
> > > > > > > +       if (!bo->bdev->dma_coherent && caching ==
> > > > > > > ttm_cached)
> > > > > > > +               caching = ttm_write_combined;
> > > > > > Hi Icenowy,
> > > > > > 
> > > > > > Thanks for your patch! You saved many non-coh PCIe host
> > > > > > implementations a day!.
> > > > Ah, wait a second.
> > > > 
> > > > Such a thing as non-coherent PCIe implementation doesn't exist.
> > > > The
> > > > PCIe
> > > > specification makes it mandatory for memory access to be cache
> > > > coherent.
> > > Really? I tried to get PCIe spec 2.0, PCI spec 3.0 and PCI-X
> > > addendum
> > > 1.0, none of this explicitly requires the PCIe controller and the
> > > CPU
> > > being fully coherent. The PCI-X spec even says "Note that PCI-X,
> > > like
> > > conventional PCI, does not require systems to support coherent
> > > caches
> > > for addresses accessed by PCI-X requesters".
> > 
> > See the very first PCI specification, AGP 2.0 and the PCIe
> > extension 
> > for non-snooped access.
> > 
> > Originally it wasn't well defined what the PCI 1.0 spec meant with 
> > coherency (e.g. snooping vs uncached).
> > 
> > AGP was the first specification which explicitly defined that all
> > AGP 
> > memory accesses must be non-snooped and all PCI accesses must snoop
> > the CPU caches.
> > 
> > PCIe then had an extension which defined the "No Snooping
> > Attribute" 
> > to allow emulating the AGP behavior.
> > 
> > For the current PCIe 6.1 specification the non-snoop extension was 
> > merged into the base specification.
> > 
> > Here see section "2.2.6.5 No Snoop Attribute", e.g. "Hardware
> > enforced 
> > cache coherency expected"
> > 
> > As well as the notes under section 7.5.3.4 Device Control Register:
> > 
> > Enable No Snoop - If this bit is Set, the Function is permitted to
> > Set 
> > the No Snoop bit in the Requester
> > Attributes of transactions it initiates that do not require
> > hardware 
> > enforced cache coherency (see Section 2.2.6.5 ).
> > 
> > To summarize it: Not snooping caches is an extension, snooping
> > caches 
> > is mandatory.
> > 
> > > In addition, in the perspective of Linux, I think bypassing CPU
> > > cache
> > > of shared memory is considered as coherent access too, see
> > > dma_alloc_coherent() function's naming.
> > 
> > Yes that's correct, but this is for platform devices. E.g. other
> > I/O 
> > from drivers who doesn't need to work with malloced system memory
> > for 
> > example.
> > 
> > We have quite a bunch of V4L, sound and I also think network
> > devices 
> > which work like that. But those are non-PCI devices.
> > 
> > > > There are a bunch of non-compliant PCIe implementations which
> > > > have
> > > > broken cache coherency, but those explicitly violate the
> > > > specification
> > > > and because of that are not supported.
> > > Regardless of it violating the spec or not, these devices work
> > > with
> > > Linux subsystems that use dma_alloc_coherent to allocate DMA
> > > buffers
> > > (which is the most common case), and GPU drivers just give out
> > > cryptic
> > > error messages like "ring gfx test failed" without any mention of
> > > coherency issues at all, which makes the fact that Linux DRM/TTM
> > > subsystem currently requires PCIe snooping CPU cache more
> > > obscure.
> > 
> > No, they don't even remotely work. You just got very basic tests
> > working.
> > 
> > Both the Vulkan as well as the OpenGL specification require that
> > you 
> > can import "normal" malloced() system memory into the GPU driver.
> > 
> > This is not possible without a cache coherent platform
> > architecture. 
> > So you can't fully support those APIs.
> > 
> > We exercised this quite extensively already and even have a 
> > confirmation from ARM engineers that the approach of attaching just
> > any PCIe root to an ARM IP core is not supported from their side.
> > 
> > And if I'm not completely mistaken the RISC-V specification was
> > also 
> > updated to disallow stuff like this.
> 
> Googling helps:
> https://github.com/riscv/riscv-platform-specs/issues/83

riscv-platform-specs is a deprecated repository, and the specs are
never ratified by RVI. According to its content, it should be moved to
riscv-non-isa/, but the move didn't even happen. The successor of this
specific part, which is reduced to "server" context, is placed at
riscv-non-isa/server-soc, in a more specific form; however server-soc
is also not ratified yet either, and its CCS_040 item uses "SHOULD"
instead of "MUST" on "hardware enforced cache coherency".

> 
> Regards,
> Christian.
> 
> > 
> > So yes you can have boards which implement non-snooped PCIe, but
> > you 
> > get exactly zero support from hardware vendors to run software on
> > it.
> > 
> > Regards,
> > Christian.
> > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > > Unfortunately I don't think we can safely ttm_cached to
> > > > > > ttm_write_comnined, we've
> > > > > > had enough drama with write combine behaviour on all
> > > > > > different
> > > > > > platforms.
> > > > > > 
> > > > > > See drm_arch_can_wc_memory in drm_cache.h.
> > > > > > 
> > > > > Yes this really sounds like an issue.
> > > > > 
> > > > > Maybe the behavior of ttm_write_combined should furtherly be
> > > > > decided
> > > > > by drm_arch_can_wc_memory() in case of quirks?
> > > > > 
> > > > > > Thanks
> > > > > > 
> > > > > > > +
> > > > > > >          return ttm_prot_from_caching(caching, tmp);
> > > > > > >    }
> > > > > > >    EXPORT_SYMBOL(ttm_io_prot);
> > > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > > b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > > index 7b00ddf0ce49f..3335df45fba5e 100644
> > > > > > > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > > @@ -152,6 +152,10 @@ static void
> > > > > > > ttm_tt_init_fields(struct
> > > > > > > ttm_tt *ttm,
> > > > > > >                                 enum ttm_caching caching,
> > > > > > >                                 unsigned long
> > > > > > > extra_pages)
> > > > > > >    {
> > > > > > > +       /* Downgrade cached mapping for non-snooping
> > > > > > > devices */
> > > > > > > +       if (!bo->bdev->dma_coherent && caching ==
> > > > > > > ttm_cached)
> > > > > > > +               caching = ttm_write_combined;
> > > > > > > +
> > > > > > >          ttm->num_pages = (PAGE_ALIGN(bo->base.size) >>
> > > > > > > PAGE_SHIFT) + extra_pages;
> > > > > > >          ttm->page_flags = page_flags;
> > > > > > >          ttm->dma_address = NULL;
> > > > > > > diff --git a/include/drm/ttm/ttm_caching.h
> > > > > > > b/include/drm/ttm/ttm_caching.h
> > > > > > > index a18f43e93abab..f92d7911f50e4 100644
> > > > > > > --- a/include/drm/ttm/ttm_caching.h
> > > > > > > +++ b/include/drm/ttm/ttm_caching.h
> > > > > > > @@ -47,7 +47,8 @@ enum ttm_caching {
> > > > > > > 
> > > > > > >          /**
> > > > > > >           * @ttm_cached: Fully cached like normal system
> > > > > > > memory,
> > > > > > > requires that
> > > > > > > -        * devices snoop the CPU cache on accesses.
> > > > > > > +        * devices snoop the CPU cache on accesses.
> > > > > > > Downgraded
> > > > > > > to
> > > > > > > +        * ttm_write_combined when the snooping
> > > > > > > capaiblity is
> > > > > > > missing.
> > > > > > >           */
> > > > > > >          ttm_cached
> > > > > > >    };
> > > > > > > -- 
> > > > > > > 2.45.2
> >
Icenowy Zheng July 2, 2024, 10:01 a.m. UTC | #13
在 2024-07-02星期二的 11:27 +0200,Christian König写道:
> Am 02.07.24 um 11:06 schrieb Icenowy Zheng:
> > [SNIP]
> > However I don't think the definition of the AGP spec could apply on
> > all
> > PCI(e) implementations. The AGP spec itself don't apply on
> > implementations that do not implement AGP (which is the most PCI(e)
> > implementations today), and it's not in the reference list of the
> > PCIe
> > spec, so it does no help on this context.
> 
> No, exactly that is not correct.
> 
> See as I explained the No-Snoop extension to PCIe was created to help
> with AGP support and later merged into the base PCIe specification.
> 
> So the AGP spec is now part of the PCIe spec.
> [SNIP]

I don't think AGP spec is part of the PCIe spec, at least not part of
the PCIe spec I was reading. It does not contain the word "AGP" at all,
and despite it has a "Reference Documents" chapter, this chapter didn't
include the AGP spec, unlike PCI spec / PCI-X addendum which are listed
there.

In addition, your logic that No-Snoop is related to AGP only apply when
discussing about the history of PCIe, not when inspecting it from a
synchronic view, which is what new implementers of a spec should do.

> > > We have quite a bunch of V4L, sound and I also think network
> > > devices
> > > which work like that. But those are non-PCI devices.
> > I think in the Linux kernel most drivers (of course including PCI
> > ones)
> > use DMA buffer in this way, which makes them natually compatible
> > with
> > non-coherent PCIe implementations. TTM is one of the few exceptions
> > here.
> 
> Yes and that is absolutely intentional.
> 
> See we don't want to support any non-coherent PCIe implementation.
> 

However, considering users exist for this setup, I here seriously hope
the support to be gained since today.

> [SNIP]
> > > And if I'm not completely mistaken the RISC-V specification was
> > > also
> > > updated to disallow stuff like this.
> > > 
> > > So yes you can have boards which implement non-snooped PCIe, but
> > > you
> > > get
> > > exactly zero support from hardware vendors to run software on it.
> > > 
> > It's a quite usual case for free softwares to get no support from
> > hardware vendors, and some of them are even developed by reverse
> > engineering. I don't think it as a blocker for the Linux kernel to
> > merge as many hardwares' support as possible.
> 
> We seem to have a misunderstanding here, this is not a software
> issue. 
> The hardware platform is considered broken by the hardware vendor!

I don't think in this case Arm Ltd / RISC-V International is considered
the vendor -- the SoC vendor is. You can rarely get support directly
from the CPU ISA vendor (or CPU IP vendor, which differs mostly in the
RISC-V situation) in the case a SoC or a final product with some SoC is
purchased, and the SoC/product's vendor would rarely declare their
hardware platform as broken.

> 
> In other words people have stitched together hardware in a way which
> is 
> not supported by the creator of that hardware.
> 
> So as long as you can't convince anybody from ARM or the RISC-V team
> or 
> whoever created that hardware to confirm that the hardware actually 
> works you won't get any support for that.

The world isn't black-or-white, and the thing contradictory to
"confirmed working" is "not confirmed working" instead of "confirmed
not working". To get the thing really working (or prove it doesn't work
at all by practice) is part of the traditional fun of a hacker.

> 
> Regards,
> Christian.
Jiaxun Yang July 2, 2024, 10:03 a.m. UTC | #14
在2024年7月2日七月 下午5:27,Christian König写道:
> Am 02.07.24 um 11:06 schrieb Icenowy Zheng:
>> [SNIP] However I don't think the definition of the AGP spec could apply on all
>> PCI(e) implementations. The AGP spec itself don't apply on
>> implementations that do not implement AGP (which is the most PCI(e)
>> implementations today), and it's not in the reference list of the PCIe
>> spec, so it does no help on this context. 
> No, exactly that is not correct.
>
> See as I explained the No-Snoop extension to PCIe was created to help 
> with AGP support and later merged into the base PCIe specification.
>
> So the AGP spec is now part of the PCIe spec.

We don't really buy this theory.

Keyword "AGP" doesn't appear in "PCI Express Base 4.0 Base Specification" even
once.

If PCIe is a predecessor of AGP, where does AGP specific software interface like
 AGP aperture goes? PCIe GPUs are only borrowing software concepts from AGP,
but they didn't inherit any hardware properties.

[...]
> We seem to have a misunderstanding here, this is not a software issue. 
> The hardware platform is considered broken by the hardware vendor!

It's up to the specification text to define compliance means. So far as per analysis
from Icenowy of PCIe specification text itself it's not prohibited.

>
> In other words people have stitched together hardware in a way which is 
> not supported by the creator of that hardware.
>
> So as long as you can't convince anybody from ARM or the RISC-V team or 
> whoever created that hardware to confirm that the hardware actually 
> works you won't get any support for that.

Well we are trying to support them on our own in mainline, we are not asking
for any support.

Thanks
- Jiaxun
>
> Regards,
> Christian.
Jiaxun Yang July 3, 2024, 8:52 a.m. UTC | #15
在2024年7月2日七月 下午6:03,Jiaxun Yang写道:
> 在2024年7月2日七月 下午5:27,Christian König写道:
>> Am 02.07.24 um 11:06 schrieb Icenowy Zheng:
>>> [SNIP] However I don't think the definition of the AGP spec could apply on all
>>> PCI(e) implementations. The AGP spec itself don't apply on
>>> implementations that do not implement AGP (which is the most PCI(e)
>>> implementations today), and it's not in the reference list of the PCIe
>>> spec, so it does no help on this context. 
>> No, exactly that is not correct.
>>
>> See as I explained the No-Snoop extension to PCIe was created to help 
>> with AGP support and later merged into the base PCIe specification.
>>
>> So the AGP spec is now part of the PCIe spec.

Hi Bjorn & linux-pci folks,

It seems like we have some disputes on interpretation pf PCIe specification.

We are seeking your expertise on the question: Does PCIe specification mandate Cache
coherency via snoop?

There are some further context in this thread [1].

[1]:  https://lore.kernel.org/all/0db974d40cd8c5dcc723d43c328bac923e0fe33a.camel@icenowy.me/
Thanks
- Jiaxun

>
> We don't really buy this theory.
>
> Keyword "AGP" doesn't appear in "PCI Express Base 4.0 Base Specification" even
> once.
>
> If PCIe is a predecessor of AGP, where does AGP specific software interface like
>  AGP aperture goes? PCIe GPUs are only borrowing software concepts from AGP,
> but they didn't inherit any hardware properties.
>
> [...]
>> We seem to have a misunderstanding here, this is not a software issue. 
>> The hardware platform is considered broken by the hardware vendor!
>
> It's up to the specification text to define compliance means. So far as 
> per analysis
> from Icenowy of PCIe specification text itself it's not prohibited.
>
>>
>> In other words people have stitched together hardware in a way which is 
>> not supported by the creator of that hardware.
>>
>> So as long as you can't convince anybody from ARM or the RISC-V team or 
>> whoever created that hardware to confirm that the hardware actually 
>> works you won't get any support for that.
>
> Well we are trying to support them on our own in mainline, we are not asking
> for any support.
>
> Thanks
> - Jiaxun
>>
>> Regards,
>> Christian.
>
> -- 
> - Jiaxun
Bjorn Helgaas July 3, 2024, 9:08 p.m. UTC | #16
On Wed, Jul 03, 2024 at 04:52:30PM +0800, Jiaxun Yang wrote:
> 在2024年7月2日七月 下午6:03,Jiaxun Yang写道:
> > 在2024年7月2日七月 下午5:27,Christian König写道:
> >> Am 02.07.24 um 11:06 schrieb Icenowy Zheng:
> >>> [SNIP] However I don't think the definition of the AGP spec could apply on all
> >>> PCI(e) implementations. The AGP spec itself don't apply on
> >>> implementations that do not implement AGP (which is the most PCI(e)
> >>> implementations today), and it's not in the reference list of the PCIe
> >>> spec, so it does no help on this context. 
> >> No, exactly that is not correct.
> >>
> >> See as I explained the No-Snoop extension to PCIe was created to help 
> >> with AGP support and later merged into the base PCIe specification.
> >>
> >> So the AGP spec is now part of the PCIe spec.
> 
> Hi Bjorn & linux-pci folks,
> 
> It seems like we have some disputes on interpretation pf PCIe
> specification.
> 
> We are seeking your expertise on the question: Does PCIe
> specification mandate Cache coherency via snoop?

I'm not qualified to opine on this.  I'd say it's a question for the
PCI SIG protocol workgroup.  https://forum.pcisig.com/ is a place to
start.

Bjorn
Icenowy Zheng July 4, 2024, 2 a.m. UTC | #17
在 2024-07-03星期三的 16:08 -0500,Bjorn Helgaas写道:
> On Wed, Jul 03, 2024 at 04:52:30PM +0800, Jiaxun Yang wrote:
> > 在2024年7月2日七月 下午6:03,Jiaxun Yang写道:
> > > 在2024年7月2日七月 下午5:27,Christian König写道:
> > > > Am 02.07.24 um 11:06 schrieb Icenowy Zheng:
> > > > > [SNIP] However I don't think the definition of the AGP spec
> > > > > could apply on all
> > > > > PCI(e) implementations. The AGP spec itself don't apply on
> > > > > implementations that do not implement AGP (which is the most
> > > > > PCI(e)
> > > > > implementations today), and it's not in the reference list of
> > > > > the PCIe
> > > > > spec, so it does no help on this context. 
> > > > No, exactly that is not correct.
> > > > 
> > > > See as I explained the No-Snoop extension to PCIe was created
> > > > to help 
> > > > with AGP support and later merged into the base PCIe
> > > > specification.
> > > > 
> > > > So the AGP spec is now part of the PCIe spec.
> > 
> > Hi Bjorn & linux-pci folks,
> > 
> > It seems like we have some disputes on interpretation pf PCIe
> > specification.
> > 
> > We are seeking your expertise on the question: Does PCIe
> > specification mandate Cache coherency via snoop?
> 
> I'm not qualified to opine on this.  I'd say it's a question for the
> PCI SIG protocol workgroup.  https://forum.pcisig.com/ is a place to
> start.

Sorry for the disturbance.

As individual hacker, I am not eligble of being a PCI-SIG member and
join the discussion there.

So I here want to ask a question as an individual hacker: what's the
policy of linux-pci towards these non-coherent PCIe implementations?

If the sentences of Christian is right, these implementations are just
out-of-spec, should them get purged out of the kernel, or at least
raising a warning that some HW won't work because of inconformant
implementation?

> 
> Bjorn
Icenowy Zheng July 4, 2024, 6:40 a.m. UTC | #18
在 2024-07-03星期三的 23:11 -0700,Christoph Hellwig写道:
> On Thu, Jul 04, 2024 at 10:00:52AM +0800, Icenowy Zheng wrote:
> > So I here want to ask a question as an individual hacker: what's
> > the
> > policy of linux-pci towards these non-coherent PCIe
> > implementations?
> > 
> > If the sentences of Christian is right, these implementations are
> > just
> > out-of-spec, should them get purged out of the kernel, or at least
> > raising a warning that some HW won't work because of inconformant
> > implementation?
> 
> Nothing in the PCIe specifications that mandates a programming model.
> Non-coherent DMA is extremely common in lower end devices, and
> despite
> all the issues that it causes well supported in Linux.
> 
> What are you trying to solve?

Currently the DRM TTM subsystem (and GPU drivers using it) will assume
coherency and fail on these non-coherent systems with cryptic error
messages (like `[drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx
test failed (-110)`) without mentioning coherency issues at all.

My original patchset tries to solve this problem by make the TTM
subsystem sensible of coherency status (and prevent CPU-side cached
mapping when non-coherent), but got argued by TTM maintainer and the
maintainer says TTM's ignorance on non-coherent systems is intentional.

>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 0b3f4267130c4..6519ce047787d 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -302,6 +302,10 @@  pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res,
 		caching = res->bus.caching;
 	}
 
+	/* Downgrade cached mapping for non-snooping devices */
+	if (!bo->bdev->dma_coherent && caching == ttm_cached)
+		caching = ttm_write_combined;
+
 	return ttm_prot_from_caching(caching, tmp);
 }
 EXPORT_SYMBOL(ttm_io_prot);
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 7b00ddf0ce49f..3335df45fba5e 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -152,6 +152,10 @@  static void ttm_tt_init_fields(struct ttm_tt *ttm,
 			       enum ttm_caching caching,
 			       unsigned long extra_pages)
 {
+	/* Downgrade cached mapping for non-snooping devices */
+	if (!bo->bdev->dma_coherent && caching == ttm_cached)
+		caching = ttm_write_combined;
+
 	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + extra_pages;
 	ttm->page_flags = page_flags;
 	ttm->dma_address = NULL;
diff --git a/include/drm/ttm/ttm_caching.h b/include/drm/ttm/ttm_caching.h
index a18f43e93abab..f92d7911f50e4 100644
--- a/include/drm/ttm/ttm_caching.h
+++ b/include/drm/ttm/ttm_caching.h
@@ -47,7 +47,8 @@  enum ttm_caching {
 
 	/**
 	 * @ttm_cached: Fully cached like normal system memory, requires that
-	 * devices snoop the CPU cache on accesses.
+	 * devices snoop the CPU cache on accesses. Downgraded to
+	 * ttm_write_combined when the snooping capaiblity is missing.
 	 */
 	ttm_cached
 };