[-next,00/11] lib/interval-tree: move to half closed intervals

Message ID 20191003201858.11666-1-dave@stgolabs.net (mailing list archive)

Message

Davidlohr Bueso Oct. 3, 2019, 8:18 p.m. UTC
Hi,

It has been discussed[1,2] that almost all users of interval trees would be
better served if the intervals were not [a,b], but instead [a, b). This
series attempts to convert all callers by way of transitioning from using
"interval_tree_generic.h" to "interval_tree_gen.h". Once all users are converted,
we remove the former.
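
To make the distinction concrete, here is a minimal userspace sketch of the
two conventions. It is not code from this series, and the helper names are
hypothetical. The only difference in the overlap test is whether the upper
endpoint is compared inclusively or exclusively:

```c
#include <stdbool.h>

/* Closed convention [start, last]: both endpoints belong to the interval. */
static bool closed_overlaps(unsigned long a_start, unsigned long a_last,
			    unsigned long b_start, unsigned long b_last)
{
	return a_start <= b_last && b_start <= a_last;
}

/* Half-open convention [start, end): 'end' is the first value past the interval. */
static bool half_open_overlaps(unsigned long a_start, unsigned long a_end,
			       unsigned long b_start, unsigned long b_end)
{
	return a_start < b_end && b_start < a_end;
}
```

With half-open intervals, two adjacent page ranges such as [0, 4096) and
[4096, 8192) do not overlap, and a caller holding start and size can pass
start + size directly instead of start + size - 1.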

Patch 1: adds a call that will make patch 8 easier to review by introducing stab
         queries for the vma interval tree.

Patch 2: adds the new interval_tree_gen.h which is the same as the old one but
         uses [a,b) intervals.

Patches 3-9: convert, in baby steps (as much as possible), each interval tree
             to the new [a,b) one. It is done this way also to maintain
             bisectability. Most conversions are pretty straightforward;
             however, there are some creative ways in which some callers use
             the interval 'end' when going through intersecting ranges within
             a tree, e.g. patches 3, 6 and 9.

Patch 10: deletes the interval_tree_generic.h header; there are no longer any users.

Patch 11: finally simplifies x86 pat tree to use the new interval tree machinery.

This has been lightly tested, and certainly not on driver paths that do
non-trivial conversions. It also needs more eyeballs, as conversions can be
easily missed (even though I've tried to mitigate this by renaming the endpoint
from 'last' to 'end' in each corresponding structure).

Because this touches a lot of drivers, I'm Cc'ing the whole thing to a couple of
relevant lists (mm, dri, rdma); sorry if you consider this spam.

Applies on top of today's linux-next tree. Please consider for v5.5.

Thanks!

[1] https://lore.kernel.org/lkml/CANN689HVDJXKEwB80yPAVwvRwnV4HfiucQVAho=dupKM_iKozw@mail.gmail.com/
[2] https://lore.kernel.org/patchwork/patch/1114629/

Davidlohr Bueso (11):
  mm: introduce vma_interval_tree_foreach_stab()
  lib/interval-tree: add an equivalent tree with [a,b) intervals
  drm/amdgpu: convert amdgpu_vm_it to half closed intervals
  drm: convert drm_mm_interval_tree to half closed intervals
  IB/hfi1: convert __mmu_int_rb to half closed intervals
  IB,usnic: convert usnic_uiom_interval_tree to half closed intervals
  vhost: convert vhost_umem_interval_tree to half closed intervals
  mm: convert vma_interval_tree to half closed intervals
  lib/interval-tree: convert interval_tree to half closed intervals
  lib: drop interval_tree_generic.h
  x86/mm, pat: convert pat tree to generic interval tree

 arch/arm/mm/fault-armv.c                           |   2 +-
 arch/arm/mm/flush.c                                |   2 +-
 arch/nios2/mm/cacheflush.c                         |   2 +-
 arch/parisc/kernel/cache.c                         |   2 +-
 arch/x86/mm/pat.c                                  |  22 +--
 arch/x86/mm/pat_rbtree.c                           | 151 +++++----------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c             |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c             |  12 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h         |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h          |  18 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c            |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c            |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c             |  47 ++++---
 drivers/gpu/drm/drm_mm.c                           |   8 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c        |   5 +-
 .../gpu/drm/i915/gem/selftests/i915_gem_context.c  |   2 +-
 drivers/gpu/drm/radeon/radeon_mn.c                 |  11 +-
 drivers/gpu/drm/radeon/radeon_trace.h              |   2 +-
 drivers/gpu/drm/radeon/radeon_vm.c                 |  26 ++--
 drivers/gpu/drm/selftests/test-drm_mm.c            |   2 +-
 drivers/infiniband/core/umem_odp.c                 |  21 +--
 drivers/infiniband/hw/hfi1/mmu_rb.c                |  15 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c           |   8 +-
 .../infiniband/hw/usnic/usnic_uiom_interval_tree.c |  26 ++--
 .../infiniband/hw/usnic/usnic_uiom_interval_tree.h |   2 +-
 drivers/iommu/virtio-iommu.c                       |   6 +-
 drivers/vhost/vhost.c                              |  19 ++-
 drivers/vhost/vhost.h                              |   4 +-
 fs/dax.c                                           |   2 +-
 include/drm/drm_mm.h                               |   6 +-
 include/linux/interval_tree.h                      |   2 +-
 ...interval_tree_generic.h => interval_tree_gen.h} |  72 +++++-----
 include/linux/mm.h                                 |   6 +
 include/rdma/ib_umem_odp.h                         |   4 +-
 kernel/events/uprobes.c                            |   2 +-
 lib/interval_tree.c                                |   6 +-
 mm/hugetlb.c                                       |   4 +-
 mm/interval_tree.c                                 |   4 +-
 mm/khugepaged.c                                    |   2 +-
 mm/memory-failure.c                                |   6 +-
 mm/memory.c                                        |   2 +-
 mm/nommu.c                                         |   2 +-
 mm/rmap.c                                          |   6 +-
 43 files changed, 217 insertions(+), 333 deletions(-)
 rename include/linux/{interval_tree_generic.h => interval_tree_gen.h} (72%)

Comments

Matthew Wilcox Oct. 3, 2019, 8:32 p.m. UTC | #1
On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> It has been discussed[1,2] that almost all users of interval trees would better
> be served if the intervals were actually not [a,b], but instead [a, b). This

So how does a user represent a range from ULONG_MAX to ULONG_MAX now?

I think the problem is that large parts of the kernel just don't consider
integer overflow.  Because we write in C, it's natural to write:

	for (i = start; i < end; i++)

and just assume that we never need to hit ULONG_MAX or UINT_MAX.
If we're storing addresses, that's generally true -- most architectures
don't allow addresses in the -PAGE_SIZE to ULONG_MAX range (or they'd
have trouble with PTR_ERR).  If you're looking at file sizes, that's
not true on 32-bit machines, and we've definitely seen filesystem bugs
with files nudging up on 16TB (on 32 bit with 4k page size).  Or block
driver bugs with similarly sized block devices.
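
As a rough illustration of that corner case (a userspace sketch with
hypothetical helper names, not kernel code): a half-open loop can never
visit ULONG_MAX, while a closed range can, provided the loop is written so
that the index is not incremented past the last element.

```c
#include <limits.h>

/* Half-open [start, end): the usual C idiom; 'end' itself is never visited. */
static unsigned long visit_half_open(unsigned long start, unsigned long end)
{
	unsigned long n = 0;

	for (unsigned long i = start; i < end; i++)
		n++;
	return n;
}

/*
 * Closed [start, last]: 'last' is visited. The loop exits before
 * incrementing past 'last', so last == ULONG_MAX does not wrap around.
 */
static unsigned long visit_closed(unsigned long start, unsigned long last)
{
	unsigned long n = 0;
	unsigned long i = start;

	if (start > last)
		return 0;
	for (;;) {
		n++;
		if (i == last)
			break;
		i++;
	}
	return n;
}
```

Here visit_half_open(ULONG_MAX, ULONG_MAX) visits nothing, whereas
visit_closed(ULONG_MAX, ULONG_MAX) visits exactly one value.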

So, yeah, easier to use.  But damning corner cases.

Davidlohr Bueso Oct. 3, 2019, 9:10 p.m. UTC | #2
On Thu, 03 Oct 2019, Matthew Wilcox wrote:

>On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
>> It has been discussed[1,2] that almost all users of interval trees would better
>> be served if the intervals were actually not [a,b], but instead [a, b). This
>
>So how does a user represent a range from ULONG_MAX to ULONG_MAX now?

I would assume that any such lookups would be stab queries (anon/vma interval
tree). So both anon and files. And yeah, I blissfully ignored any overflow scenarios.
This should at least be documented.

>
>I think the problem is that large parts of the kernel just don't consider
>integer overflow.  Because we write in C, it's natural to write:
>
>	for (i = start; i < end; i++)
>
>and just assume that we never need to hit ULONG_MAX or UINT_MAX.

Similarly, I did not adjust queries such as 0 to ULONG_MAX, which are actually
real; then again, any intersecting ranges will most likely not even be close to
the end.

>If we're storing addresses, that's generally true -- most architectures
>don't allow addresses in the -PAGE_SIZE to ULONG_MAX range (or they'd
>have trouble with PTR_ERR).  If you're looking at file sizes, that's
>not true on 32-bit machines, and we've definitely seen filesystem bugs
>with files nudging up on 16TB (on 32 bit with 4k page size).  Or block
>driver bugs with similarly sized block devices.
>
>So, yeah, easier to use.  But damning corner cases.

I agree.

Thanks,
Davidlohr

Jason Gunthorpe Oct. 4, 2019, 12:26 a.m. UTC | #3
On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> Hi,
> 
> It has been discussed[1,2] that almost all users of interval trees would better
> be served if the intervals were actually not [a,b], but instead [a, b). This
> series attempts to convert all callers by way of transitioning from using
> "interval_tree_generic.h" to "interval_tree_gen.h". Once all users are converted,
> we remove the former.
> 
> Patch 1: adds a call that will make patch 8 easier to review by introducing stab
>          queries for the vma interval tree.
> 
> Patch 2: adds the new interval_tree_gen.h which is the same as the old one but
>          uses [a,b) intervals.
> 
> Patch 3-9: converts, in baby steps (as much as possible), each interval tree to
> 	   the new [a,b) one. It is done this way also to maintain bisectability.
> 	   Most conversions are pretty straightforward, however, there are some
> 	   creative ways in which some callers use the interval 'end' when going
> 	   through intersecting ranges within a tree. Ie: patch 3, 6 and 9.
> 
> Patch 10: deletes the interval_tree_generic.h header; there are no longer any users.
> 
> Patch 11: finally simplifies x86 pat tree to use the new interval tree machinery.
> 
> This has been lightly tested, and certainly not on driver paths that do non
> trivial conversions. Also needs more eyeballs as conversions can be easily
> missed (even when I've tried mitigating this by renaming the endpoint from 'last'
> to 'end' in each corresponding structure).
> 
> Because this touches a lot of drivers, I'm Cc'ing the whole thing to a couple of
> relevant lists (mm, dri, rdma); sorry if you consider this spam.
> 
> Applies on top of today's linux-next tree. Please consider for v5.5.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/lkml/CANN689HVDJXKEwB80yPAVwvRwnV4HfiucQVAho=dupKM_iKozw@mail.gmail.com/

Hurm, this is not entirely accurate. Most users do actually want
overlapping and multiple ranges. I just studied this extensively:

radeon_mn actually wants overlapping but seems to misunderstand the
interval_tree API and actively tries hard to prevent overlapping at
great cost and complexity. I have a patch to delete all of this and
just be overlapping.

amdgpu_mn copied the wrongness from radeon_mn.

All the DRM drivers are basically the same here, tracking userspace
controlled VAs, so overlapping is essential.

hfi1/mmu_rb definitely needs overlapping as it is dealing with
userspace VA ranges under control of userspace. As do the other
infiniband users.

vhost probably doesn't overlap in the normal case, but again userspace
could trigger overlap in some pathological case.

The [start,last] allows the interval to cover up to ULONG_MAX. I don't
know if this is needed however. Many users are using userspace VAs
here. Is there any kernel configuration where ULONG_MAX is a valid
userspace pointer? I.e., 32-bit 4G userspace? I don't know.

Many users seemed to have bugs where they were taking a userspace
controlled start + length and converting them into a start/end for
interval tree without overflow protection (woops)
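
A minimal userspace sketch of that class of bug, assuming GCC or Clang for
the __builtin_add_overflow() builtin. The helper names are hypothetical and
this is not taken from any of the drivers:

```c
#include <limits.h>
#include <stdbool.h>

/* Unchecked: wraps silently when start + length exceeds ULONG_MAX. */
static unsigned long unchecked_end(unsigned long start, unsigned long length)
{
	return start + length;
}

/*
 * Checked: returns false on wrap-around instead of handing the tree an
 * 'end' that is smaller than 'start'.
 */
static bool checked_end(unsigned long start, unsigned long length,
			unsigned long *end)
{
	return !__builtin_add_overflow(start, length, end);
}
```

With start = ULONG_MAX - 10 and length = 100, the unchecked version yields
89, which is below start, while the checked version rejects the request.
In-kernel code would more likely use check_add_overflow() from
<linux/overflow.h> for the same purpose.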

Also I have a series already cooking to delete several of these
interval tree users, which will terribly conflict with this :\

Is it really necessary to make such churn for such a tiny API change?

Jason

Davidlohr Bueso Oct. 4, 2019, 2:48 a.m. UTC | #4
On Thu, 03 Oct 2019, Jason Gunthorpe wrote:

>On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
>> Hi,
>>
>> It has been discussed[1,2] that almost all users of interval trees would better
>> be served if the intervals were actually not [a,b], but instead [a, b). This
>> series attempts to convert all callers by way of transitioning from using
>> "interval_tree_generic.h" to "interval_tree_gen.h". Once all users are converted,
>> we remove the former.
>>
>> Patch 1: adds a call that will make patch 8 easier to review by introducing stab
>>          queries for the vma interval tree.
>>
>> Patch 2: adds the new interval_tree_gen.h which is the same as the old one but
>>          uses [a,b) intervals.
>>
>> Patch 3-9: converts, in baby steps (as much as possible), each interval tree to
>> 	   the new [a,b) one. It is done this way also to maintain bisectability.
>> 	   Most conversions are pretty straightforward, however, there are some
>> 	   creative ways in which some callers use the interval 'end' when going
>> 	   through intersecting ranges within a tree. Ie: patch 3, 6 and 9.
>>
>> Patch 10: deletes the interval_tree_generic.h header; there are no longer any users.
>>
>> Patch 11: finally simplifies x86 pat tree to use the new interval tree machinery.
>>
>> This has been lightly tested, and certainly not on driver paths that do non
>> trivial conversions. Also needs more eyeballs as conversions can be easily
>> missed (even when I've tried mitigating this by renaming the endpoint from 'last'
>> to 'end' in each corresponding structure).
>>
>> Because this touches a lot of drivers, I'm Cc'ing the whole thing to a couple of
>> relevant lists (mm, dri, rdma); sorry if you consider this spam.
>>
>> Applies on top of today's linux-next tree. Please consider for v5.5.
>>
>> Thanks!
>>
>> [1] https://lore.kernel.org/lkml/CANN689HVDJXKEwB80yPAVwvRwnV4HfiucQVAho=dupKM_iKozw@mail.gmail.com/
>
>Hurm, this is not entirely accurate. Most users do actually want
>overlapping and multiple ranges. I just studied this extensively:
>
>radeon_mn actually wants overlapping but seems to mis-understand the
>interval_tree API and actively tries hard to prevent overlapping at
>great cost and complexity. I have a patch to delete all of this and
>just be overlapping.
>
>amdgpu_mn copied the wrongness from radeon_mn
>
>All the DRM drivers are basically the same here, tracking userspace
>controlled VAs, so overlapping is essential
>
>hfi1/mmu_rb definitely needs overlapping as it is dealing with
>userspace VA ranges under control of userspace. As do the other
>infiniband users.
>
>vhost probably doesn't overlap in the normal case, but again userspace
>could trigger overlap in some pathological case.
>
>The [start,last] allows the interval to cover up to ULONG_MAX. I don't
>know if this is needed however. Many users are using userspace VAs
>here. Is there any kernel configuration where ULONG_MAX is a valid
>userspace pointer? Ie 32 bit 4G userspace? I don't know.
>
>Many users seemed to have bugs where they were taking a userspace
>controlled start + length and converting them into a start/end for
>interval tree without overflow protection (woops)
>
>Also I have a series already cooking to delete several of these
>interval tree users, which will terribly conflict with this :\

I have no problem redoing after your changes; if it's worth it
at all.

>
>Is it really necessary to make such churn for such a tiny API change?

I agree, and was kind of expecting this. In general the diffstat ended
up being larger than I initially hoped for. Maybe after your removals
I can look into this again.

Thanks,
Davidlohr

Michel Lespinasse Oct. 4, 2019, 12:43 p.m. UTC | #5
On Thu, Oct 03, 2019 at 01:32:50PM -0700, Matthew Wilcox wrote:
> On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> > It has been discussed[1,2] that almost all users of interval trees would better
> > be served if the intervals were actually not [a,b], but instead [a, b). This
> 
> So how does a user represent a range from ULONG_MAX to ULONG_MAX now?
> 
> I think the problem is that large parts of the kernel just don't consider
> integer overflow.  Because we write in C, it's natural to write:
> 
> 	for (i = start; i < end; i++)
> 
> and just assume that we never need to hit ULONG_MAX or UINT_MAX.
> If we're storing addresses, that's generally true -- most architectures
> don't allow addresses in the -PAGE_SIZE to ULONG_MAX range (or they'd
> have trouble with PTR_ERR).  If you're looking at file sizes, that's
> not true on 32-bit machines, and we've definitely seen filesystem bugs
> with files nudging up on 16TB (on 32 bit with 4k page size).  Or block
> driver bugs with similarly sized block devices.
> 
> So, yeah, easier to use.  But damning corner cases.

Yeah, I wanted to ask - is the case where pgoff == ULONG_MAX (i.e., the
last block of a file that is exactly 16TB) currently supported on
32-bit archs?
I have no idea if I am supposed to care about this or not...

Michel Lespinasse Oct. 4, 2019, 1:15 p.m. UTC | #6
Hi Jason,

On Thu, Oct 3, 2019 at 5:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> Hurm, this is not entirely accurate. Most users do actually want
> overlapping and multiple ranges. I just studied this extensively:

(Just curious, are you the person we discussed this with after the
Maple Tree talk at LPC 2019?)

I think we have two separate API problems there:
- overlapping vs non-overlapping intervals (the interval tree API
supports overlapping intervals, but some users are confused about
this)
- closed vs half-open interval definitions

It looks like you have been looking mostly at the first issue, which I
expect could simplify several interval tree users considerably, while
Davidlohr is addressing the second issue here.

> radeon_mn actually wants overlapping but seems to mis-understand the
> interval_tree API and actively tries hard to prevent overlapping at
> great cost and complexity. I have a patch to delete all of this and
> just be overlapping.
>
> amdgpu_mn copied the wrongness from radeon_mn
>
> All the DRM drivers are basically the same here, tracking userspace
> controlled VAs, so overlapping is essential
>
> hfi1/mmu_rb definitely needs overlapping as it is dealing with
> userspace VA ranges under control of userspace. As do the other
> infiniband users.

Do you have a handle on what usnic is doing with its intervals?
usnic_uiom_insert_interval() has some complicated logic to avoid
having overlapping intervals, which is very confusing to me.

> vhost probably doesn't overlap in the normal case, but again userspace
> could trigger overlap in some pathological case.
>
> The [start,last] allows the interval to cover up to ULONG_MAX. I don't
> know if this is needed however. Many users are using userspace VAs
> here. Is there any kernel configuration where ULONG_MAX is a valid
> userspace pointer? Ie 32 bit 4G userspace? I don't know.
>
> Many users seemed to have bugs where they were taking a userspace
> controlled start + length and converting them into a start/end for
> interval tree without overflow protection (woops)
>
> Also I have a series already cooking to delete several of these
> interval tree users, which will terribly conflict with this :\
>
> Is it really necessary to make such churn for such a tiny API change?

My take is that this (Davidlohr's) patch series does not necessarily
need to be applied all at once - we could get the first change in
(adding the interval_tree_gen.h header), and convert the first few
users, without getting them all at once, as long as we have a plan for
finishing the work. So, if you have cleanups in progress in some of
the files, just tell us which ones and we can leave them out from the
first pass.

Thanks,

Matthew Wilcox Oct. 4, 2019, 4:03 p.m. UTC | #7
On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
> My take is that this (Davidlohr's) patch series does not necessarily
> need to be applied all at once - we could get the first change in
> (adding the interval_tree_gen.h header), and convert the first few
> users, without getting them all at once, as long as we have a plan for
> finishing the work. So, if you have cleanups in progress in some of
> the files, just tell us which ones and we can leave them out from the
> first pass.

Since we have users which do need to use the full ULONG_MAX range
(as pointed out by Christian Koenig), I don't think adding a second
implementation which is half-open is a good idea.  It'll only lead to
confusion.

Jason Gunthorpe Oct. 4, 2019, 5:45 p.m. UTC | #8
On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
> Hi Jason,
> 
> On Thu, Oct 3, 2019 at 5:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > Hurm, this is not entirely accurate. Most users do actually want
> > overlapping and multiple ranges. I just studied this extensively:
> 
> (Just curious, are you the person we discussed this with after the
> Maple Tree talk at LPC 2019 ?)

Possibly!
 
> I think we have two separate API problems there:
> - overlapping vs non-overlapping intervals (the interval tree API
> supports overlapping intervals, but some users are confused about
> this)

I think we just have a bunch of confused drivers, ie the two drm
drivers sure look confused to me.

> - closed vs half-open interval definitions

I'm not sure why this is a big problem..

We may actually just have bugs in handling the '-1' as it is supposed
to be written as start + (size-1) so that start + size == ULONG_MAX+1
works properly.
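
Roughly what I mean, as a userspace sketch with a hypothetical helper name
(not code from any driver): computing the closed endpoint as
start + (size - 1) never forms the intermediate start + size, so a range
that ends exactly at the top of the address space is still accepted.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Compute the closed endpoint 'last' for a (start, size) pair.
 * Returns false for an empty range or one that does not fit.
 */
static bool closed_last(unsigned long start, unsigned long size,
			unsigned long *last)
{
	if (size == 0)
		return false;
	*last = start + (size - 1);
	/* An unsigned sum that wrapped is always smaller than 'start'. */
	return *last >= start;
}
```

For start = ULONG_MAX - 4095 and size = 4096, start + size wraps to 0, so a
check written against start + size would reject the range, while the helper
above returns last == ULONG_MAX.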

> > hfi1/mmu_rb definitely needs overlapping as it is dealing with
> > userspace VA ranges under control of userspace. As do the other
> > infiniband users.
> 
> Do you have a handle on what usnic is doing with its intervals ?
> usnic_uiom_insert_interval() has some complicated logic to avoid
> having overlapping intervals, which is very confusing to me.

I don't know why it is so complicated, but I can say that it is
storing userspace VA's in that tree.

I have some feeling this driver is trying to use the IOMMU to create a
mirror of the userspace VA

Userspace can request the HW be able to access any set of overlapping
regions and so the driver must intersect all the ranges and compute a
list of VA pages to IOMMU map. Just guessing.

Jason

Davidlohr Bueso Oct. 4, 2019, 7:35 p.m. UTC | #9
On Fri, 04 Oct 2019, Matthew Wilcox wrote:

>On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
>> My take is that this (Davidlohr's) patch series does not necessarily
>> need to be applied all at once - we could get the first change in
>> (adding the interval_tree_gen.h header), and convert the first few
>> users, without getting them all at once, as long as we have a plan for
>> finishing the work. So, if you have cleanups in progress in some of
>> the files, just tell us which ones and we can leave them out from the
>> first pass.
>
>Since we have users which do need to use the full ULONG_MAX range
>(as pointed out by Christian Koenig), I don't think adding a second
>implementation which is half-open is a good idea.  It'll only lead to
>confusion.

Right, we should not have two implementations.

Thanks,
Davidlohr