[PATCHv4,0/6] scsi: use xarray for devices and targets

Message ID: 20200602113311.121513-1-hare@suse.de

Message

Hannes Reinecke June 2, 2020, 11:33 a.m. UTC
Hi all,

based on the ideas from Doug Gilbert here's now my take on using
xarrays for devices and targets.
It revolves around two ideas:

- The scsi target 'channel' and 'id' numbers are never ever used
  to the full 32 bit range; channels are well below 10, and no
  driver is using more than 16 bits for the id. So we can reduce
  the type of 'channel' and 'id' to 16 bits, and use the 32 bit
  value 'channel << 16 | id' as the index into the target xarray.
- Nearly every target only ever uses the first two levels of the
  4-level SCSI LUN structure, which means that we can use the
  linearized SCSI LUN id as an index into the xarray.
  If we ever come across targets utilizing more than 2 levels of
  the LUN structure we'll allocate the first unused index and have
  to resort to a less efficient lookup instead of direct indexing.
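
A minimal user-space sketch of the two index calculations described above. The helper names are hypothetical and not taken from the patches, and the LUN check assumes that the linearized 64-bit LUN keeps the first two levels in its low 32 bits:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: pack a 16-bit channel and a 16-bit id into the
 * 32-bit target index 'channel << 16 | id'. */
static inline uint32_t target_index(uint16_t channel, uint16_t id)
{
	return ((uint32_t)channel << 16) | id;
}

/* Recover the two halves from a packed index. */
static inline uint16_t index_to_channel(uint32_t idx)
{
	return idx >> 16;
}

static inline uint16_t index_to_id(uint32_t idx)
{
	return idx & 0xffff;
}

/* Assumption: levels 1 and 2 of the LUN occupy bits 0..31 of the
 * linearized value, levels 3 and 4 occupy bits 32..63. A LUN can be
 * used directly as an index only if the upper half is zero. */
static inline bool lun_fits_direct_index(uint64_t linear_lun)
{
	return (linear_lun >> 32) == 0;
}
```

With this packing, channel 1 / id 5 maps to index 0x10005, and any LUN using a third or fourth level falls back to the "first unused index" path.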

With these changes we can implement an efficient lookup mechanism,
devolving into direct lookup for most cases. It also allows us to
detect duplicate entries or accidental overwrites of existing elements
by using xa_cmpxchg().
And iteration over targets and devices should be as efficient as the
current, list-based, approach.
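
To illustrate the duplicate detection idea only: the sketch below is a user-space model built on C11 atomics, not the kernel xarray API. The table type, its fixed size and the function name are all made up for the example. It mimics the convention that a compare-and-exchange against an empty slot returns the previous contents, so a non-NULL return means somebody else already owns that index:

```c
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 16	/* toy size; a real xarray grows on demand */

struct slot_table {
	_Atomic(void *) slot[NSLOTS];
};

/* Store 'entry' at 'index' only if the slot is currently empty.
 * Returns NULL on success, or the entry that was already there
 * (i.e. a duplicate). The caller must ensure index < NSLOTS. */
static void *table_insert_unique(struct slot_table *t, size_t index,
				 void *entry)
{
	void *expected = NULL;

	if (atomic_compare_exchange_strong(&t->slot[index], &expected, entry))
		return NULL;
	/* On failure 'expected' has been updated to the current contents. */
	return expected;
}
```

A second insert at the same index leaves the first entry in place and hands it back to the caller, which can then report the duplicate instead of silently overwriting it.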

As usual, comments and reviews are welcome.

Changes to v2:
- Implement safe device iteration as noted by Doug
- Add an additional patch to avoid a pointless memory allocation
  in scsi_alloc_target()

Hannes Reinecke (6):
  scsi: convert target lookup to xarray
  target_core_pscsi: use __scsi_device_lookup()
  scsi: move target device list to xarray
  scsi: remove direct device lookup per host
  scsi_error: use xarray lookup instead of wrappers
  scsi: avoid pointless memory allocation in scsi_alloc_target()

 drivers/scsi/hosts.c               |   5 +-
 drivers/scsi/scsi.c                | 151 +++++++++++++++++++++++++++++--------
 drivers/scsi/scsi_error.c          |  35 +++++----
 drivers/scsi/scsi_lib.c            |   9 +--
 drivers/scsi/scsi_priv.h           |   2 +
 drivers/scsi/scsi_scan.c           | 101 +++++++++++++++----------
 drivers/scsi/scsi_sysfs.c          |  72 +++++++++++++-----
 drivers/target/target_core_pscsi.c |   8 +-
 include/scsi/scsi_device.h         |  32 +++++---
 include/scsi/scsi_host.h           |   5 +-
 10 files changed, 288 insertions(+), 132 deletions(-)

Comments

Christoph Hellwig June 3, 2020, 12:53 p.m. UTC | #1
On Tue, Jun 02, 2020 at 01:33:05PM +0200, Hannes Reinecke wrote:
> [...]

I see absolutely no argument for what the point of this series is.  It adds
more code, and I don't really see any indications for it fixing bugs,
speeding up workloads, or reducing memory usage.

Douglas Gilbert June 3, 2020, 6:23 p.m. UTC | #2
On 2020-06-03 8:53 a.m., Christoph Hellwig wrote:
> On Tue, Jun 02, 2020 at 01:33:05PM +0200, Hannes Reinecke wrote:
>> [...]
> 
> I see absolutely no argument for what the point of this series is.  It adds
> more code, and I don't really see any indications for it fixing bugs,
> speeding up workloads, or reducing memory usage.

Let's take memory usage first. The legacy design (part of which may have
been a later add-on) has three collections where two are needed:
    1) all targets in a host
    2) all sdev_s in a target
    3) all sdev_s in a host

So the third one is redundant and now removed (together with the
complexity of making sure those 3 collections are always in sync, seen
from the users' viewpoint). Each doubly linked list membership on 64
bit machines uses 16 bytes (2 eight byte pointers), and each sdev was
a member of two lists (2 and 3 above). So that is a 32 byte reduction
in each sdev object. The proposed solution adds 0 bytes because it
uses the LUN, which is already there, as the index.
Similar but smaller win in scsi_target objects.
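
The arithmetic can be checked with a small user-space sketch. The two structs are heavily reduced, hypothetical stand-ins for the sdev object, and the list member names are illustrative rather than quoted from the patches:

```c
#include <stddef.h>

/* Same shape as a kernel list node: two pointers. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical stand-in: sdev linked into two lists. */
struct sdev_with_lists {
	unsigned long long lun;
	struct list_head siblings;		/* all sdevs in a host */
	struct list_head same_target_siblings;	/* all sdevs in a target */
};

/* Hypothetical stand-in: sdev found via an index, no list nodes. */
struct sdev_with_xarray {
	unsigned long long lun;			/* doubles as the index */
};

/* Per-object saving from dropping both list nodes. */
static size_t bytes_saved(void)
{
	return sizeof(struct sdev_with_lists) -
	       sizeof(struct sdev_with_xarray);
}
```

With 8 byte pointers each list node is 16 bytes, so bytes_saved() comes out as 32, matching the figure above. This only counts the per-object cost and ignores whatever the index structure itself allocates internally.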

There are also some locks and mutexes in the three level object
tree (host-target-sdev[LU]) that can probably be dispensed with
as xarrays come with their own locks. That has not been done yet
making both my earlier proposal and this one "overlocked". And
locks and mutexes take up space in objects and slow things down.


The speeding up will come in big machine startup and shutdown and
its reaction time to disruptions (e.g. cable disconnected to a disk
array) IMO. xarray and explicit parent pointers give us a faster
way to navigate up and down the object tree. With this patchset we
have an O(ln(n)) lookup in the downward direction where currently we
only have O(n). Very little use is made of the "lookup" functions in
the API because users could see that it was just an iteration
(i.e. O(n)). Hopefully transports will take advantage of faster
lookups and perhaps implement their own xarrays. Even the upward
navigation can be complicated by transports inserting levels between
the host and the target. This is what the SCSI mid-layer object tree
looks like moving upwards from a SAS SSD, connected to a
SAS expander, moving up to its host (an HBA):
     scsi_device, ptr=ffff99d23f513960
     scsi_target, ptr=ffff99d241595c28
     sas_rphy, ptr=ffff99d242519c00
     sas_port, ptr=ffff99d24251ec00
     sas_expander_device, ptr=ffff99d23f4c6438
     sas_port, ptr=ffff99d23f4c7400
     Scsi_Host, ptr=ffff99d2425261f8

There already is a scsi_device::host redundant pointer to bypass the
oft-called and slow-walking dev_to_shost(). I'm proposing another
redundant scsi_target::parent_shost pointer that will bypass seven
dev_to_shost() invocations.
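
As a rough user-space sketch of why a cached pointer helps: the types below are hypothetical and only model "walk the parent chain until you hit a host" versus "follow one stored pointer". They are not the kernel's struct device or dev_to_shost():

```c
#include <stddef.h>

enum obj_type { OBJ_HOST, OBJ_OTHER };

/* Hypothetical generic tree node with a parent link. */
struct obj {
	enum obj_type type;
	struct obj *parent;
};

/* Walk parent pointers until a host is found, counting the hops. */
static struct obj *walk_to_host(struct obj *o, int *hops)
{
	*hops = 0;
	while (o && o->type != OBJ_HOST) {
		o = o->parent;
		(*hops)++;
	}
	return o;
}

/* Hypothetical target that also caches its host, trading one
 * pointer of memory for a constant-time upward lookup. */
struct target {
	struct obj obj;
	struct obj *parent_shost;	/* set once at creation */
};
```

For the chain in the listing above, walking from the scsi_target to the Scsi_Host takes five hops (sas_rphy, sas_port, sas_expander_device, sas_port, Scsi_Host), while the cached pointer is a single dereference.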

Currently all iterations are done under the host_lock as that is
required for doubly linked list safety. xarray uses rcu read locks
on all non-modifying operations including iterations and if we can
safely rely on them, that will increase the available parallelism
within one host.

Finally the SCSI fast path will usually require the presence of the
corresponding sdev object, preferably cached. So making it smaller
will help.

Doug Gilbert


P.S. I sidestepped the "bugs" issue. Surely we will add some but
it is hard to believe when you wade into the complexity of the
currently linked collections and their myriad of locks, that there
aren't subtle bugs in the existing code. I have been working with
xarrays for about 1 year and finding locking issues is easier
with xarrays compared to "roll your own" linked list locking, IMO.

Hannes Reinecke June 4, 2020, 4:12 p.m. UTC | #3
On 6/3/20 2:53 PM, Christoph Hellwig wrote:
> On Tue, Jun 02, 2020 at 01:33:05PM +0200, Hannes Reinecke wrote:
>> [...]
> 
> I see absolutely no argument for what the point of this series is.  It adds
> more code, and I don't really see any indications for it fixing bugs,
> speeding up workloads, or reducing memory usage.
> 
From my perspective this is a proof-of-concept; using xarrays to store
targets and LUNs has the benefit that we can directly access the
elements, and the lookup will be more efficient for larger setups.

But it's not a clear-cut solution; it merely replaces one concept and
its issues with another concept and a different set of issues.

Guess the real benefit will come only if we manage to move to explicit 
scsi target removal, and not the implicit model of making the scsi 
target dependent on the underlying scsi devices we have now.
I'll be experimenting with that and will post an update for it.

I _do_ like the xarray for targets, though; they have a fixed location
where they can go and as such xarrays are a far more natural choice.
For LUNs it's less compelling as xarrays can't use 64 bits generically as
index, but still.

Cheers,

Hannes