[00/34] Beginning of multi-rail support for drivers/staging/lustre
mbox series

Message ID 153628058697.8267.6056114844033479774.stgit@noble
Headers show
Series
  • Beginning of multi-rail support for drivers/staging/lustre
Related show

Message

NeilBrown Sept. 7, 2018, 12:49 a.m. UTC
The following series implements the first patch in the
multi-rail series:
Commit: 8cbb8cd3e771 ("LU-7734 lnet: Multi-Rail local NI split")

I split that commit up into 40 individual commits which can be found
at
  https://github.com/neilbrown/lustre/commits/multirail
though you need to scroll down a bit, as that contains all the
multi-rail series.

I then ported most of these patches to my mainline tree.
Some that I haven't included are:
- lnet: Move lnet_msg_alloc/free down a bit.
    lnet_msg_alloc/free don't exist any more
- lnet: lib-types: change some tabs to spaces
- lnet - assorted whitespace changes.
- lnet: change ni_last_alive from time64_t to long
- lnet: add lnet_net_state
    net_state is never used.
- lnet: remove 'static' from lnet_get_net_config()

I've also made a couple of minor changes to individual patches not
strictly related to porting (the net_prio field is never used, so I
never added it - I should have made that a separate patch).

This series compiles, but doesn't work.  I get a NULL pointer
reference, then an assertion failure.  If I fix those, it hangs.
The NULL pointer ref and the failing assertion are gone with
later patches, so I hope the other problems are too.

Some of these patches have very poor descriptions, such as "I have no
idea what this does".  If someone would like to explain - or maybe say
"Oh, we really shouldn't have done that", I'd be very happy to
receive that, and update the description or patch accordingly.

These will all appear in my lustre-testing branch, but won't migrate
to 'lustre' until I, at least, have enough other patches that I can
get a successful test run.

Review and comments always welcome.

Thanks,
NeilBrown


---

Amir Shehata (1):
      Completely re-write lnet_parse_networks().

NeilBrown (33):
      struct lnet_ni - reformat comments.
      lnet: Create struct lnet_net
      lnet: struct lnet_ni: move ni_lnd to lnet_net
      lnet: embed lnd_tunables in lnet_ni
      lnet: begin separating "networks" from "network interfaces".
      lnet: store separate xmit/recv net-interface in each message.
      lnet: change lnet_peer to reference the net, rather than ni.
      lnet: add cpt to lnet_match_info.
      lnet: add list of cpts to lnet_net.
      lnet: add ni arg to lnet_cpt_of_nid()
      lnet: pass tun to lnet_startup_lndni, instead of full conf
      lnet: split lnet_startup_lndni
      lnet: reverse order of lnet_startup_lnd{net,ni}
      lnet: rename lnet_find_net_locked to lnet_find_rnet_locked
      lnet: extend zombie handling to nets and nis
      lnet: lnet_shutdown_lndnets - remove some cleanup code.
      lnet: move lnet_shutdown_lndnets down to after first use
      lnet: add ni_state
      lnet: simplify lnet_islocalnet()
      lnet: discard ni_cpt_list
      lnet: add net_ni_added
      lnet: don't take reference in lnet_XX2ni_locked()
      lnet: don't need lock to test ln_shutdown.
      lnet: don't take lock over lnet_net_unique()
      lnet: swap 'then' and 'else' branches in lnet_startup_lndnet
      lnet: only valid lnd_type when net_id is unique.
      lnet: make it possible to add a new interface to a network
      lnet: add checks to ensure network interface names are unique.
      lnet: track tunables in lnet_startup_lndnet()
      lnet: fix typo
      lnet: lnet_dyn_add_ni: fix ping_info count
      lnet: lnet_dyn_del_ni: fix ping_info count
      lnet: introduce use_tcp_bonding mod param


 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   31 -
 .../staging/lustre/include/linux/lnet/lib-types.h  |  142 ++-
 .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |   18 
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |   10 
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |    6 
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c |   12 
 .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |   74 +-
 .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   25 -
 drivers/staging/lustre/lnet/lnet/acceptor.c        |    8 
 drivers/staging/lustre/lnet/lnet/api-ni.c          |  939 +++++++++++++-------
 drivers/staging/lustre/lnet/lnet/config.c          |  688 +++++++++++----
 drivers/staging/lustre/lnet/lnet/lib-move.c        |  132 ++-
 drivers/staging/lustre/lnet/lnet/lib-ptl.c         |    6 
 drivers/staging/lustre/lnet/lnet/lo.c              |    2 
 drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 
 drivers/staging/lustre/lnet/lnet/peer.c            |   31 -
 drivers/staging/lustre/lnet/lnet/router.c          |   51 +
 drivers/staging/lustre/lnet/lnet/router_proc.c     |   24 -
 drivers/staging/lustre/lnet/selftest/brw_test.c    |    2 
 drivers/staging/lustre/lnet/selftest/framework.c   |    3 
 drivers/staging/lustre/lnet/selftest/selftest.h    |    2 
 21 files changed, 1507 insertions(+), 702 deletions(-)

--
Signature

Comments

James Simmons Sept. 10, 2018, 11:10 p.m. UTC | #1
> The following series implements the first patch in the
> multi-rail series:
> Commit: 8cbb8cd3e771 ("LU-7734 lnet: Multi-Rail local NI split")
> 
> I split that commit up into 40 individual commits which can be found
> at
>   https://github.com/neilbrown/lustre/commits/multirail
> though you need to scroll down a bit, as that contains all the
> multi-rail series.
> 
> I then ported most of these patches to my mainline tree.
> Some that I haven't included are:
> - lnet: Move lnet_msg_alloc/free down a bit.
>     lnet_msg_alloc/free don't exist any more
> - lnet: lib-types: change some tabs to spaces
> - lnet - assorted whitespace changes.
> - lnet: change ni_last_alive from time64_t to long
> - lnet: add lnet_net_state
>     net_state is never used.
> - lnet: remove 'static' from lnet_get_net_config()
> 
> I've also made a couple of minor changes to individual patches not
> strictly related to porting (the net_prio field is never used, so I
> never added it - I should have made that a separate patch).
>
> This series compiles, but doesn't work.  I get a NULL pointer
> reference, then an assertion failure.  If I fix those, it hangs.
> The NULL pointer ref and the failing assertion are gone with
> later patches, so I hope the other problems are too.
> 

I have tried it and did a compare to what landed in the OpenSFS branch.
I saw the failures in my testing and foudn the mistake in the 7th patch.

> Some of these patches have very poor descriptions, such as "I have no
> idea what this does".  If someone would like to explain - or maybe say
> "Oh, we really shouldn't have done that", I'd be very happy to
> receive that, and update the description or patch accordingly.

When I ran checkpatch it really dislikes:

This is part of
    8cbb8cd3e771e7f7e0f99cafc19fad32770dc015
       LU-7734 lnet: Multi-Rail local NI split

I don't recommend landing the above in the commit messsage as for the
reason that a person outside of lustre will not know where to look for
that git commit. Instead I recommend replacing it with:

------------------------------------------------------------------
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
Reviewed-on: http://review.whamcloud.com/18274
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Signed-off-by: NeilBrown <neilb@suse.com>

This gives the reviewer a URL link for both the JIRA ticket that usually
contains details not in the commit message as well as the gerrit URL
for the original patch. This way if a future bug is found a comparison
can be done against the original patch. 
 
The policy for the Lustre project is to perserve authorship for patches
when porting to other branches, upstream or LTS.

> These will all appear in my lustre-testing branch, but won't migrate
> to 'lustre' until I, at least, have enough other patches that I can
> get a successful test run.
> 
> Review and comments always welcome.
> 
> Thanks,
> NeilBrown
> 
> 
> ---
> 
> Amir Shehata (1):
>       Completely re-write lnet_parse_networks().
> 
> NeilBrown (33):
>       struct lnet_ni - reformat comments.
>       lnet: Create struct lnet_net
>       lnet: struct lnet_ni: move ni_lnd to lnet_net
>       lnet: embed lnd_tunables in lnet_ni
>       lnet: begin separating "networks" from "network interfaces".
>       lnet: store separate xmit/recv net-interface in each message.
>       lnet: change lnet_peer to reference the net, rather than ni.
>       lnet: add cpt to lnet_match_info.
>       lnet: add list of cpts to lnet_net.
>       lnet: add ni arg to lnet_cpt_of_nid()
>       lnet: pass tun to lnet_startup_lndni, instead of full conf
>       lnet: split lnet_startup_lndni
>       lnet: reverse order of lnet_startup_lnd{net,ni}
>       lnet: rename lnet_find_net_locked to lnet_find_rnet_locked
>       lnet: extend zombie handling to nets and nis
>       lnet: lnet_shutdown_lndnets - remove some cleanup code.
>       lnet: move lnet_shutdown_lndnets down to after first use
>       lnet: add ni_state
>       lnet: simplify lnet_islocalnet()
>       lnet: discard ni_cpt_list
>       lnet: add net_ni_added
>       lnet: don't take reference in lnet_XX2ni_locked()
>       lnet: don't need lock to test ln_shutdown.
>       lnet: don't take lock over lnet_net_unique()
>       lnet: swap 'then' and 'else' branches in lnet_startup_lndnet
>       lnet: only valid lnd_type when net_id is unique.
>       lnet: make it possible to add a new interface to a network
>       lnet: add checks to ensure network interface names are unique.
>       lnet: track tunables in lnet_startup_lndnet()
>       lnet: fix typo
>       lnet: lnet_dyn_add_ni: fix ping_info count
>       lnet: lnet_dyn_del_ni: fix ping_info count
>       lnet: introduce use_tcp_bonding mod param
> 
> 
>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   31 -
>  .../staging/lustre/include/linux/lnet/lib-types.h  |  142 ++-
>  .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |   18 
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |   10 
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |    6 
>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c |   12 
>  .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |   74 +-
>  .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   25 -
>  drivers/staging/lustre/lnet/lnet/acceptor.c        |    8 
>  drivers/staging/lustre/lnet/lnet/api-ni.c          |  939 +++++++++++++-------
>  drivers/staging/lustre/lnet/lnet/config.c          |  688 +++++++++++----
>  drivers/staging/lustre/lnet/lnet/lib-move.c        |  132 ++-
>  drivers/staging/lustre/lnet/lnet/lib-ptl.c         |    6 
>  drivers/staging/lustre/lnet/lnet/lo.c              |    2 
>  drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 
>  drivers/staging/lustre/lnet/lnet/peer.c            |   31 -
>  drivers/staging/lustre/lnet/lnet/router.c          |   51 +
>  drivers/staging/lustre/lnet/lnet/router_proc.c     |   24 -
>  drivers/staging/lustre/lnet/selftest/brw_test.c    |    2 
>  drivers/staging/lustre/lnet/selftest/framework.c   |    3 
>  drivers/staging/lustre/lnet/selftest/selftest.h    |    2 
>  21 files changed, 1507 insertions(+), 702 deletions(-)
> 
> --
> Signature
> 
>
NeilBrown Sept. 24, 2018, 6:58 a.m. UTC | #2
On Tue, Sep 11 2018, James Simmons wrote:

>> The following series implements the first patch in the
>> multi-rail series:
>> Commit: 8cbb8cd3e771 ("LU-7734 lnet: Multi-Rail local NI split")
>> 
>> I split that commit up into 40 individual commits which can be found
>> at
>>   https://github.com/neilbrown/lustre/commits/multirail
>> though you need to scroll down a bit, as that contains all the
>> multi-rail series.
>> 
>> I then ported most of these patches to my mainline tree.
>> Some that I haven't included are:
>> - lnet: Move lnet_msg_alloc/free down a bit.
>>     lnet_msg_alloc/free don't exist any more
>> - lnet: lib-types: change some tabs to spaces
>> - lnet - assorted whitespace changes.
>> - lnet: change ni_last_alive from time64_t to long
>> - lnet: add lnet_net_state
>>     net_state is never used.
>> - lnet: remove 'static' from lnet_get_net_config()
>> 
>> I've also made a couple of minor changes to individual patches not
>> strictly related to porting (the net_prio field is never used, so I
>> never added it - I should have made that a separate patch).
>>
>> This series compiles, but doesn't work.  I get a NULL pointer
>> reference, then an assertion failure.  If I fix those, it hangs.
>> The NULL pointer ref and the failing assertion are gone with
>> later patches, so I hope the other problems are too.
>> 
>
> I have tried it and did a compare to what landed in the OpenSFS branch.
> I saw the failures in my testing and foudn the mistake in the 7th patch.
>
>> Some of these patches have very poor descriptions, such as "I have no
>> idea what this does".  If someone would like to explain - or maybe say
>> "Oh, we really shouldn't have done that", I'd be very happy to
>> receive that, and update the description or patch accordingly.
>
> When I ran checkpatch it really dislikes:
>
> This is part of
>     8cbb8cd3e771e7f7e0f99cafc19fad32770dc015
>        LU-7734 lnet: Multi-Rail local NI split
>
> I don't recommend landing the above in the commit messsage as for the
> reason that a person outside of lustre will not know where to look for
> that git commit. Instead I recommend replacing it with:
>
> ------------------------------------------------------------------
> Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> Reviewed-on: http://review.whamcloud.com/18274
> Reviewed-by: Doug Oucharek <dougso@me.com>
> Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> Signed-off-by: NeilBrown <neilb@suse.com>

Thanks for the suggestion.  I don't like that approach exactly because
it seems to be a lie.  The specific patch was not reviewed by those
people, and there is useful information which is not included there.
I have changed to patches to include:

    This is part of
        Commit: 8cbb8cd3e771 ("LU-7734 lnet: Multi-Rail local NI split")
    from upstream lustre, where it is marked:
        Signed-off-by: Amir Shehata <amir.shehata@intel.com>
        WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
        Reviewed-on: http://review.whamcloud.com/18274
        Reviewed-by: Doug Oucharek <doug.s.oucharek@intel.com>
        Reviewed-by: Olaf Weber <olaf@sgi.com>

checkpatch is not happy with the indented tags, but checkpatch is a
servant, not the master.

I've made that change, moved this series to my 'lustre' branch, merged
in the latest -rc, and pushed it all out.

Now the post the rest of the 'MR' series, and then start looking at
Dynamic Discovery (756abb9cf00b9^..1c45d9051764)

Thanks,
NeilBrown

>
> This gives the reviewer a URL link for both the JIRA ticket that usually
> contains details not in the commit message as well as the gerrit URL
> for the original patch. This way if a future bug is found a comparison
> can be done against the original patch. 
>  
> The policy for the Lustre project is to perserve authorship for patches
> when porting to other branches, upstream or LTS.
>
>> These will all appear in my lustre-testing branch, but won't migrate
>> to 'lustre' until I, at least, have enough other patches that I can
>> get a successful test run.
>> 
>> Review and comments always welcome.
>> 
>> Thanks,
>> NeilBrown
>> 
>> 
>> ---
>> 
>> Amir Shehata (1):
>>       Completely re-write lnet_parse_networks().
>> 
>> NeilBrown (33):
>>       struct lnet_ni - reformat comments.
>>       lnet: Create struct lnet_net
>>       lnet: struct lnet_ni: move ni_lnd to lnet_net
>>       lnet: embed lnd_tunables in lnet_ni
>>       lnet: begin separating "networks" from "network interfaces".
>>       lnet: store separate xmit/recv net-interface in each message.
>>       lnet: change lnet_peer to reference the net, rather than ni.
>>       lnet: add cpt to lnet_match_info.
>>       lnet: add list of cpts to lnet_net.
>>       lnet: add ni arg to lnet_cpt_of_nid()
>>       lnet: pass tun to lnet_startup_lndni, instead of full conf
>>       lnet: split lnet_startup_lndni
>>       lnet: reverse order of lnet_startup_lnd{net,ni}
>>       lnet: rename lnet_find_net_locked to lnet_find_rnet_locked
>>       lnet: extend zombie handling to nets and nis
>>       lnet: lnet_shutdown_lndnets - remove some cleanup code.
>>       lnet: move lnet_shutdown_lndnets down to after first use
>>       lnet: add ni_state
>>       lnet: simplify lnet_islocalnet()
>>       lnet: discard ni_cpt_list
>>       lnet: add net_ni_added
>>       lnet: don't take reference in lnet_XX2ni_locked()
>>       lnet: don't need lock to test ln_shutdown.
>>       lnet: don't take lock over lnet_net_unique()
>>       lnet: swap 'then' and 'else' branches in lnet_startup_lndnet
>>       lnet: only valid lnd_type when net_id is unique.
>>       lnet: make it possible to add a new interface to a network
>>       lnet: add checks to ensure network interface names are unique.
>>       lnet: track tunables in lnet_startup_lndnet()
>>       lnet: fix typo
>>       lnet: lnet_dyn_add_ni: fix ping_info count
>>       lnet: lnet_dyn_del_ni: fix ping_info count
>>       lnet: introduce use_tcp_bonding mod param
>> 
>> 
>>  .../staging/lustre/include/linux/lnet/lib-lnet.h   |   31 -
>>  .../staging/lustre/include/linux/lnet/lib-types.h  |  142 ++-
>>  .../lustre/include/uapi/linux/lnet/lnet-dlc.h      |   18 
>>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |   10 
>>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |    6 
>>  .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c |   12 
>>  .../lustre/lnet/klnds/o2iblnd/o2iblnd_modparams.c  |   74 +-
>>  .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |   25 -
>>  drivers/staging/lustre/lnet/lnet/acceptor.c        |    8 
>>  drivers/staging/lustre/lnet/lnet/api-ni.c          |  939 +++++++++++++-------
>>  drivers/staging/lustre/lnet/lnet/config.c          |  688 +++++++++++----
>>  drivers/staging/lustre/lnet/lnet/lib-move.c        |  132 ++-
>>  drivers/staging/lustre/lnet/lnet/lib-ptl.c         |    6 
>>  drivers/staging/lustre/lnet/lnet/lo.c              |    2 
>>  drivers/staging/lustre/lnet/lnet/net_fault.c       |    3 
>>  drivers/staging/lustre/lnet/lnet/peer.c            |   31 -
>>  drivers/staging/lustre/lnet/lnet/router.c          |   51 +
>>  drivers/staging/lustre/lnet/lnet/router_proc.c     |   24 -
>>  drivers/staging/lustre/lnet/selftest/brw_test.c    |    2 
>>  drivers/staging/lustre/lnet/selftest/framework.c   |    3 
>>  drivers/staging/lustre/lnet/selftest/selftest.h    |    2 
>>  21 files changed, 1507 insertions(+), 702 deletions(-)
>> 
>> --
>> Signature
>> 
>>
James Simmons Sept. 29, 2018, 10:35 p.m. UTC | #3
> > Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
> > Reviewed-on: http://review.whamcloud.com/18274
> > Reviewed-by: Doug Oucharek <dougso@me.com>
> > Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
> > Signed-off-by: NeilBrown <neilb@suse.com>
> 
> Thanks for the suggestion.  I don't like that approach exactly because
> it seems to be a lie.  The specific patch was not reviewed by those
> people, and there is useful information which is not included there.
> I have changed to patches to include:
> 
>     This is part of
>         Commit: 8cbb8cd3e771 ("LU-7734 lnet: Multi-Rail local NI split")
>     from upstream lustre, where it is marked:
>         Signed-off-by: Amir Shehata <amir.shehata@intel.com>
>         WC-bug-id: https://jira.whamcloud.com/browse/LU-7734
>         Reviewed-on: http://review.whamcloud.com/18274
>         Reviewed-by: Doug Oucharek <doug.s.oucharek@intel.com>
>         Reviewed-by: Olaf Weber <olaf@sgi.com>
> 
> checkpatch is not happy with the indented tags, but checkpatch is a
> servant, not the master.

To my knowledge their isn't really a policy about this. What I have been
doing is kind of following how LTS versions of lustre have been handled. 
For LTS versions patches are cherry-picked and two additional lines are
added:

Lustre-change:
Lustre-commit:

The orginal reviews are keep. Also by including the original reviews the
people involved with those patches are poked. The only requirement is that
2 people review again. Not everyone has to review for it to land. Once 
landed I don't see a clear why to tell who reviewed.

In any case the above approach seems reasonable as long as the original
author is preserve. The general rule is the original patch poster normally
keeps authorship. Also I noticed patches recently pushed are not reaching
the original authors and reviewers. We should make sure that still 
happens.