Add rdma service for kernel boot support

Message ID 20170713182018.GE11069@obsidianresearch.com (mailing list archive)
State Superseded

Commit Message

Jason Gunthorpe July 13, 2017, 6:20 p.m. UTC
On Thu, Jul 13, 2017 at 05:48:22PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 19:20 +0200, Benjamin Drung wrote:
> > Currently upstream does not provide a rdma or openibd service. Only the
> > RedHat package ships a rdma service and rdma modules configuration
> > files. The ibacm and srp_daemon services depend on the openibd service.
> > 
> > Make RedHat's rdma service available to all distros by cherry-picking
> > the basic files for the rdma service to run and trim them down to the
> > minimum. Do not pick workarounds or other quirks that might not be needed
> > any more. Then replace the openibd service dependency by the rdma
> > service.
> 
> Hello Benjamin,
> 
> Shouldn't the rdma / openibd service be split into multiple services -
> one for IPoIB, one for the SRP initiator driver, one for the SRP target
> driver, one for the iSER initiator driver, one for the iSER target driver,
> one for RDS and one for NFSoRDMA? I think that would allow us to get rid
> of the rdma.conf configuration file and also that this would allow systemd
> to load those services concurrently that do not depend on each other.

I broadly agree.

For instance, Bart's work has already completely fixed srp to load
properly and race free. The missing piece is to have it also autoload
the required srp kernel module.

Benjamin, I have a different, simpler suggestion for some of the other
bits, let me post an RFC

Here is an untested attempt to move module loading into srp_daemon -
what do you think Bart?

Jason

From 28c1e19719ed850bc8f328d187c536c97eef7fd7 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Date: Thu, 13 Jul 2017 12:10:07 -0600
Subject: [PATCH] srp: Autoload the SRP kernel module if required

Doing this before starting the daemon ensures the daemon will work.
Since the daemon is using sysfs files there is no easy way to have the
kernel auto load it.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
---
 srp_daemon/srp_daemon.service.in       |  2 +-
 srp_daemon/srp_daemon_port@.service.in |  1 +
 srp_daemon/srp_kernel_module.service   | 12 ++++++++++++
 3 files changed, 14 insertions(+), 1 deletion(-)
 create mode 100644 srp_daemon/srp_kernel_module.service

Comments

Bart Van Assche July 13, 2017, 6:33 p.m. UTC | #1
On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> Here is an untested attempt to move module loading into srp_daemon -
> what do you think Bart?

Hello Jason,

Thanks for the patch. To me this looks like a nice simplification compared
to the current approach. Do you want me to test this patch?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe July 13, 2017, 10:15 p.m. UTC | #2
On Thu, Jul 13, 2017 at 06:33:07PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> > Here is an untested attempt to move module loading into srp_daemon -
> > what do you think Bart?
> 
> Hello Jason,
> 
> Thanks for the patch. To me this looks like a nice simplification compared
> to the current approach. Do you want me to test this patch?

I did some more steps..

Benjamin/Bart let me know what you think, I'll send the patches to the
list if you like the idea.

https://github.com/jgunthorpe/rdma-plumbing/tree/systemd

This replaces Benjamin's rdma.conf-based approach.

The basic approach is to directly use systemd to load the modules and
determine what modules to load by using udev and explicit Requires in
the units.

Eg, the extra modules for mlx4 are loaded by triggering this udev rule:

SUBSYSTEM=="module", KERNEL=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"

Which causes the module loader to load the contents of the modules
list in /etc/rdma/modules/mlx4.conf

If the user does not want a specific module to load then they can comment out
the module line in the /etc/rdma/modules files or use the usual module
blacklist scheme.
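
As a rough illustration of the file handling described above, the sketch below reads a per-driver list and prints the modprobe commands it would run. It is not the loader used on the branch; the file format (one module name per line, `#` for comments) and the function names are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of rdma-core.
# Assumed file format: one module name per line, '#' starts a comment.

# Print the module names found in the list file $1, one per line.
list_rdma_modules() {
    conf=$1
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # Drop comments, then keep the first word of every non-empty line.
    sed -e 's/#.*$//' "$conf" | awk 'NF { print $1 }'
}

# Dry run: print the modprobe command for each listed module.
# Fails if the list file is missing.
show_modprobe_commands() {
    mods=$(list_rdma_modules "$1") || return 1
    for mod in $mods; do
        echo "modprobe $mod"
    done
}
```

Commenting out a line in the list file simply removes it from the output, which is the behaviour described for users who do not want a module loaded.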

The interesting thing about this is how the boot ordering works, since
rdma-load-modules@.service is a 'before' sysinit.target it will run
before normal system startup tasks if the unit is scheduled to run
early enough. I expect that boot time module loads triggered by udev
will be early enough..

This makes it much closer to normal module loading, instead of being
so strange and should eliminate much of the need for explicit
rdma.service dependencies. The places that need something special,
like srp, can directly depend on their specific module loader:

Requires=rdma-load-modules@srp_daemon

Again, if srp_daemon is scheduled to start during system boot this
should result in the modules being loaded together with the rest of
the modules before sysinit.target.

The one nit that I don't have an easy solution for is starting modules
depending on the device technology, eg not starting srp or ipoib on
!IB devices. The issue is that I can't think of an easy way to detect
the device technology from the udev rule, at least not without a
helper script or additional kernel sysfs..
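
A helper script of the kind mentioned above could look roughly like this. It assumes the per-port `link_layer` attribute under /sys/class/infiniband, which reports `InfiniBand` or `Ethernet`; the function name and the sysfs-root parameter are made up for illustration, and wiring it into a udev rule is left out.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any port of RDMA device $2 reports
# link layer "InfiniBand".
# $1 = sysfs root (normally /sys), a parameter so it can be tried on a
#      fake directory tree.
has_ib_port() {
    sysroot=$1
    dev=$2
    for f in "$sysroot/class/infiniband/$dev"/ports/*/link_layer; do
        # Skip the unexpanded glob pattern when no port exists.
        [ -r "$f" ] || continue
        [ "$(cat "$f")" = "InfiniBand" ] && return 0
    done
    return 1
}
```

With something like this, srp or ipoib module loading could be skipped when `has_ib_port /sys mlx4_0` fails, at the cost of running a script from the rule.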

I haven't been able to test this yet, help would be appreciated.

Jason
Bart Van Assche July 13, 2017, 10:36 p.m. UTC | #3
On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> The one nit that I don't have an easy solution for is starting modules
> depending on the device technology, eg not starting srp or ipoib on
> !IB devices. The issue is that I can't think of an easy way to detect
> the device technology from the udev rule, at least not without a
> helper script or additional kernel sysfs..

Are users expected to modify the kernel-boot/modules/*.conf files? I
expect that only a minority of the RDMA users will want to load the ib_srpt
and ib_isert kernel modules. 

> I haven't been able to test this yet, help would be appreciated.

I will try to free up some time to test the SRP-related changes.

Bart.
Jason Gunthorpe July 13, 2017, 10:41 p.m. UTC | #4
On Thu, Jul 13, 2017 at 10:36:49PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> > The one nit that I don't have an easy solution for is starting modules
> > depending on the device technology, eg not starting srp or ipoib on
> > !IB devices. The issue is that I can't think of an easy way to detect
> > the device technology from the udev rule, at least not without a
> > helper script or additional kernel sysfs..
> 
> Are users expected to modify the kernel-boot/modules/*.conf files? I
> expect that only a minority of the RDMA users will want to load the ib_srpt
> and ib_isert kernel modules. 

I'd say in the same way they would be expected to modify the RedHat
/etc/rdma/rdma.conf file. I copied the defaults from that file, which
was to autoload target drivers.

I agree with you, they probably don't make sense to be defaulted on.
Ideally their respective supporting packages or kernel would ensure
the modules autoload if someone is working with RDMA... I know nothing
about the target infrastructure - can you see a way to do that?

Thanks,
Jason
Bart Van Assche July 13, 2017, 10:45 p.m. UTC | #5
On Thu, 2017-07-13 at 16:41 -0600, Jason Gunthorpe wrote:
> On Thu, Jul 13, 2017 at 10:36:49PM +0000, Bart Van Assche wrote:
> > On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> > > The one nit that I don't have an easy solution for is starting modules
> > > depending on the device technology, eg not starting srp or ipoib on
> > > !IB devices. The issue is that I can't think of an easy way to detect
> > > the device technology from the udev rule, at least not without a
> > > helper script or additional kernel sysfs..
> > 
> > Are users expected to modify the kernel-boot/modules/*.conf files? I
> > expect that only a minority of the RDMA users will want to load the ib_srpt
> > and ib_isert kernel modules. 
> 
> I'd say in the same way they would be expected to modify the RedHat
> /etc/rdma/rdma.conf file. I copied the defaults from that file, which
> was to autoload target drivers.
> 
> I agree with you, they probably don't make sense to be defaulted on.
> Ideally their respective supporting packages or kernel would ensure
> the modules autoload if someone is working with RDMA... I know nothing
> about the target infrastructure - can you see a way to do that?

Hello Jason,

Users who want to enable RDMA target functionality probably will know the
names of the kernel modules that provide such functionality. On Debian
systems the names of these kernel modules can e.g. be added to /etc/modules.
However, I'm not sure all distro's have an equivalent of the /etc/modules
file.

Bart.
Jason Gunthorpe July 13, 2017, 10:51 p.m. UTC | #6
On Thu, Jul 13, 2017 at 10:45:14PM +0000, Bart Van Assche wrote:

> Users who want to enable RDMA target functionality probably will know the
> names of the kernel modules that provide such functionality. On Debian
> systems the names of these kernel modules can e.g. be added to /etc/modules.
> However, I'm not sure all distro's have an equivalent of the /etc/modules
> file.

systemd supports /etc/modules-load.d/ as a built in, some distros will
symlink ../modules.conf to that directory, but these days dropping a
file in /etc/modules-load.d/ is pretty safe.
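
For the target modules discussed above, a drop-in file is just a list of module names, one per line. The sketch below writes such a file; the function name and the file name `rdma-target.conf` are placeholders, and the directory is a parameter only so the function can be tried outside /etc.

```shell
#!/bin/sh
# Write a modules-load.d style file: one kernel module name per line.
# $1 = target directory (normally /etc/modules-load.d)
# $2 = file stem; ".conf" is appended
# remaining arguments = module names
write_modules_load() {
    dir=$1
    name=$2
    shift 2
    mkdir -p "$dir" || return 1
    {
        echo "# Modules to load at boot (example file)"
        for mod in "$@"; do
            echo "$mod"
        done
    } > "$dir/$name.conf"
}
```

Run as root, `write_modules_load /etc/modules-load.d rdma-target ib_srpt ib_isert` would request both target modules at boot.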

Jason
Benjamin Drung July 14, 2017, 10:08 a.m. UTC | #7
Hi Jason,

On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> On Thu, Jul 13, 2017 at 06:33:07PM +0000, Bart Van Assche wrote:
> > On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> > > Here is an untested attempt to move module loading into
> > > srp_daemon -
> > > what do you think Bart?
> > 
> > Hello Jason,
> > 
> > Thanks for the patch. To me this looks like a nice simplification
> > compared
> > to the current approach. Do you want me to test this patch?
> 
> I did some more steps..
> 
> Benjamin/Bart let me know what you think, I'll send the patches to
> the
> list if you like the idea.
> 
> https://github.com/jgunthorpe/rdma-plumbing/tree/systemd
> 
> This replaces Benjamin's rdma.conf-based approach.
> 
> The basic approach is to directly use systemd to load the modules and
> determine what modules to load by using udev and explicit Requires in
> the units.

I like your approach very much. It's cleaner and simpler. So let's
ditch my approach and use yours.

I tested your branch on a Debian 8 (jessie) system with a mlx4 card and
found several points to fix/improve/discuss:

1. The rdma-load-modules@ service needs to set DefaultDependencies=no.
Otherwise it will end up in a dependency loop:

Found ordering cycle on rdma-load-modules@mlx4.service/start
Found dependency on basic.target/start
Found dependency on paths.target/start
Found dependency on acpid.path/start
Found dependency on sysinit.target/start
Found dependency on rdma-load-modules@mlx4.service/start
Breaking ordering cycle by deleting job paths.target/start
Found ordering cycle on rdma-load-modules@mlx4.service/start
Found dependency on paths.target/start
Found dependency on acpid.path/start
Found dependency on sysinit.target/start
Found dependency on rdma-load-modules@mlx4.service/start
Starting Load RDMA modules from /etc/rdma/modules/mlx4.conf...

2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
instead of listing each .conf file individually.

3. Default conf files for opa.conf and roce.conf are missing.

4. Should rdma-load-modules@ *not* fail if the corresponding .conf is
missing?

5. How to handle built-in modules correctly? Our kernel has the i40e
module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
be started, but the system does not have an i40e card and thus I don't
want to have the module started.
Benjamin Drung July 14, 2017, 10:18 a.m. UTC | #8
On Fri, 2017-07-14 at 12:08 +0200, Benjamin Drung wrote:
> Hi Jason,
> 
> On Thu, 2017-07-13 at 16:15 -0600, Jason Gunthorpe wrote:
> > On Thu, Jul 13, 2017 at 06:33:07PM +0000, Bart Van Assche wrote:
> > > On Thu, 2017-07-13 at 12:20 -0600, Jason Gunthorpe wrote:
> > > > Here is an untested attempt to move module loading into
> > > > srp_daemon -
> > > > what do you think Bart?
> > > 
> > > Hello Jason,
> > > 
> > > Thanks for the patch. To me this looks like a nice simplification
> > > compared
> > > to the current approach. Do you want me to test this patch?
> > 
> > I did some more steps..
> > 
> > Benjamin/Bart let me know what you think, I'll send the patches to
> > the
> > list if you like the idea.
> > 
> > https://github.com/jgunthorpe/rdma-plumbing/tree/systemd
> > 
> > This replaces Benjamin's rdma.conf-based approach.
> > 
> > The basic approach is to directly use systemd to load the modules
> > and
> > determine what modules to load by using udev and explicit Requires
> > in
> > the units.
> 
> I like your approach very much. It's cleaner and simpler. So let's
> ditch my approach and use yours.
> 
> I tested your branch on a Debian 8 (jessie) system with a mlx4 card
> and
> found several points to fix/improve/discuss:
> 
> 1. The rdma-load-modules@ service needs to set
> DefaultDependencies=no.
> Otherwise it will end up in a dependency loop:
> 
> Found ordering cycle on rdma-load-modules@mlx4.service/start
> Found dependency on basic.target/start
> Found dependency on paths.target/start
> Found dependency on acpid.path/start
> Found dependency on sysinit.target/start
> Found dependency on rdma-load-modules@mlx4.service/start
> Breaking ordering cycle by deleting job paths.target/start
> Found ordering cycle on rdma-load-modules@mlx4.service/start
> Found dependency on paths.target/start
> Found dependency on acpid.path/start
> Found dependency on sysinit.target/start
> Found dependency on rdma-load-modules@mlx4.service/start
> Starting Load RDMA modules from /etc/rdma/modules/mlx4.conf...
> 
> 2. You can just specify "etc/rdma/modules" in debian/rdma-
> core.install
> instead of listing each .conf file individually.
> 
> 3. Default conf files for opa.conf and roce.conf are missing.
> 
> 4. Should rdma-load-modules@ *not* fail if the corresponding .conf is
> missing?
> 
> 5. How to handle built-in modules correctly? Our kernel has the i40e
> module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
> be started, but the system does not have an i40e card and thus I don't
> want to have the module started.

Forgot one point:

6. The ipoib module (loaded by rdma-load-modules@infiniband) needs to
be loaded before the networking.service is running. The networking.service
brings up the network devices on Debian. It runs "ifup -a" which reads
/etc/network/interfaces which we use to configure our ipoib devices.
Here is the networking.service definition:

bdrung@server:~$ systemctl cat networking.service
# /run/systemd/generator.late/networking.service
# Automatically generated by systemd-sysv-generator

[Unit]
SourcePath=/etc/init.d/networking
Description=LSB: Raise network interfaces.
DefaultDependencies=no
Before=sysinit.target shutdown.target
After=mountkernfs.service local-fs.target urandom.service
Conflicts=shutdown.target

[Service]
Type=forking
Restart=no
TimeoutSec=0
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
SysVStartPriority=13
ExecStart=/etc/init.d/networking start
ExecStop=/etc/init.d/networking stop
ExecReload=/etc/init.d/networking reload

# /run/systemd/generator/networking.service.d/50-insserv.conf-$network.conf
# Automatically generated by systemd-insserv-generator

[Unit]
Wants=network.target
Before=network.target

# /lib/systemd/system/networking.service.d/network-pre.conf
[Unit]
After=network-pre.target systemd-sysctl.service systemd-modules-load.service
Jason Gunthorpe July 14, 2017, 3:55 p.m. UTC | #9
On Fri, Jul 14, 2017 at 12:08:41PM +0200, Benjamin Drung wrote:
> I tested your branch on a Debian 8 (jessie) system with a mlx4 card and
> found several points to fix/improve/discuss:

Thanks!

> 1. The rdma-load-modules@ service needs to set DefaultDependencies=no.
> Otherwise it will end up in a dependency loop:

Makes sense

> 2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
> instead of listing each .conf file individually.

The srp_daemon.conf is in the srp package, so I don't think I can do
that?

> 3. Default conf files for opa.conf and roce.conf are missing.

Right.. Not sure if we should split them like this or not, depends if
we can detect the card type at runtime.

> 4. Should rdma-load-modules@ *not* fail if the corresponding .conf is
> missing?

Does it fail now? Failing seems like the right thing to do for a
missing conf file.

> 5. How to handle built-in modules correctly? Our kernel has the i40e
> module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
> be started, but the system does not have an i40e card and thus I don't
> want to have the module started.

Fixing this would require more fancy udev wonkery - I copied RH's
tested approach which triggers on driver presence, not on driver
binding.

My udev is not great, but something like this:

DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"

Might work better? I think that triggers on driver bind? Could you try
to switch your mlx and i40e drivers in that way?

> 6. The ipoib module (loaded by rdma-load-modules@infiniband) needs to
> be loaded before the networking.service is running. The networking.service
> brings up the network devices on Debian. It runs "ifup -a" which reads

Hum. That LSB networking.service sure is an ugly hack, it doesn't
support hotplug so it has this:

  After=network-pre.target systemd-sysctl.service systemd-modules-load.service

To 'try' and run after some amount of hot plugging is done. IMHO this
is done wrong, it should start after sysinit.target but before
network-online.target or something... 

The only solution to this kind of problem is to add more ordering,
Debian should include a patch to rdma-load-modules@ to put it before
their unique networking.service..

Jason
Benjamin Drung July 14, 2017, 4:23 p.m. UTC | #10
On Fri, 2017-07-14 at 09:55 -0600, Jason Gunthorpe wrote:
> > 2. You can just specify "etc/rdma/modules" in debian/rdma-core.install
> > instead of listing each .conf file individually.
> 
> The srp_daemon.conf is in the srp package, so I don't think I can do
> that?

Yes. You are right. So it needs to be listed individually.

> > 4. Should rdma-load-modules@ *not* fail if the corresponding .conf
> > is
> > missing?
> 
> Does it fail now? Failing seems like the right thing to do for a
> missing conf file.

Yes. Currently it fails in this situation.

> > 5. How to handle built-in modules correctly? Our kernel has the i40e
> > module built in (CONFIG_I40E=y) and rdma-load-modules@i40e.service will
> > be started, but the system does not have an i40e card and thus I don't
> > want to have the module started.
> 
> Fixing this would require more fancy udev wonkery - I copied RH's
> tested approach which triggers on driver presence, not on driver
> binding.
> 
> My udev is not great, but something like this:
> 
> DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd",
> ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> 
> Might work better? I think that triggers on driver bind? Could you
> try
> to switch your mlx and i40e drivers in that way?

I tried it. It works as expected. rdma-load-modules@mlx4 is loaded and
rdma-load-modules@i40e.service is not loaded. I have not tested whether the
udev trigger also works for built-in modules that are in use.

> > 6. The ipoib module (loaded by rdma-load-modules@infiniband) needs to
> > be loaded before the networking.service is running. The networking.service
> > brings up the network devices on Debian. It runs "ifup -a" which reads
> 
> Hum. That LSB networking.service sure is an ugly hack, it doesn't
> support hotplug so it has this:
> 
>   After=network-pre.target systemd-sysctl.service systemd-modules-load.service
> 
> To 'try' and run after some amount of hot plugging is done. IMHO this
> is done wrong, it should start after sysinit.target but before
> network-online.target or something... 
> 
> The only solution to this kind of problem is to add more ordering,
> Debian should include a patch to rdma-load-modules@ to put it before
> their unique networking.service..

Or patch rdma-load-modules@ to put it before network-pre.target
Jason Gunthorpe July 14, 2017, 4:40 p.m. UTC | #11
On Fri, Jul 14, 2017 at 06:23:13PM +0200, Benjamin Drung wrote:

> > My udev is not great, but something like this:
> > 
> > DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> > 
> > Might work better? I think that triggers on driver bind? Could you
> > try to switch your mlx and i40e drivers in that way?
> 
> I tried it. It works as expected. rdma-load-modules@mlx4 is loaded and 
> rdma-load-modules@i40e.service is not loaded. Not tested if the udev
> trigger will also work for used built-in modules.

Okay, I will change it to be like that and we can see what people
think.

> > The only solution to this kind of problem is to add more ordering,
> > Debian should include a patch to rdma-load-modules@ to put it before
> > their unique networking.service..
> 
> Or patch rdma-load-modules@ to put it before network-pre.target

I think that would be OK.
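
Putting the points from this exchange together, the template unit might end up looking roughly like the text printed below. This is a sketch, not the unit from the branch: the `Description` string is taken from the boot log quoted earlier in the thread, while the `ExecStart` line (using `systemd-modules-load` with an explicit file argument, at a path that varies by distro) is an assumption.

```shell
#!/bin/sh
# Print a sketch of rdma-load-modules@.service to stdout.
# The ExecStart helper and its path are assumptions; adjust per distro.
print_rdma_load_modules_unit() {
    cat <<'EOF'
[Unit]
Description=Load RDMA modules from /etc/rdma/modules/%I.conf
# Without this, the service gets an implicit After= on sysinit.target,
# which conflicts with Before=sysinit.target and produces the cycle.
DefaultDependencies=no
Before=sysinit.target
# Pull in network-pre.target and order ahead of it, so that services
# ordered after that target start once the modules are loaded.
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/lib/systemd/systemd-modules-load /etc/rdma/modules/%I.conf
EOF
}
```

Redirecting the output to a file named `rdma-load-modules@.service` in a unit directory would make `rdma-load-modules@mlx4.service` read /etc/rdma/modules/mlx4.conf.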

Jason
Benjamin Drung July 20, 2017, 9:04 a.m. UTC | #12
On Fri, 2017-07-14 at 10:40 -0600, Jason Gunthorpe wrote:
> On Fri, Jul 14, 2017 at 06:23:13PM +0200, Benjamin Drung wrote:
> 
> > > My udev is not great, but something like this:
> > > 
> > > DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd",
> > > ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> > > 
> > > Might work better? I think that triggers on driver bind? Could
> > > you
> > > try to switch your mlx and i40e drivers in that way?
> > 
> > I tried it. It works as expected. rdma-load-modules@mlx4 is loaded
> > and 
> > rdma-load-modules@i40e.service is not loaded. Not tested if the
> > udev
> > trigger will also work for used built-in modules.
> 
> Okay, I will change it to be like that and we can see what people
> think.

Any updates?
Jason Gunthorpe July 21, 2017, 2:45 a.m. UTC | #13
On Thu, Jul 20, 2017 at 11:04:10AM +0200, Benjamin Drung wrote:
> On Fri, 2017-07-14 at 10:40 -0600, Jason Gunthorpe wrote:
> > On Fri, Jul 14, 2017 at 06:23:13PM +0200, Benjamin Drung wrote:
> > 
> > > > My udev is not great, but something like this:
> > > > 
> > > > DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd",
> > > > ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"
> > > > 
> > > > Might work better? I think that triggers on driver bind? Could
> > > > you
> > > > try to switch your mlx and i40e drivers in that way?
> > > 
> > > > I tried it. It works as expected. rdma-load-modules@mlx4 is loaded
> > > > and
> > > rdma-load-modules@i40e.service is not loaded. Not tested if the
> > > udev
> > > trigger will also work for used built-in modules.
> > 
> > Okay, I will change it to be like that and we can see what people
> > think.
> 
> Any updates?

I've been doing some more testing here and it does not work entirely
properly yet.

The change to 'DRIVER==' causes things to fail, the kernel only
generates uevents when kobjects are created, and binding a driver to a
PCI device does not create a kobject. This means these lines:

DRIVER=="mlx4_core", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}="rdma-load-modules@mlx4"

Do not work reliably. Depending on the module like before would still
seem to be the best option.

Also, the ordering before sysinit.target does not work reliably: udev
loads modules asynchronously and systemd does not block on udev unless
systemd-udev-settle.service is depended on.

I'm not sure how Debian's legacy networking.service manages to work at
all, perhaps it is broken and it is just unlikely to be hit because of
how quickly most ethernet drivers get loaded by udev.

As far as I can tell, it needs to depend on systemd-udev-settle.service to
guarantee that all the cold-plug network drivers have been loaded by
udev before starting..

.. all of these comments would seem to apply equally to the
rdma.service approach, so I think this is still an improvement.

Jason
Benjamin Drung May 15, 2018, 2:47 p.m. UTC | #14
Hi,

I have a Debian 9 (stretch) system with a backported rdma-core 17.0-1
package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel
modules) and the following network configuration in
/etc/network/interfaces:

```
auto ib0.dddd
iface ib0.dddd inet6 static
    address fd44:1:5255::
    netmask 64
    pre-up echo connected > /sys/class/net/$IFACE/mode
    dad-attempts 600

auto ib1.dddd
iface ib1.dddd inet6 static
    address fd44:2:5255::
    netmask 64
    pre-up echo connected > /sys/class/net/$IFACE/mode
    dad-attempts 600
```

The terminal shows the following ordering:

```
[FAILED] Failed to start Raise network interfaces.
[  OK  ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf
[  OK  ] Started Load RDMA modules from /etc/rdma/modules/infiniband.conf
[  OK  ] Reached target RDMA Hardware.
```

The networking.service fails with:
```
$ journalctl --no-host -u networking.service
[...]
Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib0.dddd/mode: Directory nonexistent
Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd
Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib1.dddd/mode: Directory nonexistent
Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd
Mai 15 13:16:40 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Mai 15 13:16:40 systemd[1]: Failed to start Raise network interfaces.
Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed state.
Mai 15 13:16:40 systemd[1]: networking.service: Failed with result 'exit-code'.

```

The networking.service fails because it tries to bring up
ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service loads
the ib_ipoib kernel module. networking.service declares that it should
run after the network-pre.target and rdma-load-modules@infiniband.service
declares to run before network-pre.target. Therefore the order
should be rdma-load-modules@infiniband.service -> network-pre.target ->
networking.service, but this is obviously not the case.
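
One way to confirm the ordering from a saved boot log is to compare where the two messages first appear. The helper below is only a sketch for that check; its name and interface are invented here, and it looks at log text rather than at systemd's dependency graph.

```shell
#!/bin/sh
# Succeed if the first stdin line containing $1 comes before the first
# line containing $2. Both arguments are plain substrings, not regexes.
appears_before() {
    awk -v a="$1" -v b="$2" '
        !ia && index($0, a) { ia = NR }
        !ib && index($0, b) { ib = NR }
        END { exit !(ia && ib && ia < ib) }
    '
}
```

Fed the console output above, `appears_before 'infiniband.conf' 'Raise network interfaces'` fails, which matches the observation that the interfaces were raised first.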

I am writing to this mailing list, because I got stuck with debugging
this issue and need your help.
Jason Gunthorpe May 15, 2018, 2:58 p.m. UTC | #15
On Tue, May 15, 2018 at 04:47:22PM +0200, Benjamin Drung wrote:
> Hi,
> 
> I have a Debian 9 (stretch) system with a backported rdma-core 17.0-1
> package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel
> modules) and following network configuration in
> /etc/network/interfaces:
> 
> ```
> auto ib0.dddd
> iface ib0.dddd inet6 static
>     address fd44:1:5255::
>     netmask 64
>     pre-up echo connected > /sys/class/net/$IFACE/mode
>     dad-attempts 600
> 
> auto ib1.dddd
> iface ib1.dddd inet6 static
>     address fd44:2:5255::
>     netmask 64
>     pre-up echo connected > /sys/class/net/$IFACE/mode
>     dad-attempts 600
> ```
> 
> The terminal shows following ordering:
> 
> ```
> [FAILED] Failed to start Raise network interfaces.
> [  OK  ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf
> [  OK  ] Started Load RDMA modules from /etc/rdma/modules/infiniband.conf
> [  OK  ] Reached target RDMA Hardware.
> ```
> 
> the networking.service fails with:
> ```
> $ journalctl --no-host -u networking.service
> [...]
> Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib0.dddd/mode: Directory nonexistent
> Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd
> Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create /sys/class/net/ib1.dddd/mode: Directory nonexistent
> Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd
> Mai 15 13:16:40 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
> Mai 15 13:16:40 systemd[1]: Failed to start Raise network interfaces.
> Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed state.
> Mai 15 13:16:40 systemd[1]: networking.service: Failed with result 'exit-code'.
> 
> ```
> 
> The networking.service fails because it tries to bring up
> ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service loads
> the ib_ipoib kernel module. networking.service declares that it should
> run after the network-pre.target and rdma-load-modules@infiniband.service
> declares to run before network-pre.target. Therefore the order
> should be rdma-load-modules@infiniband.service -> network-pre.target ->
> networking.service, but this is obviously not the case.
> 
> I am writing to this mailing list, because got stuck with debugging
> this issue and need your help.

The udev.md explains this:

 ## Interaction with legacy non-hotplug services

 Services that cannot handle hot plug must be ordered after
 systemd-udev-settle.service, which will wait for udev to complete loading
 modules and scheduling systemd services. This ensures that all RDMA hardware
 present at boot is setup before proceeding to run the legacy service.

 Admins using legacy services can also place their RDMA hardware modules
 (e.g.  mlx4_ib) directly in /etc/modules-load.d/ or in their initrd which will
 cause systemd to defer passing to sysinit.target until all RDMA hardware is
 setup, this is usually sufficient for legacy services. This is probably the
 default behavior in many configurations.

Since you see the backwards ordering and the errors it means that
ifupdown in stretch does not support hotplug. IMHO it is a bug in that
package that it doesn't order after settle to try and avoid boot time
hot plug events that it cannot handle.
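
Until the package does this itself, the ordering described in udev.md can be added locally with a drop-in for networking.service. The sketch below writes one; the drop-in file name `udev-settle.conf` and the function name are arbitrary choices, and the directory is a parameter so the function can be exercised outside /etc.

```shell
#!/bin/sh
# Write a drop-in that orders a legacy unit after udev has settled.
# $1 = systemd unit directory (normally /etc/systemd/system)
# $2 = unit to delay (for example networking.service)
write_settle_dropin() {
    dir="$1/$2.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/udev-settle.conf" <<'EOF'
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF
}
```

As root, `write_settle_dropin /etc/systemd/system networking.service` followed by `systemctl daemon-reload` would apply it on the next boot.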

The modules solution is simplest, add ipoib and HCA drivers to
modules.conf

The robust and future looking solution is to use systemd-networkd
instead of legacy ifupdown...

It is a bit annoying today to get the connected setting though.

Jason
Benjamin Drung May 15, 2018, 4:10 p.m. UTC | #16
On Tuesday, 15.05.2018 at 08:58 -0600, Jason Gunthorpe wrote:
> On Tue, May 15, 2018 at 04:47:22PM +0200, Benjamin Drung wrote:
> > Hi,
> > 
> > I have a Debian 9 (stretch) system with a backported rdma-core
> > 17.0-1
> > package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel
> > modules) and following network configuration in
> > /etc/network/interfaces:
> > 
> > ```
> > auto ib0.dddd
> > iface ib0.dddd inet6 static
> >     address fd44:1:5255::
> >     netmask 64
> >     pre-up echo connected > /sys/class/net/$IFACE/mode
> >     dad-attempts 600
> > 
> > auto ib1.dddd
> > iface ib1.dddd inet6 static
> >     address fd44:2:5255::
> >     netmask 64
> >     pre-up echo connected > /sys/class/net/$IFACE/mode
> >     dad-attempts 600
> > ```
> > 
> > The terminal shows following ordering:
> > 
> > ```
> > [FAILED] Failed to start Raise network interfaces.
> > [  OK  ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf
> > [  OK  ] Started Load RDMA modules from
> > /etc/rdma/modules/infiniband.conf
> > [  OK  ] Reached target RDMA Hardware.
> > ```
> > 
> > the networking.service fails with:
> > ```
> > $ journalctl --no-host -u networking.service
> > [...]
> > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create
> > /sys/class/net/ib0.dddd/mode: Directory nonexistent
> > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd
> > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create
> > /sys/class/net/ib1.dddd/mode: Directory nonexistent
> > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd
> > Mai 15 13:16:40 systemd[1]: networking.service: Main process
> > exited, code=exited, status=1/FAILURE
> > Mai 15 13:16:40 systemd[1]: Failed to start Raise network
> > interfaces.
> > Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed
> > state.
> > Mai 15 13:16:40 systemd[1]: networking.service: Failed with result
> > 'exit-code'.
> > 
> > ```
> > 
> > The networking.service fails because it tries to bring up
> > ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service
> > loads the ib_ipoib kernel module. networking.service declares that it
> > should run after the network-pre.target and
> > rdma-load-modules@infiniband.service declares to run before
> > network-pre.target. Therefore the order should be
> > rdma-load-modules@infiniband.service -> network-pre.target ->
> > networking.service, but this is obviously not the case.
> > 
> > I am writing to this mailing list, because got stuck with debugging
> > this issue and need your help.
> 
> The udev.md explains this:
> 
>  ## Interaction with legacy non-hotplug services
> 
>  Services that cannot handle hot plug must be ordered after
>  systemd-udev-settle.service, which will wait for udev to complete
> loading
>  modules and scheduling systemd services. This ensures that all RDMA
> hardware
>  present at boot is setup before proceeding to run the legacy
> service.
> 
>  Admins using legacy services can also place their RDMA hardware
> modules
>  (e.g.  mlx4_ib) directly in /etc/modules-load.d/ or in their initrd
> which will
>  cause systemd to defer passing to sysinit.target until all RDMA
> hardware is
>  setup, this is usually sufficient for legacy services. This is
> probably the
>  default behavior in many configurations.
> 
> Since you see the backwards ordering and the errors it meands that
> ifupdown in stretch does not support hotplug. IMHO it is a bug in
> that
> package that it doesn't order after settle to try and avoid boot time
> hot plug events that it cannot handle.
> 
> The modules solution is simplest, add ipoib and HCA drivers to
> modules.conf

I added the systemd-udev-settle.service dependency:

```
$ systemctl cat networking.service 
# /lib/systemd/system/networking.service
[Unit]
Description=Raise network interfaces
Documentation=man:interfaces(5)
DefaultDependencies=no
Wants=network.target
After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
Before=network.target shutdown.target network-online.target
Conflicts=shutdown.target

[Install]
WantedBy=multi-user.target
WantedBy=network-online.target

[Service]
Type=oneshot
EnvironmentFile=-/etc/default/networking
ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
ExecStart=/sbin/ifup -a --read-environment
ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
RemainAfterExit=true
TimeoutStartSec=5min

# /etc/systemd/system/networking.service.d/rdma.conf
[Unit]
# See https://marc.info/?l=linux-rdma&m=152639629213650&w=2
After=systemd-udev-settle.service
```

but it is still not working (same error messages).
Jason Gunthorpe May 15, 2018, 5:20 p.m. UTC | #17
On Tue, May 15, 2018 at 06:10:54PM +0200, Benjamin Drung wrote:
> I added the systemd-udev-settle.service dependency:
> 
> ```
> $ systemctl cat networking.service 
> # /lib/systemd/system/networking.service
> [Unit]
> Description=Raise network interfaces
> Documentation=man:interfaces(5)
> DefaultDependencies=no
> Wants=network.target
> After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> Before=network.target shutdown.target network-online.target
> Conflicts=shutdown.target
> 
> [Install]
> WantedBy=multi-user.target
> WantedBy=network-online.target
> 
> [Service]
> Type=oneshot
> EnvironmentFile=-/etc/default/networking
> ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
> ExecStart=/sbin/ifup -a --read-environment
> ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
> RemainAfterExit=true
> TimeoutStartSec=5min
> 
> # /etc/systemd/system/networking.service.d/rdma.conf
> [Unit]
> # See https://marc.info/?l=linux-rdma&m=152639629213650&w=2
> After=systemd-udev-settle.service
> ```
> 
> but it is still not working (same error messages).

I think it needs a Wants as well ?
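
For reference, a sketch of a drop-in carrying both directives. After= only
orders units that are already part of the transaction, while Wants= is what
pulls systemd-udev-settle.service in. The helper name write_settle_dropin is
made up for illustration, and whether this resolves the failure here is not
established.

```shell
#!/bin/sh
# Sketch only: create a drop-in that both pulls in and orders after
# systemd-udev-settle.service. The target directory is a parameter so the
# helper can be tried against a scratch directory.
write_settle_dropin() {
    dir=$1   # e.g. /etc/systemd/system/networking.service.d
    mkdir -p "$dir" || return 1
    cat > "$dir/rdma.conf" <<'EOF'
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
EOF
}

# Example (as root), followed by a reload so systemd picks it up:
#   write_settle_dropin /etc/systemd/system/networking.service.d
#   systemctl daemon-reload
```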

Jason
Doug Ledford May 15, 2018, 6:15 p.m. UTC | #18
On Tue, 2018-05-15 at 18:10 +0200, Benjamin Drung wrote:
> On Tuesday, 15.05.2018 at 08:58 -0600, Jason Gunthorpe wrote:
> > On Tue, May 15, 2018 at 04:47:22PM +0200, Benjamin Drung wrote:
> > > Hi,
> > > 
> > > I have a Debian 9 (stretch) system with a backported rdma-core
> > > 17.0-1
> > > package. The system has a mlx4 card (mlx4_ib and mlx4_core kernel
> > > modules) and following network configuration in
> > > /etc/network/interfaces:
> > > 
> > > ```
> > > auto ib0.dddd
> > > iface ib0.dddd inet6 static
> > >     address fd44:1:5255::
> > >     netmask 64
> > >     pre-up echo connected > /sys/class/net/$IFACE/mode
> > >     dad-attempts 600
> > > 
> > > auto ib1.dddd
> > > iface ib1.dddd inet6 static
> > >     address fd44:2:5255::
> > >     netmask 64
> > >     pre-up echo connected > /sys/class/net/$IFACE/mode
> > >     dad-attempts 600
> > > ```
> > > 
> > > The terminal shows following ordering:
> > > 
> > > ```
> > > [FAILED] Failed to start Raise network interfaces.
> > > [  OK  ] Started Load RDMA modules from /etc/rdma/modules/rdma.conf
> > > [  OK  ] Started Load RDMA modules from
> > > /etc/rdma/modules/infiniband.conf
> > > [  OK  ] Reached target RDMA Hardware.
> > > ```
> > > 
> > > the networking.service fails with:
> > > ```
> > > $ journalctl --no-host -u networking.service
> > > [...]
> > > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create
> > > /sys/class/net/ib0.dddd/mode: Directory nonexistent
> > > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib0.dddd
> > > Mai 15 13:16:40 ifup[1645]: /bin/sh: 1: cannot create
> > > /sys/class/net/ib1.dddd/mode: Directory nonexistent
> > > Mai 15 13:16:40 ifup[1645]: ifup: failed to bring up ib1.dddd
> > > Mai 15 13:16:40 systemd[1]: networking.service: Main process
> > > exited, code=exited, status=1/FAILURE
> > > Mai 15 13:16:40 systemd[1]: Failed to start Raise network
> > > interfaces.
> > > Mai 15 13:16:40 systemd[1]: networking.service: Unit entered failed
> > > state.
> > > Mai 15 13:16:40 systemd[1]: networking.service: Failed with result
> > > 'exit-code'.
> > > 
> > > ```
> > > 
> > > The networking.service fails because it tries to bring up
> > > ib0.dddd/ib1.dddd before the rdma-load-modules@infiniband.service
> > > loads the ib_ipoib kernel module. networking.service declares that it
> > > should run after the network-pre.target and
> > > rdma-load-modules@infiniband.service declares to run before
> > > network-pre.target. Therefore the order should be
> > > rdma-load-modules@infiniband.service -> network-pre.target ->
> > > networking.service, but this is obviously not the case.
> > > 
> > > I am writing to this mailing list, because got stuck with debugging
> > > this issue and need your help.
> > 
> > The udev.md explains this:
> > 
> >  ## Interaction with legacy non-hotplug services
> > 
> >  Services that cannot handle hot plug must be ordered after
> >  systemd-udev-settle.service, which will wait for udev to complete
> > loading
> >  modules and scheduling systemd services. This ensures that all RDMA
> > hardware
> >  present at boot is setup before proceeding to run the legacy
> > service.
> > 
> >  Admins using legacy services can also place their RDMA hardware
> > modules
> >  (e.g.  mlx4_ib) directly in /etc/modules-load.d/ or in their initrd
> > which will
> >  cause systemd to defer passing to sysinit.target until all RDMA
> > hardware is
> >  setup, this is usually sufficient for legacy services. This is
> > probably the
> >  default behavior in many configurations.
> > 
> > Since you see the backwards ordering and the errors it meands that
> > ifupdown in stretch does not support hotplug. IMHO it is a bug in
> > that
> > package that it doesn't order after settle to try and avoid boot time
> > hot plug events that it cannot handle.
> > 
> > The modules solution is simplest, add ipoib and HCA drivers to
> > modules.conf
> 
> I added the systemd-udev-settle.service dependency:
> 
> ```
> $ systemctl cat networking.service 
> # /lib/systemd/system/networking.service
> [Unit]
> Description=Raise network interfaces
> Documentation=man:interfaces(5)
> DefaultDependencies=no
> Wants=network.target
> After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> Before=network.target shutdown.target network-online.target
> Conflicts=shutdown.target
> 
> [Install]
> WantedBy=multi-user.target
> WantedBy=network-online.target
> 
> [Service]
> Type=oneshot
> EnvironmentFile=-/etc/default/networking
> ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'

I wouldn't trust that you can run udevadm settle here and get the right
results.  This will only wait for the current udev hotplug events to
complete.  It won't necessarily wait for any unstarted hotplug events.
I think you need to change the After line above to include
systemd-udev-settle.service directly.

> ExecStart=/sbin/ifup -a --read-environment
> ExecStop=/sbin/ifdown -a --read-environment --exclude=lo
> RemainAfterExit=true
> TimeoutStartSec=5min
> 
> # /etc/systemd/system/networking.service.d/rdma.conf
> [Unit]
> # See https://marc.info/?l=linux-rdma&m=152639629213650&w=2
> After=systemd-udev-settle.service
> ```
> 
> but it is still not working (same error messages).
>
Jason Gunthorpe May 15, 2018, 7:20 p.m. UTC | #19
On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > I added the systemd-udev-settle.service dependency:
> > 
> > ```
> > $ systemctl cat networking.service 
> > # /lib/systemd/system/networking.service
> > [Unit]
> > Description=Raise network interfaces
> > Documentation=man:interfaces(5)
> > DefaultDependencies=no
> > Wants=network.target
> > After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
> > Before=network.target shutdown.target network-online.target
> > Conflicts=shutdown.target
> > 
> > [Install]
> > WantedBy=multi-user.target
> > WantedBy=network-online.target
> > 
> > [Service]
> > Type=oneshot
> > EnvironmentFile=-/etc/default/networking
> > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle'
> 
> I wouldn't trust that you can run udevadm settle here and get the right
> results.  This will only wait for the current udev hotplug events to
> complete.

Oh, neat, so udev settle is already called by Debian's
networking.service (as it should be) - assuming CONFIGURE_INTERFACES
is set, and whatever that other stuff does (Ben is this triggering for you?)

If this is already happening, then this probably means you have it right
and the udev hotplug cycle with RDMA is even too async for 'udevadm
settle'??

Is it because we launch the module loads from systemd? Presumably if
it was internal it wouldn't break the hotplug cycle?

If this is the case we might have to replace the systemd based loader
with some kind of udev builtin loader :\

> It won't necessarily wait for any unstarted hotplug events.
> I think you need to change the After line above to include
> systemd-udev-settle.service directly.

It is OK to have multiple After= lines; systemd adds to the list in this
case. This is how the drop-in files are supposed to work.

But the whole thing isn't needed if the 'built-in' settle of
networking.service can be used.

Jason
Benjamin Drung May 17, 2018, 5:02 p.m. UTC | #20
On Tuesday, 15.05.2018 at 13:20 -0600, Jason Gunthorpe wrote:
> On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > > I added the systemd-udev-settle.service dependency:
> > > 
> > > ```
> > > $ systemctl cat networking.service 
> > > # /lib/systemd/system/networking.service
> > > [Unit]
> > > Description=Raise network interfaces
> > > Documentation=man:interfaces(5)
> > > DefaultDependencies=no
> > > Wants=network.target
> > > After=local-fs.target network-pre.target apparmor.service
> > > systemd-sysctl.service systemd-modules-load.service
> > > Before=network.target shutdown.target network-online.target
> > > Conflicts=shutdown.target
> > > 
> > > [Install]
> > > WantedBy=multi-user.target
> > > WantedBy=network-online.target
> > > 
> > > [Service]
> > > Type=oneshot
> > > EnvironmentFile=-/etc/default/networking
> > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] &&
> > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] &&
> > > udevadm settle'
> > 
> > I wouldn't trust that you can run udevadm settle here and get the
> > right
> > results.  This will only wait for the current udev hotplug events
> > to
> > complete.
> 
> Oh, neat, so udev settle is already called by Debian's
> networking.service (as it should be) - assuming CONFIGURE_INTERFACES
> is set, and whatever that other stuff does (Ben is this triggering
> for you?)

I should have looked more closely at the service file (I didn't notice
the udevadm settle in there). CONFIGURE_INTERFACES is not set in
/etc/default/networking and ifquery returns a bunch of interfaces.
Therefore 'udevadm settle' is executed.

I tried to debug it further by injecting commands into the pre-up hook.
When pre-up runs:

* lsmod shows that ib_ipoib is loaded
* 'ls -l /sys/class/net/' shows that neither ib0 nor ib1 is present

To me it looks like a race condition between populating
/sys/class/net/ibX after loading ib_ipoib and the networking service.
Do you have a suggestion how to address this? We are using Mellanox
OFED on the affected hosts. The mainline ipoib is not affected. Are there
commits that are related to this and that we should cherry-pick?
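
A sketch of how such a race could be papered over from a pre-up hook: poll
for the interface's sysfs entry with an upper bound instead of assuming it
exists. This is only a workaround idea under the assumption that the netdev
appears shortly after ib_ipoib is loaded; the helper name wait_for_iface and
the 30 second limit are made up for illustration.

```shell
#!/bin/sh
# Sketch only: poll for <net_dir>/<iface> once per second, for at most
# <timeout> seconds. Returns 0 as soon as the entry exists, 1 on timeout.
wait_for_iface() {
    net_dir=$1   # normally /sys/class/net
    iface=$2
    timeout=$3
    waited=0
    while [ ! -e "$net_dir/$iface" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example, e.g. from a pre-up hook:
#   wait_for_iface /sys/class/net ib0 30 || echo "ib0 did not appear" >&2
```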
Jason Gunthorpe May 17, 2018, 5:16 p.m. UTC | #21
On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote:
> On Tuesday, 15.05.2018 at 13:20 -0600, Jason Gunthorpe wrote:
> > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > > > I added the systemd-udev-settle.service dependency:
> > > > 
> > > > ```
> > > > $ systemctl cat networking.service 
> > > > # /lib/systemd/system/networking.service
> > > > [Unit]
> > > > Description=Raise network interfaces
> > > > Documentation=man:interfaces(5)
> > > > DefaultDependencies=no
> > > > Wants=network.target
> > > > After=local-fs.target network-pre.target apparmor.service
> > > > systemd-sysctl.service systemd-modules-load.service
> > > > Before=network.target shutdown.target network-online.target
> > > > Conflicts=shutdown.target
> > > > 
> > > > [Install]
> > > > WantedBy=multi-user.target
> > > > WantedBy=network-online.target
> > > > 
> > > > [Service]
> > > > Type=oneshot
> > > > EnvironmentFile=-/etc/default/networking
> > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] &&
> > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] &&
> > > > udevadm settle'
> > > 
> > > I wouldn't trust that you can run udevadm settle here and get the
> > > right
> > > results.  This will only wait for the current udev hotplug events
> > > to
> > > complete.
> > 
> > Oh, neat, so udev settle is already called by Debian's
> > networking.service (as it should be) - assuming CONFIGURE_INTERFACES
> > is set, and whatever that other stuff does (Ben is this triggering
> > for you?)
> 
> I should have looked more closely at the service file (I didn't notice
> the udevadm settle in there). CONFIGURE_INTERFACES is not set in
> /etc/default/networking and ifquery returns a bunch of interfaces.
> Therefore 'udevadm settle' is executed.
> 
> I tried to debug it further by injecting commands to the pre-up hook.
> When pre-up runs:
> 
> * lsmod shows that ib_ipoib is loaded
> * 'ls -l /sys/class/net/' shows that neither ib0 and ib1 are present
> 
> To me it looks like a race condition between populating
> /sys/class/net/ibX after loading ib_ipoib and the networking
> service.

Is the RDMA device present at this point? E.g. /sys/class/infiniband?

Are any systemd-modules-load processes still running?

Are the mlx IB modules loaded?

> Do you have a suggestion how to address this? We are using Mellanox
> OFED on the affected hosts. The mainline ipoib is not affected. Are the
> commits that are related to this and that we should cherry-pick?

Oh, mainline ipoib works? Great.

I have no idea what is in Mellanox OFED, sorry..

I don't think this is anything that was fixed in mainline.

Jason
Benjamin Drung May 18, 2018, 9:26 a.m. UTC | #22
On Thursday, 17.05.2018 at 11:16 -0600, Jason Gunthorpe wrote:
> On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote:
> > On Tuesday, 15.05.2018 at 13:20 -0600, Jason Gunthorpe wrote:
> > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > > > > I added the systemd-udev-settle.service dependency:
> > > > > 
> > > > > ```
> > > > > $ systemctl cat networking.service 
> > > > > # /lib/systemd/system/networking.service
> > > > > [Unit]
> > > > > Description=Raise network interfaces
> > > > > Documentation=man:interfaces(5)
> > > > > DefaultDependencies=no
> > > > > Wants=network.target
> > > > > After=local-fs.target network-pre.target apparmor.service
> > > > > systemd-sysctl.service systemd-modules-load.service
> > > > > Before=network.target shutdown.target network-online.target
> > > > > Conflicts=shutdown.target
> > > > > 
> > > > > [Install]
> > > > > WantedBy=multi-user.target
> > > > > WantedBy=network-online.target
> > > > > 
> > > > > [Service]
> > > > > Type=oneshot
> > > > > EnvironmentFile=-/etc/default/networking
> > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ]
> > > > > &&
> > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] &&
> > > > > udevadm settle'
> > > > 
> > > > I wouldn't trust that you can run udevadm settle here and get
> > > > the
> > > > right
> > > > results.  This will only wait for the current udev hotplug
> > > > events
> > > > to
> > > > complete.
> > > 
> > > Oh, neat, so udev settle is already called by Debian's
> > > networking.service (as it should be) - assuming
> > > CONFIGURE_INTERFACES
> > > is set, and whatever that other stuff does (Ben is this
> > > triggering
> > > for you?)
> > 
> > I should have looked more closely at the service file (I didn't
> > notice
> > the udevadm settle in there). CONFIGURE_INTERFACES is not set in
> > /etc/default/networking and ifquery returns a bunch of interfaces.
> > Therefore 'udevadm settle' is executed.
> > 
> > I tried to debug it further by injecting commands to the pre-up
> > hook.
> > When pre-up runs:
> > 
> > * lsmod shows that ib_ipoib is loaded
> > * 'ls -l /sys/class/net/' shows that neither ib0 and ib1 are
> > present
> > 
> > To me it looks like a race condition between populating
> > /sys/class/net/ibX after loading ib_ipoib and the networking
> > service.
> 
> Is the rdma device present at this point? eg sys/class/infiniband ?

/sys/class/infiniband/mlx4_0 is present.

> Is any systemd-modules-load processes still running?

'/lib/systemd/systemd-modules-load /etc/rdma/modules/infiniband.conf'
is still running.

> Are the mlx IB modules loaded?

Yes: mlx4_ib, mlx4_core, and mlx_compat are loaded (according to
lsmod). The first two modules are already loaded in the initrd. Also
ib_ipoib, ib_uverbs, ib_sa, ib_mad, ib_core, ib_addr, ib_netlink are
loaded.
Jason Gunthorpe May 18, 2018, 3:35 p.m. UTC | #23
On Fri, May 18, 2018 at 11:26:07AM +0200, Benjamin Drung wrote:
> On Thursday, 17.05.2018 at 11:16 -0600, Jason Gunthorpe wrote:
> > On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote:
> > > On Tuesday, 15.05.2018 at 13:20 -0600, Jason Gunthorpe wrote:
> > > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > > > > > I added the systemd-udev-settle.service dependency:
> > > > > > 
> > > > > > ```
> > > > > > $ systemctl cat networking.service 
> > > > > > # /lib/systemd/system/networking.service
> > > > > > [Unit]
> > > > > > Description=Raise network interfaces
> > > > > > Documentation=man:interfaces(5)
> > > > > > DefaultDependencies=no
> > > > > > Wants=network.target
> > > > > > After=local-fs.target network-pre.target apparmor.service
> > > > > > systemd-sysctl.service systemd-modules-load.service
> > > > > > Before=network.target shutdown.target network-online.target
> > > > > > Conflicts=shutdown.target
> > > > > > 
> > > > > > [Install]
> > > > > > WantedBy=multi-user.target
> > > > > > WantedBy=network-online.target
> > > > > > 
> > > > > > [Service]
> > > > > > Type=oneshot
> > > > > > EnvironmentFile=-/etc/default/networking
> > > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ]
> > > > > > &&
> > > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)" ] &&
> > > > > > udevadm settle'
> > > > > 
> > > > > I wouldn't trust that you can run udevadm settle here and get
> > > > > the
> > > > > right
> > > > > results.  This will only wait for the current udev hotplug
> > > > > events
> > > > > to
> > > > > complete.
> > > > 
> > > > Oh, neat, so udev settle is already called by Debian's
> > > > networking.service (as it should be) - assuming
> > > > CONFIGURE_INTERFACES
> > > > is set, and whatever that other stuff does (Ben is this
> > > > triggering
> > > > for you?)
> > > 
> > > I should have looked more closely at the service file (I didn't
> > > notice
> > > the udevadm settle in there). CONFIGURE_INTERFACES is not set in
> > > /etc/default/networking and ifquery returns a bunch of interfaces.
> > > Therefore 'udevadm settle' is executed.
> > > 
> > > I tried to debug it further by injecting commands to the pre-up
> > > hook.
> > > When pre-up runs:
> > > 
> > > * lsmod shows that ib_ipoib is loaded
> > > * 'ls -l /sys/class/net/' shows that neither ib0 and ib1 are
> > > present
> > > 
> > > To me it looks like a race condition between populating
> > > /sys/class/net/ibX after loading ib_ipoib and the networking
> > > service.
> > 
> > Is the rdma device present at this point? eg sys/class/infiniband ?
> 
> /sys/class/infiniband/mlx4_0 is present.
> 
> > Is any systemd-modules-load processes still running?
> 
> '/lib/systemd/systemd-modules-load /etc/rdma/modules/infiniband.conf'
> is still running.

> > Are the mlx IB modules loaded?
> 
> Yes: mlx4_ib, mlx4_core, and mlx_compat are loaded (according to
> lsmod). The first two modules are already loaded in the initrd. Also
> ib_ipoib, ib_uverbs, ib_sa, ib_mad, ib_core, ib_addr, ib_netlink are
> loaded.

Hmm, that is very mysterious, then, I can't think how systemd-modules-load
could still be running at this point.

If you load the ib driver in initrd then the above should have been
scheduled very early in boot, and it has a Before=network-pre.target
which should delay networking.service from starting while it is running.

What does the logging say about when rdma-load-modules was started, and
was the IB device created before the initrd exited?

Jason
Benjamin Drung May 18, 2018, 4:22 p.m. UTC | #24
On Friday, 18.05.2018 at 09:35 -0600, Jason Gunthorpe wrote:
> On Fri, May 18, 2018 at 11:26:07AM +0200, Benjamin Drung wrote:
> > On Thursday, 17.05.2018 at 11:16 -0600, Jason Gunthorpe wrote:
> > > On Thu, May 17, 2018 at 07:02:47PM +0200, Benjamin Drung wrote:
> > > > On Tuesday, 15.05.2018 at 13:20 -0600, Jason Gunthorpe wrote:
> > > > > On Tue, May 15, 2018 at 02:15:54PM -0400, Doug Ledford wrote:
> > > > > > > I added the systemd-udev-settle.service dependency:
> > > > > > > 
> > > > > > > ```
> > > > > > > $ systemctl cat networking.service 
> > > > > > > # /lib/systemd/system/networking.service
> > > > > > > [Unit]
> > > > > > > Description=Raise network interfaces
> > > > > > > Documentation=man:interfaces(5)
> > > > > > > DefaultDependencies=no
> > > > > > > Wants=network.target
> > > > > > > After=local-fs.target network-pre.target apparmor.service
> > > > > > > systemd-sysctl.service systemd-modules-load.service
> > > > > > > Before=network.target shutdown.target network-
> > > > > > > online.target
> > > > > > > Conflicts=shutdown.target
> > > > > > > 
> > > > > > > [Install]
> > > > > > > WantedBy=multi-user.target
> > > > > > > WantedBy=network-online.target
> > > > > > > 
> > > > > > > [Service]
> > > > > > > Type=oneshot
> > > > > > > EnvironmentFile=-/etc/default/networking
> > > > > > > ExecStartPre=-/bin/sh -c '[ "$CONFIGURE_INTERFACES" !=
> > > > > > > "no" ]
> > > > > > > &&
> > > > > > > [ -n "$(ifquery --read-environment --list --exclude=lo)"
> > > > > > > ] &&
> > > > > > > udevadm settle'
> > > > > > 
> > > > > > I wouldn't trust that you can run udevadm settle here and
> > > > > > get
> > > > > > the
> > > > > > right
> > > > > > results.  This will only wait for the current udev hotplug
> > > > > > events
> > > > > > to
> > > > > > complete.
> > > > > 
> > > > > Oh, neat, so udev settle is already called by Debian's
> > > > > networking.service (as it should be) - assuming
> > > > > CONFIGURE_INTERFACES
> > > > > is set, and whatever that other stuff does (Ben is this
> > > > > triggering
> > > > > for you?)
> > > > 
> > > > I should have looked more closely at the service file (I didn't
> > > > notice
> > > > the udevadm settle in there). CONFIGURE_INTERFACES is not set
> > > > in
> > > > /etc/default/networking and ifquery returns a bunch of
> > > > interfaces.
> > > > Therefore 'udevadm settle' is executed.
> > > > 
> > > > I tried to debug it further by injecting commands to the pre-up
> > > > hook.
> > > > When pre-up runs:
> > > > 
> > > > * lsmod shows that ib_ipoib is loaded
> > > > * 'ls -l /sys/class/net/' shows that neither ib0 and ib1 are
> > > > present
> > > > 
> > > > To me it looks like a race condition between populating
> > > > /sys/class/net/ibX after loading ib_ipoib and the networking
> > > > service.
> > > 
> > > Is the rdma device present at this point? eg sys/class/infiniband
> > > ?
> > 
> > /sys/class/infiniband/mlx4_0 is present.
> > 
> > > Is any systemd-modules-load processes still running?
> > 
> > '/lib/systemd/systemd-modules-load
> > /etc/rdma/modules/infiniband.conf'
> > is still running.
> > > Are the mlx IB modules loaded?
> > 
> > Yes: mlx4_ib, mlx4_core, and mlx_compat are loaded (according to
> > lsmod). The first two modules are already loaded in the initrd.
> > Also
> > ib_ipoib, ib_uverbs, ib_sa, ib_mad, ib_core, ib_addr, ib_netlink
> > are
> > loaded.
> 
> Hmm, that is very mysterious, then, I can't think how
> systemd-modules-load could still be running at this point.
> 
> If you load the ib driver in initrd then the above should have been
> scheduled very early in boot, and it has a Before=network-pre.target
> which should delay networking.service from starting while it is
> running.
>
> What does the logging say about when rdma-load-modules was started
> and
> was the IB device created before the initrd device exited?

I opened a bug report against systemd in Debian:
https://bugs.debian.org/899002

Then I tried to implement a workaround (which does not work):

$ cat /etc/systemd/system/networking.service.d/rdma.conf
[Service]
# Work around systemd bug https://bugs.debian.org/899002
# See also https://marc.info/?l=linux-rdma&m=152639629213650&w=2
ExecStartPre=/bin/ps auxff
ExecStartPre=/bin/ls -l /sys/class/infiniband
ExecStartPre=/bin/systemctl status rdma-load-modules@infiniband.service
ExecStartPre=/bin/sh -c 'while pid=$(pidof -s systemd-modules-load); do echo "Waiting for systemd-modules-load process $pid to exit..."; tail --pid=$pid -f /dev/null; done'

systemctl status says that rdma-load-modules@infiniband.service was
started one second after networking.service.

The ps command from ExecStartPre says that only systemd-journald,
systemd-udevd, multipathd, and init were running. "ls -l
/sys/class/infiniband" says that mlx4_0 is present. And "systemctl
status rdma-load-modules@infiniband.service" says:

rdma-load-modules@infiniband.service - Load RDMA modules from /etc/rdma/modules/infiniband.conf
  Loaded: loaded (/lib/systemd/system/rdma-load-modules@.service; static; vendor preset: enabled)
  Active: inactive (dead)
    Docs: file:/usr/share/doc/rdma-core/udev.md

So it is clear that rdma-load-modules@infiniband.service is not
triggered when networking.service is started.
Jason Gunthorpe May 18, 2018, 5:31 p.m. UTC | #25
On Fri, May 18, 2018 at 06:22:12PM +0200, Benjamin Drung wrote:
> > Hmm, that is very mysterious, then, I can't think how
> > systemd-modules-load could still be running at this point.
> > 
> > If you load the ib driver in initrd then the above should have been
> > scheduled very early in boot, and it has a Before=network-pre.target
> > which should delay networking.service from starting while it is
> > running.
> >
> > What does the logging say about when rdma-load-modules was started
> > and
> > was the IB device created before the initrd device exited?
> 
> I opened a bug report against systemd in Debian:
> https://bugs.debian.org/899002
> 
> Then I tried to implement a workaround (which does not work):
> 
> $ cat /etc/systemd/system/networking.service.d/rdma.conf
> [Service]
> # Work around systemd bug https://bugs.debian.org/899002
> # See also https://marc.info/?l=linux-rdma&m=152639629213650&w=2
> ExecStartPre=/bin/ps auxff
> ExecStartPre=/bin/ls -l /sys/class/infiniband
> ExecStartPre=/bin/systemctl status rdma-load-modules@infiniband.service
> ExecStartPre=/bin/sh -c 'while pid=$(pidof -s systemd-modules-load); do echo "Waiting for systemd-modules-load process $pid to exit..."; tail --pid=$pid -f /dev/null; done'
> 
> systemctl status says that rdma-load-modules@infiniband.service was
> started one second after networking.service.
> 
> The ps command from ExecStartPre says that only systemd-journald,
> systemd-udevd, multipathd, and init were running. "ls -l
> /sys/class/infiniband" says that mlx4_0 is present. And "systemctl
> status rdma-load-modules@infiniband.service" says:
> 
> rdma-load-modules@infiniband.service - Load RDMA modules from /etc/rdma/modules/infiniband.conf
>   Loaded: loaded (/lib/systemd/system/rdma-load-modules@.service; static; vendor preset: enabled)
>   Active: inactive (dead)
>     Docs: file:/usr/share/doc/rdma-core/udev.md
> 
> So it is clear that rdma-load-modules@infiniband.service is not
> triggered when networking.service is started.

Hum, if you have the modules in the initrd then udev should schedule
this service to run essentially immediately on boot, and it should
become ordered properly..

I.e. the rdma device should already be present when udev is started.

Starting *after* networking.service suggests that the mlx4 RDMA device
was hotplugged into the system a long time after early boot! Which is
not at all what I expect.

What does dmesg say about the mlx4 driver load?

Upstream blocks module completion until the driver is done (this takes
a long time), is it possible that MOFED does this async? That could
explain everything.

Also, IMHO, the networking.service above is wrong. It should not
attempt to do udevadm settle internally, but it must depend on
systemd-udev-settle.service.

The reason is due to how systemd schedules ordering. Once it starts
running networking.service 'ExecStartPre' it will not re-consider
order past that point. So any activations done by udev while settling
have no impact on networking.service at all.

Having it depend on systemd-udev-settle.service means it gets to
recheck ordering after settle is done, but before starting
networking.service - which is the behavior it is really trying to get.

That may be a big part of this bug, go back to doing:

After=systemd-udev-settle.service
Requires=systemd-udev-settle.service
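
A minimal sketch of applying that as a drop-in instead of editing the
packaged unit. The drop-in file name udev-settle.conf and the helper
name write_settle_dropin are assumptions made for illustration; only
the two directives come from the suggestion above.

```shell
#!/bin/sh
# write_settle_dropin DIR
# Writes a drop-in into DIR (for example
# /etc/systemd/system/networking.service.d) that orders the unit after
# systemd-udev-settle.service and pulls that service in.
# The file name "udev-settle.conf" is an arbitrary, hypothetical choice.
write_settle_dropin() {
    dir="$1"
    mkdir -p "$dir" || return 1
    cat > "$dir/udev-settle.conf" <<'EOF'
[Unit]
After=systemd-udev-settle.service
Requires=systemd-udev-settle.service
EOF
}
```

Something like "write_settle_dropin /etc/systemd/system/networking.service.d"
followed by "systemctl daemon-reload" should be enough to try it.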

Jason
Benjamin Drung May 22, 2018, 1:23 p.m. UTC | #26
On Friday, 18.05.2018 at 11:31 -0600, Jason Gunthorpe wrote:
> On Fri, May 18, 2018 at 06:22:12PM +0200, Benjamin Drung wrote:
> > > Hmm, that is very mysterious, then, I can't think how
> > > systemd-modules-load could still be running at this point.
> > > 
> > > If you load the ib driver in initrd then the above should have
> > > been scheduled very early in boot, and it has a
> > > Before=network-pre.target which should delay networking.service
> > > from starting while it is running.
> > > 
> > > What does the logging say about when rdma-load-modules was
> > > started and was the IB device created before the initrd device
> > > exited?
> > 
> > I opened a bug report against systemd in Debian:
> > https://bugs.debian.org/899002
> > 
> > Then I tried to implement a workaround (which does not work):
> > 
> > $ cat /etc/systemd/system/networking.service.d/rdma.conf
> > [Service]
> > # Work around systemd bug https://bugs.debian.org/899002
> > # See also https://marc.info/?l=linux-rdma&m=152639629213650&w=2
> > ExecStartPre=/bin/ps auxff
> > ExecStartPre=/bin/ls -l /sys/class/infiniband
> > ExecStartPre=/bin/systemctl status rdma-load-modules@infiniband.service
> > ExecStartPre=/bin/sh -c 'while pid=$(pidof -s systemd-modules-load); do echo "Waiting for systemd-modules-load process $pid to exit..."; tail --pid=$pid -f /dev/null; done'
> > 
> > systemctl status says that rdma-load-modules@infiniband.service was
> > started one second after networking.service.
> > 
> > The ps command from ExecStartPre says that only systemd-journald,
> > systemd-udevd, multipathd, and init were running. "ls -l
> > /sys/class/infiniband" says that mlx4_0 is present. And "systemctl
> > status rdma-load-modules@infiniband.service" says:
> > 
> > rdma-load-modules@infiniband.service - Load RDMA modules from /etc/rdma/modules/infiniband.conf
> >   Loaded: loaded (/lib/systemd/system/rdma-load-modules@.service; static; vendor preset: enabled)
> >   Active: inactive (dead)
> >     Docs: file:/usr/share/doc/rdma-core/udev.md
> > 
> > So it is clear that rdma-load-modules@infiniband.service is not
> > triggered when networking.service is started.
> 
> Hum, if you have the modules in the initrd then udev should schedule
> this service to run essentially immediately on boot, and it should
> become ordered properly..
> 
> I.e. the rdma device should already be present when udev is started.
> 
> Starting *after* networking.service suggests that the mlx4 RDMA
> device was hotplugged into the system a long time after early boot!
> Which is not at all what I expect.
> 
> What does dmesg say about the mlx4 driver load?

I booted with break=bottom and listed the loaded modules in the initrd.
They were:

mlx4_ib
ib_sa
ib_mad
ib_core
ib_addr
ib_netlink
mlx4_core
mlx_compat

> Upstream blocks module completion until the driver is done (this
> takes a long time), is it possible that MOFED does this async? That
> could explain everything.
> 
> Also, IMHO, the networking.service above is wrong. It should not
> attempt to do udevadm settle internally, but it must depend on
> systemd-udev-settle.service.
> 
> The reason is due to how systemd schedules ordering. Once it starts
> running networking.service 'ExecStartPre' it will not re-consider
> order past that point. So any activations done by udev while settling
> have no impact on networking.service at all.
> 
> Having it depend on systemd-udev-settle.service means it gets to
> recheck ordering after settle is done, but before starting
> networking.service - which is the behavior it is really trying to get.
> 
> That may be a big part of this bug, go back to doing:
> 
> After=systemd-udev-settle.service
> Requires=systemd-udev-settle.service

You are right. I modified networking.service accordingly and it works
as expected now. I sent a patch for ifupdown to Debian, but a
discussion about the fix started: https://bugs.debian.org/899002
Jason Gunthorpe May 22, 2018, 3:44 p.m. UTC | #27
On Tue, May 22, 2018 at 03:23:47PM +0200, Benjamin Drung wrote:

> > Also, IMHO, the networking.service above is wrong. It should not
> > attempt to do udevadm settle internally, but it must depend on
> > systemd-udev-settle.service.

I'm still a little puzzled here, as ipoib should have been started
really, really early on, early enough to get ordered. Can you check
the boot log dmesgs to see when the banner for plugging in the mlx4 IB
device is printed? I fear that is being done after the initrd is
finished with MOFED.
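
One way to pull that line out of the kernel log, as a rough sketch.
The helper name first_mlx4_line is made up for illustration, and the
match is simply the first line mentioning mlx4, since the exact banner
text may differ between upstream and MOFED.

```shell
#!/bin/sh
# first_mlx4_line
# Reads kernel log text on stdin and prints the first line that
# mentions mlx4 (case-insensitive), then stops reading.
first_mlx4_line() {
    awk 'tolower($0) ~ /mlx4/ { print; exit }'
}
```

For example "dmesg | first_mlx4_line" or "journalctl -k -b |
first_mlx4_line"; the timestamp on that line can then be compared with
the point in the log where the initrd hands over to the real root.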

> > The reason is due to how systemd schedules ordering. Once it starts
> > running networking.service 'ExecStartPre' it will not re-consider
> > order past that point. So any activations done by udev while settling
> > have no impact on networking.service at all.
> > 
> > Having it depend on systemd-udev-settle.service means it gets to
> > recheck ordering after settle is done, but before starting
> > networking.service - which is the behavior it is really trying to get.
> > 
> > That may be a big part of this bug, go back to doing:
> > 
> > After=systemd-udev-settle.service
> > Requires=systemd-udev-settle.service
> 
> You are right. I modified networking.service accordingly and it works
> as expected now. I sent a patch for ifupdown to Debian, but a
> discussion about the fix started: https://bugs.debian.org/899002

That bug discussion has taken a strange turn..

Firstly, the entire point of 'udevadm settle' is to create
compatibility with non-hotplug aware things like ifupdown by
synchronizing the cold plug of all the boot devices.

So it is wrong to say that this is somehow the fault of the rdma
scripts - they should work correctly with the settle mechanism.

I can understand ifupdown not wanting to always depend on settle,
but there really is no choice. If their scripts don't support hot plug
then they have to settle before running them or lots of stuff just
won't work right.

I think the compromise for them should be to keep the settle as optional,
but implement it correctly.

Something like this helper service:

ifupdown-pre.service:

[Unit]
Description=Helper to synchronize boot up for ifupdown
Wants=network.target
After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service
Before=network.target shutdown.target network-online.target
Conflicts=shutdown.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c '[ "$CONFIGURE_INTERFACES" != "no" ] && systemctl start systemd-udev-settle.service'

They could compromise and put the optional settle they already have
in their own script (like

Then change networking.service to have:

# Launch the oneshot to conditionally start systemd-udev-settle.service before starting this
Requires=ifupdown-pre.service
After=ifupdown-pre.service
# If it did launch systemd-udev-settle.service then wait for it before starting
After=systemd-udev-settle.service

And obviously drop the wrong ExecStartPre from networking.service.
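
To check that the ordering actually took effect after such a change,
something along these lines could be used. It is only a sketch: the
helper name ordered_after is hypothetical, and it just tests membership
in the After= list that "systemctl show -p After" reports.

```shell
#!/bin/sh
# ordered_after LIST UNIT
# Succeeds if UNIT appears in LIST, the space-separated value of a
# unit's After= property. A leading "After=" (as printed by plain
# "systemctl show -p After") is stripped before matching.
ordered_after() {
    list="${1#After=}"
    case " $list " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

Usage would be roughly: ordered_after "$(systemctl show -p After
networking.service)" ifupdown-pre.service && echo ordered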

Jason
Patch

diff --git a/srp_daemon/srp_daemon.service.in b/srp_daemon/srp_daemon.service.in
index 33dddd5cb46fef..cca1fce9c99283 100644
--- a/srp_daemon/srp_daemon.service.in
+++ b/srp_daemon/srp_daemon.service.in
@@ -1,6 +1,6 @@ 
 [Unit]
 Description=Daemon that discovers and logs in to SRP target systems
-Documentation=man:srp_daemon file:/etc/rdma/rdma.conf file:/etc/srp_daemon.conf
+Documentation=man:srp_daemon file:/etc/srp_daemon.conf
 DefaultDependencies=false
 Conflicts=emergency.target emergency.service
 Before=remote-fs-pre.target
diff --git a/srp_daemon/srp_daemon_port@.service.in b/srp_daemon/srp_daemon_port@.service.in
index 0ec966f912aec8..3c9c824fd243aa 100644
--- a/srp_daemon/srp_daemon_port@.service.in
+++ b/srp_daemon/srp_daemon_port@.service.in
@@ -3,6 +3,7 @@  Description=SRP daemon that monitors port %i
 Documentation=man:srp_daemon file:/etc/rdma/rdma.conf file:/etc/srp_daemon.conf
 DefaultDependencies=false
 Conflicts=emergency.target emergency.service
+Requires=srp_kernel_module.service
 After=srp_daemon.service dev-infiniband-umad-%i.device network.target
 BindsTo=srp_daemon.service dev-infiniband-umad-%i.device
 Before=remote-fs-pre.target
diff --git a/srp_daemon/srp_kernel_module.service b/srp_daemon/srp_kernel_module.service
new file mode 100644
index 00000000000000..b779031578dae1
--- /dev/null
+++ b/srp_daemon/srp_kernel_module.service
@@ -0,0 +1,12 @@ 
+[Unit]
+Description=Load the SRP daemon kernel module
+Documentation=man:srp_daemon
+After=systemd-modules-load.service
+ConditionCapability=CAP_SYS_MODULE
+ConditionPathIsDirectory=!/sys/class/infiniband_srp/
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStartPre=/bin/bash -c 'mkdir -p /run/modules-load.d && echo ib_srp > /run/modules-load.d/rdma_core_srp_modules.conf'
+ExecStart=/lib/systemd/systemd-modules-load
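
The pre-start step of srp_kernel_module.service can be tried by hand
outside systemd with a sketch like the following. The function name
queue_srp_module and its arguments are hypothetical; the file name,
module name and sysfs path are the ones used in the unit above.

```shell
#!/bin/sh
# queue_srp_module RUNDIR [SYSFS_DIR]
# Mirrors the unit: skip if the SRP sysfs class directory already exists
# (ConditionPathIsDirectory=!/sys/class/infiniband_srp/), otherwise write
# a modules-load.d fragment naming ib_srp into RUNDIR.
queue_srp_module() {
    rundir="$1"
    sysfs_dir="${2:-/sys/class/infiniband_srp}"
    # Nothing to do when the class directory is already there.
    if [ -d "$sysfs_dir" ]; then
        return 0
    fi
    mkdir -p "$rundir" && echo ib_srp > "$rundir/rdma_core_srp_modules.conf"
}
```

Running "queue_srp_module /run/modules-load.d" as root and then
/lib/systemd/systemd-modules-load should be roughly equivalent to
starting the unit.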