[v2,33/33] lustre: update version to 2.9.99
diff mbox series

Message ID 1546812868-11794-34-git-send-email-jsimmons@infradead.org
State New
Headers show
Series
  • lustre: add PFL support
Related show

Commit Message

James Simmons Jan. 6, 2019, 10:14 p.m. UTC
With the majority of missing patches and features from the lustre
2.10 release merged upstream its time to update the upstream
client's version.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

NeilBrown Jan. 8, 2019, 1:38 a.m. UTC | #1
On Sun, Jan 06 2019, James Simmons wrote:

> With the majority of missing patches and features from the lustre
> 2.10 release merged upstream its time to update the upstream
> client's version.

:-)

Thanks to some of these patches (this batch or previous) I have fewer
failing tests now .. those not many fewer.

The current summary is
     45             status: FAIL
    556             status: PASS
     47             status: SKIP

It used to be >50 FAIL.

The failing tests are listed below.
I know why the FID patches fail - we've discussed that.
Maybe it is time to start working out why some of the others are
failing.

Your two recent series are in my lustre-testing branch now - thanks.

NeilBrown


sanity: FAIL: test_27G 'testpool' is not empty
sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1
sanity: FAIL: test_56x migrate failed rc = 11
sanity: FAIL: test_56xa migrate failed rc = 11
sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error
sanity: FAIL: test_56aa lfs find --size wrong under striped dir
sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed
sanity: FAIL: test_64b oos.sh failed: 1
sanity: FAIL: test_102c setstripe failed
sanity: FAIL: test_102j file1-0-1: size  != 65536
sanity: FAIL: test_103a misc test failed
sanity: FAIL: test_104b lfs check servers test failed
sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
sanity: FAIL: test_160a changelog 'f160a.sanity' fid  != file fid [0x240002342:0xd:0x0]
sanity: FAIL: test_161d cat failed
sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats
sanity: FAIL: test_208 get lease error
sanity: FAIL: test_225a mds-survey with zero-stripe failed
sanity: FAIL: test_225b mds-survey with stripe_count failed
sanity: FAIL: test_233a cannot access /mnt/lustre using its FID '[0x200000007:0x1:0x0]'
sanity: FAIL: test_233b cannot access /mnt/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
sanity: FAIL: test_255c Ladvise test10 failed, 255
sanity: FAIL: test_270a Can't create DoM layout
sanity: FAIL: test_270c bad pattern
sanity: FAIL: test_270e lfs find -L: found 1, expected 20
sanity: FAIL: test_270f Can't create file with 262144 DoM stripe
sanity: FAIL: test_271c Too few enqueues , expected > 2000
sanity: FAIL: test_271f expect 1 READ RPC,  occured
sanity: FAIL: test_300g create striped_dir failed
sanity: FAIL: test_300n create striped dir fails with gid=-1
sanity: FAIL: test_300q create d300q.sanity fails
sanity: FAIL: test_315 read is not accounted ()
sanity: FAIL: test_317 Expected Block 4096 got 10240 for f317.sanity
sanity: FAIL: test_405 One layout swap locked test failed
sanity: FAIL: test_406 mkdir d406.sanity failed
sanity: FAIL: test_409 Fail to cleanup the env!
sanity: FAIL: test_410 no inode match
sanity: FAIL: test_412 mkdir failed
sanity: FAIL: test_413 don't expect 1
sanity: FAIL: test_802 (5) Mount client with 'ro' should succeed
James Simmons Jan. 8, 2019, 4:26 a.m. UTC | #2
> On Sun, Jan 06 2019, James Simmons wrote:
> 
> > With the majority of missing patches and features from the lustre
> > 2.10 release merged upstream its time to update the upstream
> > client's version.
> 
> :-)
> 
> Thanks to some of these patches (this batch or previous) I have fewer
> failing tests now .. those not many fewer.
> 
> The current summary is
>      45             status: FAIL
>     556             status: PASS
>      47             status: SKIP
> 
> It used to be >50 FAIL.
> 
> The failing tests are listed below.
> I know why the FID patches fail - we've discussed that.
> Maybe it is time to start working out why some of the others are
> failing.

You are running a much newer test suite. Using the test suite from lustre 
2.10 I see the following failures.

sanity: FAIL: test_103a run_acl_subtest cp failed    (real failure)
sanity: FAIL: test_215 cannot read lnet.stats	     (not sysfs aware)
sanity: FAIL: test_233a cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'
sanity: FAIL: test_233b cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
sanity: FAIL: test_256 Changelog catalog has wrong number of slots 0  (fails for 2.10 LTS release as well)
 
> Your two recent series are in my lustre-testing branch now - thanks.
> 
> NeilBrown
> 
> 
> sanity: FAIL: test_27G 'testpool' is not empty 

See LU-11208. Test currently with older lustre versions.

> sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1
> sanity: FAIL: test_56x migrate failed rc = 11
> sanity: FAIL: test_56xa migrate failed rc = 11
> sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error
> sanity: FAIL: test_56aa lfs find --size wrong under striped dir
> sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed
> sanity: FAIL: test_64b oos.sh failed: 1
> sanity: FAIL: test_102c setstripe failed
> sanity: FAIL: test_102j file1-0-1: size  != 65536

I believe these are due to the DoM feature missing

> sanity: FAIL: test_103a misc test failed

103a is real failure. Never solved yet. (LU-11594 and LU-10334 for Ubuntu)

> sanity: FAIL: test_104b lfs check servers test failed

sysfs bug. I have a patch for this.

> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed

What version of e2fsprog are you running? You need a 1.44 version and
this should go away.

> sanity: FAIL: test_160a changelog 'f160a.sanity' fid  != file fid [0x240002342:0xd:0x0]
> sanity: FAIL: test_161d cat failed

Might be missing some more changelog improvements.

> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats

Strange?

> sanity: FAIL: test_208 get lease error
> sanity: FAIL: test_225a mds-survey with zero-stripe failed
> sanity: FAIL: test_225b mds-survey with stripe_count failed

Never ran that since its not in 2.10.

> sanity: FAIL: test_233a cannot access /mnt/lustre using its FID '[0x200000007:0x1:0x0]'
> sanity: FAIL: test_233b cannot access /mnt/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'

> sanity: FAIL: test_255c Ladvise test10 failed, 255
> sanity: FAIL: test_270a Can't create DoM layout
> sanity: FAIL: test_270c bad pattern
> sanity: FAIL: test_270e lfs find -L: found 1, expected 20
> sanity: FAIL: test_270f Can't create file with 262144 DoM stripe
> sanity: FAIL: test_271c Too few enqueues , expected > 2000
> sanity: FAIL: test_271f expect 1 READ RPC,  occured
> sanity: FAIL: test_300g create striped_dir failed
> sanity: FAIL: test_300n create striped dir fails with gid=-1
> sanity: FAIL: test_300q create d300q.sanity fails
> sanity: FAIL: test_315 read is not accounted ()
> sanity: FAIL: test_317 Expected Block 4096 got 10240 for f317.sanity
> sanity: FAIL: test_405 One layout swap locked test failed
> sanity: FAIL: test_406 mkdir d406.sanity failed
> sanity: FAIL: test_409 Fail to cleanup the env!
 
More DoM issues? Could be FLR as well if you are running the latest
test suite.

> sanity: FAIL: test_410 no inode match

This is a weird test running a local kernel module.

> sanity: FAIL: test_412 mkdir failed
> sanity: FAIL: test_413 don't expect 1

More DoM ???? Have to look at this.

> sanity: FAIL: test_802 (5) Mount client with 'ro' should succeed

Is test is broken. It assumes you have a specially patched kernel.
Details are under ticket LU-684.

The nice thing is with the linux client is that we are at a point
it wouldn't be a huge leap to integrate DoM (Data on MetaData).
The reason I suggest cleanups and moving out of staging first was
to perserve git blame a bit better with future patches. Currently
we see a lot of "0846e85ba2346 (NeilBrown 2018-06-07" with git blame.
Andreas Dilger Jan. 8, 2019, 10:13 a.m. UTC | #3
On Jan 7, 2019, at 21:26, James Simmons <jsimmons@infradead.org> wrote:
> 
>> 
>> On Sun, Jan 06 2019, James Simmons wrote:
>> 
>>> With the majority of missing patches and features from the lustre
>>> 2.10 release merged upstream its time to update the upstream
>>> client's version.
>> 
>> :-)
>> 
>> Thanks to some of these patches (this batch or previous) I have fewer
>> failing tests now .. those not many fewer.
>> 
>> The current summary is
>>     45             status: FAIL
>>    556             status: PASS
>>     47             status: SKIP
>> 
>> It used to be >50 FAIL.
>> 
>> The failing tests are listed below.
>> I know why the FID patches fail - we've discussed that.
>> Maybe it is time to start working out why some of the others are
>> failing.
> 
> You are running a much newer test suite. Using the test suite from lustre 
> 2.10 I see the following failures.
> 
> sanity: FAIL: test_103a run_acl_subtest cp failed    (real failure)
> sanity: FAIL: test_215 cannot read lnet.stats	     (not sysfs aware)
> sanity: FAIL: test_233a cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'
> sanity: FAIL: test_233b cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
> sanity: FAIL: test_256 Changelog catalog has wrong number of slots 0  (fails for 2.10 LTS release as well)

Yes, there are definitely some tests that do not have proper client/server version/feature checks, since the tests are introduced with the code they
are testing.  There are a number of patches in Gerrit that are adding the
proper checks that I'd like to get landed, because we do run client/server
version interop testing, but they always lag a bit behind and we never see
test-script/client version issues in our testing. 

>> Your two recent series are in my lustre-testing branch now - thanks.
>> 
>> NeilBrown
>> 
>> 
>> sanity: FAIL: test_27G 'testpool' is not empty 
> 
> See LU-11208. Test currently with older lustre versions.
> 
>> sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1
>> sanity: FAIL: test_56x migrate failed rc = 11
>> sanity: FAIL: test_56xa migrate failed rc = 11
>> sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error
>> sanity: FAIL: test_56aa lfs find --size wrong under striped dir
>> sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed
>> sanity: FAIL: test_64b oos.sh failed: 1
>> sanity: FAIL: test_102c setstripe failed
>> sanity: FAIL: test_102j file1-0-1: size  != 65536
> 
> I believe these are due to the DoM feature missing
> 
>> sanity: FAIL: test_103a misc test failed
> 
> 103a is real failure. Never solved yet. (LU-11594 and LU-10334 for Ubuntu)
> 
>> sanity: FAIL: test_104b lfs check servers test failed
> 
> sysfs bug. I have a patch for this.
> 
>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
> 
> What version of e2fsprog are you running? You need a 1.44 version and
> this should go away.

To be clear - the Lustre-patched "filefrag" at:

https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/

Once Lustre gets into upstream, or convince another filesystem to use the
Lustre filefrag extension (multiple devices, which BtrFS and XFS could
use) we can get the support landed into the upstream e2fsprogs.

>> sanity: FAIL: test_160a changelog 'f160a.sanity' fid  != file fid [0x240002342:0xd:0x0]
>> sanity: FAIL: test_161d cat failed
> 
> Might be missing some more changelog improvements.
> 
>> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats
> 
> Strange?

This might be because the upstream Lustre doesn't allow setting per-process
JobID via environment variable, only as a single per-node value.  The real
unfortunate part is that the "get JobID from environment" actually works for
every reasonable architecture (even the one which was originally broken
fixed it), but it got yanked anyway.  This is actually one of the features
of Lustre that lots of HPC sites like to use, since it allows them to track
on the servers which users/jobs/processes on the client are doing IO.

>> sanity: FAIL: test_208 get lease error
>> sanity: FAIL: test_225a mds-survey with zero-stripe failed
>> sanity: FAIL: test_225b mds-survey with stripe_count failed
> 
> Never ran that since its not in 2.10.
> 
>> sanity: FAIL: test_233a cannot access /mnt/lustre using its FID '[0x200000007:0x1:0x0]'
>> sanity: FAIL: test_233b cannot access /mnt/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
> 
>> sanity: FAIL: test_255c Ladvise test10 failed, 255
>> sanity: FAIL: test_270a Can't create DoM layout
>> sanity: FAIL: test_270c bad pattern
>> sanity: FAIL: test_270e lfs find -L: found 1, expected 20
>> sanity: FAIL: test_270f Can't create file with 262144 DoM stripe
>> sanity: FAIL: test_271c Too few enqueues , expected > 2000
>> sanity: FAIL: test_271f expect 1 READ RPC,  occured
>> sanity: FAIL: test_300g create striped_dir failed
>> sanity: FAIL: test_300n create striped dir fails with gid=-1
>> sanity: FAIL: test_300q create d300q.sanity fails
>> sanity: FAIL: test_315 read is not accounted ()
>> sanity: FAIL: test_317 Expected Block 4096 got 10240 for f317.sanity
>> sanity: FAIL: test_405 One layout swap locked test failed
>> sanity: FAIL: test_406 mkdir d406.sanity failed
>> sanity: FAIL: test_409 Fail to cleanup the env!
> 
> More DoM issues? Could be FLR as well if you are running the latest
> test suite.
> 
>> sanity: FAIL: test_410 no inode match
> 
> This is a weird test running a local kernel module.
> 
>> sanity: FAIL: test_412 mkdir failed
>> sanity: FAIL: test_413 don't expect 1
> 
> More DoM ???? Have to look at this.
> 
>> sanity: FAIL: test_802 (5) Mount client with 'ro' should succeed
> 
> Is test is broken. It assumes you have a specially patched kernel.
> Details are under ticket LU-684.
> 
> The nice thing is with the linux client is that we are at a point
> it wouldn't be a huge leap to integrate DoM (Data on MetaData).
> The reason I suggest cleanups and moving out of staging first was
> to perserve git blame a bit better with future patches. Currently
> we see a lot of "0846e85ba2346 (NeilBrown 2018-06-07" with git blame.

Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud
James Simmons Jan. 8, 2019, 9:47 p.m. UTC | #4
> >> sanity: FAIL: test_104b lfs check servers test failed
> > 
> > sysfs bug. I have a patch for this.
> > 
> >> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
> >> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
> >> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
> >> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
> >> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
> > 
> > What version of e2fsprog are you running? You need a 1.44 version and
> > this should go away.
> 
> To be clear - the Lustre-patched "filefrag" at:
> 
> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
> 
> Once Lustre gets into upstream, or convince another filesystem to use the
> Lustre filefrag extension (multiple devices, which BtrFS and XFS could
> use) we can get the support landed into the upstream e2fsprogs.

I swore that Ubuntu18 testing passed with the default e2fsprogs (1.44.4).
To let Neil know, this is why lustre_fiemap.h exist in the uapi headers 
directory. This kind of functionality would help the community at large.
 
> >> sanity: FAIL: test_160a changelog 'f160a.sanity' fid  != file fid [0x240002342:0xd:0x0]
> >> sanity: FAIL: test_161d cat failed
> > 
> > Might be missing some more changelog improvements.
> > 
> >> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats
> > 
> > Strange?
> 
> This might be because the upstream Lustre doesn't allow setting per-process
> JobID via environment variable, only as a single per-node value.  The real
> unfortunate part is that the "get JobID from environment" actually works for
> every reasonable architecture (even the one which was originally broken
> fixed it), but it got yanked anyway.  This is actually one of the features
> of Lustre that lots of HPC sites like to use, since it allows them to track
> on the servers which users/jobs/processes on the client are doing IO.

To give background for Neil see thread:

https://lore.kernel.org/patchwork/patch/416846

In this case I do agree with Greg. The latest jobid does implement an
upcall and upcalls don't play niece with containers. Their is also the
namespace issue pointed out. I think the namespace issue might be fixed
in the latest OpenSFS code. The whole approach to stats in lustre is
pretty awful. Take jobstats for example. Currently the approach is
to poll inside the kernel at specific intervals. Part of the polling is 
scanning the running processes environment space. On top of this the 
administor ends up creating scripts to poll the proc / debugfs entry. 
Other types of lustre stat files take a similar approach. Scripts have
to poll debugfs / procfs entries.

I have been thinking what would be a better approach since I like to
approach this problem for the 2.13 time frame. Our admins at my work
place want to be able to collect application stats without being root.
So placing stats in debugfs is not an option, which we currently do
the linux client :-( The stats are not a good fit for sysfs. The solution 
I have been pondering is using netlink. Since netlink is socket based it 
can be treated as a pipe. Now you are thinking well you still need to poll 
on the netlink socket but you don't have too. systemd does it for you :-)  
We can create systemd service file which uses

ListenNetlink=generic "multicast group" ...

to launch a service to collect the stats. As for job stats we use another
type of netlink class called process connector. The below link cover this
little know feature. It is available on every system in the support 
matrix.

https://www.slideshare.net/kerneltlv/kernel-proc-connector-and-containers

In this case we parsen for the job schedular ids and pass that to lustre
using system.d.
Andreas Dilger Jan. 9, 2019, 12:15 a.m. UTC | #5
On Jan 8, 2019, at 14:47, James Simmons <jsimmons@infradead.org> wrote:
> 
>>>> sanity: FAIL: test_104b lfs check servers test failed
>>> 
>>> sysfs bug. I have a patch for this.
>>> 
>>>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
>>>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
>>>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
>>>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
>>>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
>>> 
>>> What version of e2fsprog are you running? You need a 1.44 version and
>>> this should go away.
>> 
>> To be clear - the Lustre-patched "filefrag" at:
>> 
>> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
>> 
>> Once Lustre gets into upstream, or convince another filesystem to use the
>> Lustre filefrag extension (multiple devices, which BtrFS and XFS could
>> use) we can get the support landed into the upstream e2fsprogs.
> 
> I swore that Ubuntu18 testing passed with the default e2fsprogs (1.44.4).
> To let Neil know, this is why lustre_fiemap.h exist in the uapi headers 
> directory. This kind of functionality would help the community at large.

The returned data is identical for single-striped files, so the vanilla
filefrag will work on Lustre in the common case.

>>>> sanity: FAIL: test_160a changelog 'f160a.sanity' fid  != file fid [0x240002342:0xd:0x0]
>>>> sanity: FAIL: test_161d cat failed
>>> 
>>> Might be missing some more changelog improvements.
>>> 
>>>> sanity: FAIL: test_205 No jobstats for id.205.mkdir.9480 found on mds1::*.lustre-MDT0000.job_stats
>>> 
>>> Strange?
>> 
>> This might be because the upstream Lustre doesn't allow setting per-process
>> JobID via environment variable, only as a single per-node value.  The real
>> unfortunate part is that the "get JobID from environment" actually works for
>> every reasonable architecture (even the one which was originally broken
>> fixed it), but it got yanked anyway.  This is actually one of the features
>> of Lustre that lots of HPC sites like to use, since it allows them to track
>> on the servers which users/jobs/processes on the client are doing IO.
> 
> To give background for Neil see thread:
> 
> https://lore.kernel.org/patchwork/patch/416846
> 
> In this case I do agree with Greg. The latest jobid does implement an
> upcall and upcalls don't play niece with containers. Their is also the
> namespace issue pointed out. I think the namespace issue might be fixed
> in the latest OpenSFS code.

I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
maybe in the kernel client where we weren't allowed to parse the process
environment directly.  I agree an upcall is problematic with namespaces,
in addition to being less functional (only a JobID per node instead of
per process), which is why direct access to JOBENV is better IMHO.

> The whole approach to stats in lustre is
> pretty awful. Take jobstats for example. Currently the approach is
> to poll inside the kernel at specific intervals. Part of the polling is 
> scanning the running processes environment space. On top of this the 
> administor ends up creating scripts to poll the proc / debugfs entry. 
> Other types of lustre stat files take a similar approach. Scripts have
> to poll debugfs / procfs entries.

I think that issue is orthogonal to getting the actual JobID.  That is
the stats collection from the kernel.  We shouldn't be inventing a new
way to process that.  What does "top" do?  Read a thousand /proc files
every second because that is flexible for different use cases.  There
are much fewer Lustre stats files on a given node, and I haven't heard
that the actual stats reading interface is a performance issue.

> I have been thinking what would be a better approach since I like to
> approach this problem for the 2.13 time frame. Our admins at my work
> place want to be able to collect application stats without being root.
> So placing stats in debugfs is not an option, which we currently do
> the linux client :-( The stats are not a good fit for sysfs. The solution 
> I have been pondering is using netlink. Since netlink is socket based it 
> can be treated as a pipe. Now you are thinking well you still need to poll 
> on the netlink socket but you don't have too. systemd does it for you :-)  
> We can create systemd service file which uses

For the love of all that is holy, do not make Lustre stats usage depend
on Systemd to be usable.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
James Simmons Jan. 9, 2019, 6:28 p.m. UTC | #6
> >> This might be because the upstream Lustre doesn't allow setting per-process
> >> JobID via environment variable, only as a single per-node value.  The real
> >> unfortunate part is that the "get JobID from environment" actually works for
> >> every reasonable architecture (even the one which was originally broken
> >> fixed it), but it got yanked anyway.  This is actually one of the features
> >> of Lustre that lots of HPC sites like to use, since it allows them to track
> >> on the servers which users/jobs/processes on the client are doing IO.
> > 
> > To give background for Neil see thread:
> > 
> > https://lore.kernel.org/patchwork/patch/416846
> > 
> > In this case I do agree with Greg. The latest jobid does implement an
> > upcall and upcalls don't play niece with containers. Their is also the
> > namespace issue pointed out. I think the namespace issue might be fixed
> > in the latest OpenSFS code.
> 
> I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
> maybe in the kernel client where we weren't allowed to parse the process
> environment directly.  I agree an upcall is problematic with namespaces,
> in addition to being less functional (only a JobID per node instead of
> per process), which is why direct access to JOBENV is better IMHO.

I have some evil ideas about this. Need to think about it some more since
this is a more complex problem.
 
> > The whole approach to stats in lustre is
> > pretty awful. Take jobstats for example. Currently the approach is
> > to poll inside the kernel at specific intervals. Part of the polling is 
> > scanning the running processes environment space. On top of this the 
> > administor ends up creating scripts to poll the proc / debugfs entry. 
> > Other types of lustre stat files take a similar approach. Scripts have
> > to poll debugfs / procfs entries.
> 
> I think that issue is orthogonal to getting the actual JobID.  That is
> the stats collection from the kernel.  We shouldn't be inventing a new
> way to process that.  What does "top" do?  Read a thousand /proc files
> every second because that is flexible for different use cases.  There
> are much fewer Lustre stats files on a given node, and I haven't heard
> that the actual stats reading interface is a performance issue.

Because the policy for the linux kernel is not to add non processes 
related information in procfs anymore. "top" reads process information
from procfs which is okay. This means the stats lustre generates are
required to be placed in debugfs. The problem their is you need to be
root to access this information. I told the administrator about this
and they told me in no way will they run an application as root just
to read stats. We really don't want to require users to mount their
debugfs partitions to allow non root uses to access it. So I looked 
into alteranatives. Actually with netlink you have far more power for 
handling stats then polling some proc file. Also while for most cases the 
stat files are not huge in general but if we do end up having a stat 
seq_file with a huge amount of data then polling that file can really 
spike the load on an node. 

> > I have been thinking what would be a better approach since I like to
> > approach this problem for the 2.13 time frame. Our admins at my work
> > place want to be able to collect application stats without being root.
> > So placing stats in debugfs is not an option, which we currently do
> > the linux client :-( The stats are not a good fit for sysfs. The solution 
> > I have been pondering is using netlink. Since netlink is socket based it 
> > can be treated as a pipe. Now you are thinking well you still need to poll 
> > on the netlink socket but you don't have too. systemd does it for you :-)  
> > We can create systemd service file which uses
> 
> For the love of all that is holy, do not make Lustre stats usage depend
> on Systemd to be usable.

I never write code that locks in one approach ever. Take for example the
lctl conf_param / set_param -P handling with the move to sysfs. Instead
of the old upcall method to lctl now we have a udev rule. That rule is not
law!!! A site could create their own udev rule if they want to say log
changes to the lustre tunables. Keep in mind udev rules need to be simple
since they block until completed much like upcalls do. If you want to
run a heavy application you can create a system.d service to handle the
tunable uevent. If you are really clever you can use dbus to send that
the tuning event to external nodes. Their are many creative options.

The same is true with the stats netlink approach. I was up late last night
pondering a design for the netlink stats. I have to put together a list
of my ideas and run it by my admins. So no systemd is not a hard 
requirment. Just an option for people into that sort of thing. Using
udev and netlink opens up a whole new stack to take advantage of.
Andreas Dilger Jan. 9, 2019, 11:16 p.m. UTC | #7
On Jan 9, 2019, at 11:28, James Simmons <jsimmons@infradead.org> wrote:
> 
> 
>>>> This might be because the upstream Lustre doesn't allow setting per-process
>>>> JobID via environment variable, only as a single per-node value.  The real
>>>> unfortunate part is that the "get JobID from environment" actually works for
>>>> every reasonable architecture (even the one which was originally broken
>>>> fixed it), but it got yanked anyway.  This is actually one of the features
>>>> of Lustre that lots of HPC sites like to use, since it allows them to track
>>>> on the servers which users/jobs/processes on the client are doing IO.
>>> 
>>> To give background for Neil see thread:
>>> 
>>> https://lore.kernel.org/patchwork/patch/416846
>>> 
>>> In this case I do agree with Greg. The latest jobid does implement an
>>> upcall and upcalls don't play niece with containers. Their is also the
>>> namespace issue pointed out. I think the namespace issue might be fixed
>>> in the latest OpenSFS code.
>> 
>> I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
>> maybe in the kernel client where we weren't allowed to parse the process
>> environment directly.  I agree an upcall is problematic with namespaces,
>> in addition to being less functional (only a JobID per node instead of
>> per process), which is why direct access to JOBENV is better IMHO.
> 
> I have some evil ideas about this. Need to think about it some more since
> this is a more complex problem.

Since the kernel manages the environment variables via getenv() and setenv(),
I honestly don't see why accessing them directly is a huge issue?

>>> The whole approach to stats in lustre is
>>> pretty awful. Take jobstats for example. Currently the approach is
>>> to poll inside the kernel at specific intervals. Part of the polling is 
>>> scanning the running processes environment space. On top of this the 
>>> administor ends up creating scripts to poll the proc / debugfs entry. 
>>> Other types of lustre stat files take a similar approach. Scripts have
>>> to poll debugfs / procfs entries.
>> 
>> I think that issue is orthogonal to getting the actual JobID.  That is
>> the stats collection from the kernel.  We shouldn't be inventing a new
>> way to process that.  What does "top" do?  Read a thousand /proc files
>> every second because that is flexible for different use cases.  There
>> are much fewer Lustre stats files on a given node, and I haven't heard
>> that the actual stats reading interface is a performance issue.
> 
> Because the policy for the linux kernel is not to add non processes 
> related information in procfs anymore. "top" reads process information
> from procfs which is okay.

The location of the stats (procfs vs. sysfs vs. debugfs) wasn't my point.
My point was that a *very* core kernel performance monitoring utility is
doing open/read/close from virtual kernel files, so before we go ahead
and invent our own performance monitoring framework (which may be frowned
upon by upstream for arbitrary reasons because it isn't using /proc or
/sys files).

> This means the stats lustre generates are
> required to be placed in debugfs. The problem their is you need to be
> root to access this information.

That is a self-inflicted problem because of upstream kernel policy to
move the existing files out of /proc and being unable to use /sys either.

> I told the administrator about this
> and they told me in no way will they run an application as root just
> to read stats. We really don't want to require users to mount their
> debugfs partitions to allow non root uses to access it. So I looked 
> into alteranatives. Actually with netlink you have far more power for 
> handling stats then polling some proc file. Also while for most cases the 
> stat files are not huge in general but if we do end up having a stat 
> seq_file with a huge amount of data then polling that file can really 
> spike the load on an node.

I agree it is not ideal.  One option (AFAIK) would be a udev rule that
changes the /sys/kernel/debug/lustre/* files to be at readable by a
non-root group (e.g. admin or perftools or whatever) for the collector.

>>> I have been thinking what would be a better approach since I like to
>>> approach this problem for the 2.13 time frame. Our admins at my work
>>> place want to be able to collect application stats without being root.
>>> So placing stats in debugfs is not an option, which we currently do
>>> the linux client :-( The stats are not a good fit for sysfs. The solution 
>>> I have been pondering is using netlink. Since netlink is socket based it 
>>> can be treated as a pipe. Now you are thinking well you still need to poll 
>>> on the netlink socket but you don't have too. systemd does it for you :-)  
>>> We can create systemd service file which uses
>> 
>> For the love of all that is holy, do not make Lustre stats usage depend
>> on Systemd to be usable.
> 
> I never write code that locks in one approach ever. Take for example the
> lctl conf_param / set_param -P handling with the move to sysfs. Instead
> of the old upcall method to lctl now we have a udev rule. That rule is not
> law!!! A site could create their own udev rule if they want to say log
> changes to the lustre tunables. Keep in mind udev rules need to be simple
> since they block until completed much like upcalls do. If you want to
> run a heavy application you can create a system.d service to handle the
> tunable uevent. If you are really clever you can use dbus to send that
> the tuning event to external nodes. Their are many creative options.
> 
> The same is true with the stats netlink approach. I was up late last night
> pondering a design for the netlink stats. I have to put together a list
> of my ideas and run it by my admins. So no systemd is not a hard 
> requirment. Just an option for people into that sort of thing. Using
> udev and netlink opens up a whole new stack to take advantage of.

Sorry to be negative, but I was just having fun with systemd over the
weekend on one of my home systems, and I really don't want to entangle
it into our stats.  If the existing procfs/sysfs/debugfs scraping will
continue to work in the future then I'm fine with that.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
NeilBrown Jan. 10, 2019, 12:40 a.m. UTC | #8
On Tue, Jan 08 2019, Andreas Dilger wrote:

> On Jan 7, 2019, at 21:26, James Simmons <jsimmons@infradead.org> wrote:
>> 
>>> 
>>> On Sun, Jan 06 2019, James Simmons wrote:
>>> 
>>>> With the majority of missing patches and features from the lustre
>>>> 2.10 release merged upstream its time to update the upstream
>>>> client's version.
>>> 
>>> :-)
>>> 
>>> Thanks to some of these patches (this batch or previous) I have fewer
>>> failing tests now .. those not many fewer.
>>> 
>>> The current summary is
>>>     45             status: FAIL
>>>    556             status: PASS
>>>     47             status: SKIP
>>> 
>>> It used to be >50 FAIL.
>>> 
>>> The failing tests are listed below.
>>> I know why the FID patches fail - we've discussed that.
>>> Maybe it is time to start working out why some of the others are
>>> failing.
>> 
>> You are running a much newer test suite. Using the test suite from lustre 
>> 2.10 I see the following failures.
>> 
>> sanity: FAIL: test_103a run_acl_subtest cp failed    (real failure)
>> sanity: FAIL: test_215 cannot read lnet.stats	     (not sysfs aware)
>> sanity: FAIL: test_233a cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'
>> sanity: FAIL: test_233b cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'
>> sanity: FAIL: test_256 Changelog catalog has wrong number of slots 0  (fails for 2.10 LTS release as well)
>
> Yes, there are definitely some tests that do not have proper client/server version/feature checks, since the tests are introduced with the code they
> are testing.  There are a number of patches in Gerrit that are adding the
> proper checks that I'd like to get landed, because we do run client/server
> version interop testing, but they always lag a bit behind and we never see
> test-script/client version issues in our testing. 
>
>>> Your two recent series are in my lustre-testing branch now - thanks.
>>> 
>>> NeilBrown
>>> 
>>> 
>>> sanity: FAIL: test_27G 'testpool' is not empty 
>> 
>> See LU-11208. Test currently with older lustre versions.
>> 
>>> sanity: FAIL: test_56w /root/lustre-release/lustre/utils/lfs getstripe -c /mnt/lustre/d56w.sanity/file1 wrong: found 2, expected 1
>>> sanity: FAIL: test_56x migrate failed rc = 11
>>> sanity: FAIL: test_56xa migrate failed rc = 11
>>> sanity: FAIL: test_56z /root/lustre-release/lustre/utils/lfs find did not continue after error
>>> sanity: FAIL: test_56aa lfs find --size wrong under striped dir
>>> sanity: FAIL: test_56ca create /mnt/lustre/d56ca.sanity/f56ca.sanity- failed
>>> sanity: FAIL: test_64b oos.sh failed: 1
>>> sanity: FAIL: test_102c setstripe failed
>>> sanity: FAIL: test_102j file1-0-1: size  != 65536
>> 
>> I believe these are due to the DoM feature missing
>> 
>>> sanity: FAIL: test_103a misc test failed
>> 
>> 103a is real failure. Never solved yet. (LU-11594 and LU-10334 for Ubuntu)
>> 
>>> sanity: FAIL: test_104b lfs check servers test failed
>> 
>> sysfs bug. I have a patch for this.
>> 
>>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
>>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
>>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
>>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
>>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
>> 
>> What version of e2fsprog are you running? You need a 1.44 version and
>> this should go away.
>
> To be clear - the Lustre-patched "filefrag" at:
>
> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
>

I looked at Commit 41aee4226789 ("filefrag: Lustre changes to filefrag FIEMAP handling")
in the git tree instead.

This appears to add 3, features.

- It adds an optional device to struct fiemap.
  Presumably this is always returned if available, else zero is provided
  which means "the device".
- It adds a flag FIEMAP_EXTENT_NET which indicates that the device
  number is *not*  dev_t, but is some fs-specific value
- It allows FIEMAP_FLAG_DEVICE_ORDER to be requested.  I can't quite
  work out what this does.  Presumably it changes the order that entries
  are returned (why?) and maybe returns multiple entries for a region
  that is mirrored ???

As you say, I can see how these might be useful to other filesystems.
Maybe we should try upstreaming the support sooner rather than later.

NeilBrown
NeilBrown Jan. 10, 2019, 1:36 a.m. UTC | #9
On Wed, Jan 09 2019, Andreas Dilger wrote:

> On Jan 9, 2019, at 11:28, James Simmons <jsimmons@infradead.org> wrote:
>> 
>> 
>>>>> This might be because the upstream Lustre doesn't allow setting per-process
>>>>> JobID via environment variable, only as a single per-node value.  The real
>>>>> unfortunate part is that the "get JobID from environment" actually works for
>>>>> every reasonable architecture (even the one which was originally broken
>>>>> fixed it), but it got yanked anyway.  This is actually one of the features
>>>>> of Lustre that lots of HPC sites like to use, since it allows them to track
>>>>> on the servers which users/jobs/processes on the client are doing IO.
>>>> 
>>>> To give background for Neil see thread:
>>>> 
>>>> https://lore.kernel.org/patchwork/patch/416846
>>>> 
>>>> In this case I do agree with Greg. The latest jobid does implement an
>>>> upcall and upcalls don't play niece with containers. Their is also the
>>>> namespace issue pointed out. I think the namespace issue might be fixed
>>>> in the latest OpenSFS code.
>>> 
>>> I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
>>> maybe in the kernel client where we weren't allowed to parse the process
>>> environment directly.  I agree an upcall is problematic with namespaces,
>>> in addition to being less functional (only a JobID per node instead of
>>> per process), which is why direct access to JOBENV is better IMHO.
>> 
>> I have some evil ideas about this. Need to think about it some more since
>> this is a more complex problem.
>
> Since the kernel manages the environment variables via getenv() and setenv(),
> I honestly don't see why accessing them directly is a huge issue?

This is, at best, an over-simplification.  The kernel doesn't "manage" the
environment variables.
When a process calls execve() (or similar) a collection of strings called
"arguments" and another collection of strings called "environment" are
extracted from the processes vm, and used for initializing part of the
newly created vm.  That is all the kernel does with either.
(except for providing /proc/*/cmdline and /proc/*/environ, which is best-effort).

getenv() ad setenv() are entirely implemented in user-space.  It is quite
possible for a process to mess-up its args or environment in a way that
will make /proc/*/{cmdline,environ} fail to return anything useful.

It is quite possible for the memory storing args and env to be swapped
out.  If a driver tried to accesses either, it might trigger page-in of
that part of the address space, which would probably work but might not
be a good idea.

As I understand it, the goal here is to have a cluster-wide identifier
that can be attached to groups of processes on different nodes.  Then
stats relating to all of those processes can be collected together.

If I didn't think that control-groups were an abomination I would
probably suggest using them to define a group of processes, then to
attach a tag to that group. Both the netcl and net_prio cgroups
do exactly this.  perf_event does as well, and even uses the tag exactly
for collecting performance-data together for a set of processes.
Maybe we could try to champion an fs_event control group?
However it is a long time since I've looked at control groups, and they
might have moved on a bit.

But as I do think that control-groups are an abomination, I couldn't
possible suggest any such thing.
Unix already has a perfectly good grouping abstraction - process groups
(unfortunately there are about 3 sorts of these, but that needn't be a
big problem).
Stats can be collected based on pgid, and a mapping from
client+pgid->jobid can be communicated to whatever collects the
statistics ... somehow.

NeilBrown
NeilBrown Jan. 10, 2019, 1:46 a.m. UTC | #10
On Tue, Jan 08 2019, James Simmons wrote:
>
> I have been thinking what would be a better approach since I like to
> approach this problem for the 2.13 time frame. Our admins at my work
> place want to be able to collect application stats without being root.
> So placing stats in debugfs is not an option, which we currently do
> the linux client :-( The stats are not a good fit for sysfs.

How much statistics data are we talking about here?
  /proc/self/mountstats
shows over 2K of stats for NFS filesystems.
Is this in the ball-park or do you need an order-of-magnitude more?

NeilBrown
Andreas Dilger Jan. 10, 2019, 7:28 a.m. UTC | #11
On Jan 9, 2019, at 17:40, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Jan 08 2019, Andreas Dilger wrote:
>> On Jan 7, 2019, at 21:26, James Simmons <jsimmons@infradead.org> wrote:
>>> 
>>>> sanity: FAIL: test_130a filefrag /mnt/lustre/f130a.sanity failed
>>>> sanity: FAIL: test_130b filefrag /mnt/lustre/f130b.sanity failed
>>>> sanity: FAIL: test_130c filefrag /mnt/lustre/f130c.sanity failed
>>>> sanity: FAIL: test_130e filefrag /mnt/lustre/f130e.sanity failed
>>>> sanity: FAIL: test_130f filefrag /mnt/lustre/f130f.sanity failed
>>> 
>>> What version of e2fsprog are you running? You need a 1.44 version and
>>> this should go away.
>> 
>> To be clear - the Lustre-patched "filefrag" at:
>> 
>> https://downloads.whamcloud.com/public/e2fsprogs/1.44.3.wc1/
> 
> I looked at Commit 41aee4226789 ("filefrag: Lustre changes to filefrag] FIEMAP handling") in the git tree instead.
> 
> This appears to add 3 features.
> 
> - It adds an optional device to struct fiemap.
>  Presumably this is always returned if available, else zero is provided
>  which means "the device".

Vanilla filefrag just returns 0 today.  For Lustre filefrag it returns
the OST index on which the blocks are located. For local filesystems
I'm expecting it to return the rdev of the block device, like 0x801 or
similar.

> - It adds a flag FIEMAP_EXTENT_NET which indicates that the device
>  number is *not*  dev_t, but is some fs-specific value

Right.

> - It allows FIEMAP_FLAG_DEVICE_ORDER to be requested.  I can't quite
>  work out what this does.

The logic makes sense once you understand it.  Consider a striped Lustre
file, or perhaps on an MD RAID device.  If you returned the blocks in
file-logical order (i.e. block 0...EOF), then the largest extent that
could be returned for the same device would be stripe_size/chunk_size.
This would be very verbose (e.g. 1TB file with 1MB stripe_size would be
1M lines of output, though still better than the 256M lines from the
old FIBMAP-based filefrag).  This would make it very hard to see if the
file allocation is contiguous or fragmented, which was our original
goal for implementing FIEMAP.

The DEVICE_ORDER flag means "return blocks in the underlying device
order".  This allows returning block extents of the maximum size for
the underlying filesystem (128MB for ext4), and much more clearly
shows whether the underlying file allocation is contiguous or fragmented.
It also simplifies the implementation at the Lustre side, because we
are essentially doing a series of per-OST FIEMAP calls until the OST
object is done, then moving on to the next object in the file.  The
alternative (which I've thought of impementing, just for compatibility
reasons) would be to interleave the FIEMAP output from each OST by the
logical file offset, but it would be ugly and not very useful, except
for tools that want to see if a file has holes or not.

$ filefrag -v /myth/tmp/4stripe 
Filesystem type is: bd00bd0
File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
 ext:     device_logical:        physical_offset: length:  dev: flags:
   0:        0..   28671: 1837711360..1837740031:  28672: 0004: net
   1:        0..   24575: 1280876544..1280901119:  24576: 0000: net
   2:        0..   24575: 1535643648..1535668223:  24576: 0001: net
   3:        0..   24575: 4882608128..4882632703:  24576: 0003: last,net

>  Presumably it changes the order that entries are returned (why?) and
>  maybe returns multiple entries for a region that is mirrored ???

The multiple entries per region is needed for mirrored files.

> As you say, I can see how these might be useful to other filesystems.
> Maybe we should try upstreaming the support sooner rather than later.

I've tried a few times, but have been rebuffed because Lustre isn't
in the mainline.  Originally, BtrFS wasn't going to have multiple device
support, but that has changed since the time FIEMAP was introduced.

I'd of course be happy if it was in mainline, or at least the fields
in struct fiemap_extent reserved to avoid future conflicts.  There
was also a proposal from SuSE for BtrFS to add support for compressed
extents, but it never quite made it over the finish line:

   David Serba "fiemap: introduce EXTENT_DATA_COMPRESSED flag"

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
Andreas Dilger Jan. 10, 2019, 7:41 a.m. UTC | #12
On Jan 9, 2019, at 18:46, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Jan 08 2019, James Simmons wrote:
>> 
>> I have been thinking what would be a better approach since I like to
>> approach this problem for the 2.13 time frame. Our admins at my work
>> place want to be able to collect application stats without being root.
>> So placing stats in debugfs is not an option, which we currently do
>> the linux client :-( The stats are not a good fit for sysfs.
> 
> How much statistics data are we talking about here?
>  /proc/self/mountstats
> shows over 2K of stats for NFS filesystems.
> Is this in the ball-park or do you need an order-of-magnitude more?

Ah, the joys of being grandfathered into the code...  One of the larger
normal /proc files is the "obdfilter.*.brw_stats" files, which I agree
is a bit of an abomination of ASCII formatted output.

Most of the regular stats files are about 1KB in size, like:

wc /proc/fs/lustre//osc/myth-OST000*/stats
14 107 1028 /proc/fs/lustre//osc/myth-OST0000-osc-ffff880429ee7c00/stats
14 107 1011 /proc/fs/lustre//osc/myth-OST0001-osc-ffff880429ee7c00/stats
13 99 989 /proc/fs/lustre//osc/myth-OST0002-osc-ffff880429ee7c00/stats
14 107 1043 /proc/fs/lustre//osc/myth-OST0003-osc-ffff880429ee7c00/stats
14 107 1075 /proc/fs/lustre//osc/myth-OST0004-osc-ffff880429ee7c00/stats

The {obdfilter,mdt}.*.job_stats files can become quite big on a large
system if there are lots of jobs running.  James would have to report
on what kind of sizes they get on their nearly-largest-in-the-world
filesystem.  Definitely into the MB range.  I don't think there would be
a need to poll that super frequently, maybe in the 60s range, as it keeps
the stats for some minutes after a job stops IO, or it would be impossible
to gather accurate stats for the whole job.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
Andreas Dilger Jan. 10, 2019, 9:10 a.m. UTC | #13
On Jan 9, 2019, at 18:36, NeilBrown <neilb@suse.com> wrote:
> 
> On Wed, Jan 09 2019, Andreas Dilger wrote:
> 
>> On Jan 9, 2019, at 11:28, James Simmons <jsimmons@infradead.org> wrote:
>>> 
>>>>>> This might be because the upstream Lustre doesn't allow setting
>>>>>> per-process JobID via environment variable, only as a single
>>>>>> per-node value.  The real unfortunate part is that the "get JobID
>>>>>> from environment" actually works for every reasonable architecture
>>>>>> (even the one which was originally broken fixed it), but it got
>>>>>> yanked anyway.  This is actually one of the features of Lustre that
>>>>>> lots of HPC sites like to use, since it allows them to track on the
>>>>>> servers which users/jobs/processes on the client are doing IO.
>>>>> 
>>>>> To give background for Neil see thread:
>>>>> 
>>>>> https://lore.kernel.org/patchwork/patch/416846
>>>>> 
>>>>> In this case I do agree with Greg. The latest jobid does implement an
>>>>> upcall and upcalls don't play niece with containers. Their is also the
>>>>> namespace issue pointed out. I think the namespace issue might be fixed
>>>>> in the latest OpenSFS code.
>>>> 
>>>> I'm not sure what you mean?  AFAIK, there is no upcall for JobID, except
>>>> maybe in the kernel client where we weren't allowed to parse the process
>>>> environment directly.  I agree an upcall is problematic with namespaces,
>>>> in addition to being less functional (only a JobID per node instead of
>>>> per process), which is why direct access to JOBENV is better IMHO.
>>> 
>>> I have some evil ideas about this. Need to think about it some more since
>>> this is a more complex problem.
>> 
>> Since the kernel manages the environment variables via getenv() and setenv(), I honestly don't see why accessing them directly is a huge issue?
> 
> This is, at best, an over-simplification.  The kernel doesn't "manage" the
> environment variables.
> When a process calls execve() (or similar) a collection of strings called
> "arguments" and another collection of strings called "environment" are
> extracted from the processes vm, and used for initializing part of the
> newly created vm.  That is all the kernel does with either.
> (except for providing /proc/*/cmdline and /proc/*/environ, which is best-effort).

Sure, and we only provide a best effort at parsing it as a series of NUL-
terminated strings.  Userspace can't corrupt the kernel VMA mappings,
so at worst we don't find anything we are looking for, which can also
happen if no JobID is set in the first place.  It's not really any more
dangerous than any copy_from_user() in the filesystem/ioctl code.

> getenv() ad setenv() are entirely implemented in user-space.  It is quite
> possible for a process to mess-up its args or environment in a way that
> will make /proc/*/{cmdline,environ} fail to return anything useful.

If userspace has also messed it up so badly that it can't parse the
environment variables themselves, then even a userspace upcall isn't
going to work.

> It is quite possible for the memory storing args and env to be swapped
> out.  If a driver tried to accesses either, it might trigger page-in of
> that part of the address space, which would probably work but might not
> be a good idea.

I've never seen a report of problems like this.  Processes that are
swapped out are probably not going to be submitting IO either...  We
cache the JobID in the kernel so it is only fetched on the first IO
for that process ID.  There once was a bug where the JobID was fetched
during mmap IO which caused a deadlock, and was since fixed.  We also
added the JobID cache, which has reduced the overhead significantly.

> As I understand it, the goal here is to have a cluster-wide identifier
> that can be attached to groups of processes on different nodes.  Then
> stats relating to all of those processes can be collected together.

Correct, but it isn't just _any_ system-wide identifier.  The large
parallel MPI applications already get assigned an identifier by the
batch scheduler before they are run, and a large number of tools in
these systems use JobID for tracking logs, CPU/IO accounting, etc.

The JobID is stored in an environment variable (e.g. SLURM_JOB_ID)
by the batch scheduler before the actual job is forked.  See the
comment at the start of lustre/obdclass/lprocfs_jobstats.c for
examples.  We can also set artificial jobid values for debugging
or use with systems not using MPI (e.g. procname_uid), but they do
not need access to the process environment.

For Lustre, the admin does a one-time configuration of the name of
the environment variable ("lctl conf_param jobid_var=SLURM_JOB_ID")
to tell the kernel which environment variable to use.

> ... But as I do think that control-groups are an abomination, I couldn't
> possible suggest any such thing.
> Unix already has a perfectly good grouping abstraction - process groups
> (unfortunately there are about 3 sorts of these, but that needn't be a
> big problem).  Stats can be collected based on pgid, and a mapping from
> client+pgid->jobid can be communicated to whatever collects the
> statistics ... somehow.

So, right now we have "scan a few KB of kernel memory for a string"
periodically in the out-of-tree client (see jobid_get_from_environ()
and cfs_get_environ()), and then a hash table that caches the JobID
internally and maps the pid to the JobID when it is needed.  Most of
the code is an simplified copy of access_process_vm() for kernels after
v2.6.24-rc1-652-g02c3530da6b9 when it was un-EXPORT_SYMBOL'd, but
since kernel v4.9-rc3-36-gfcd35857d662 it is again exported so it makes
sense to add a configure check.  Most of the rest is for when the
variable or value crosses a page boundary.


Conversely, the kernel client has something like "upcall a userspace
process, fork a program (millions of cycles), have that program do the
same scan of the kernel environment memory, but now it is doing it in
userspace, open a file, write the environment variable to the kernel,
exit and clean up the process that was created" to do the same thing.


Using a pgid seems mostly unusable, since the stats are not collected
on the client, they are collected on the server (the JobID is sent with
every userspace-driven RPC to the server), which is the centralized
location where all clients submit their IO.  JobStats gives us relatively
easy and direct method to see which client process(es) are going a lot of
IO or RPCs, just looking into a /proc file if necessary (though they are
typically further centralized and monitored from the multiple servers).

We can't send a different pgid from each client along with the RPCs and
hope to aggregate that at the server without adding huge complexity.  We
would need real-time mapping from every new pgid on each client (maybe
thousands per second per client) to the JobID then passed to the MDS/OSS
so that they can reverse-map the pgid back into a JobID before the first
RPC arrives at the server.  Alternately, track separate stats for each client:pgid combination on the server (num_cores * clients = millions of
times more than today if there are multiple jobs per client) until they
are fetched into userspace for mapping and re-aggregation.


Thanks, but I'd rather stick with the relatively simple and direct method
we are using today.  It's worked without problems for 10 years of kernels.
I think that is one of the big obstacles that we face with many of the
upstream kernel maintainers, is that they are focussed on issues that are
local to a one or a few nodes, but we have to deal with issues that may
involve hundreds or thousands of different nodes working as a single task
(unlike cloud stuff where there may be many nodes, but they are all doing
things independently).  It's not that we develop crazy things because we
have spare time to burn, but because they are needed to deal sanely with
such environments.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud

Patch
diff mbox series

diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h
index 1428fdd..e7a2eda 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ver.h
@@ -2,10 +2,10 @@ 
 #define _LUSTRE_VER_H_
 
 #define LUSTRE_MAJOR 2
-#define LUSTRE_MINOR 8
+#define LUSTRE_MINOR 9
 #define LUSTRE_PATCH 99
 #define LUSTRE_FIX 0
-#define LUSTRE_VERSION_STRING "2.8.99"
+#define LUSTRE_VERSION_STRING "2.9.99"
 
 #define OBD_OCD_VERSION(major, minor, patch, fix)			\
 	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))