mbox series

[RFC,0/3] ptp: Add esterror support

Message ID 20240813125602.155827-1-maciek@machnikowski.net (mailing list archive)
Headers show
Series ptp: Add esterror support | expand

Message

Maciek Machnikowski Aug. 13, 2024, 12:55 p.m. UTC
This patch series implements handling of timex esterror field
by ptp devices.

Esterror field can be used to return or set the estimated error
of the clock. This is useful for devices containing a hardware
clock that is controlled and synchronized internally (such as
a time card) or when the synchronization is pushed to the embedded
CPU of a DPU.

Current implementation of ADJ_ESTERROR can enable pushing
current offset of the clock calculated by a userspace app
to the device, which can act upon this information by enabling
or disabling time-related functions when certain boundaries
are not met (eg. packet launchtime)


Maciek Machnikowski (3):
  ptp: Implement timex esterror support
  ptp: Implement support for esterror in ptp_mock
  ptp: Add setting esterror and reading timex structure

 drivers/ptp/ptp_clock.c               | 14 +++++++++-
 drivers/ptp/ptp_mock.c                | 30 +++++++++++++++++++++
 include/linux/ptp_clock_kernel.h      | 11 ++++++++
 tools/testing/selftests/ptp/testptp.c | 39 +++++++++++++++++++++++++--
 4 files changed, 91 insertions(+), 3 deletions(-)

Comments

Andrew Lunn Aug. 13, 2024, 8:05 p.m. UTC | #1
On Tue, Aug 13, 2024 at 12:55:59PM +0000, Maciek Machnikowski wrote:
> This patch series implements handling of timex esterror field
> by ptp devices.
> 
> Esterror field can be used to return or set the estimated error
> of the clock. This is useful for devices containing a hardware
> clock that is controlled and synchronized internally (such as
> a time card) or when the synchronization is pushed to the embedded
> CPU of a DPU.

How can you set the estimated error of a clock? Isn't it a properties
of the hardware, and maybe the network link? A 10BaseT/Half duplex
link is going to have a bigger error than a 1000BaseT link because the
bits take longer on the wire etc.

What is the device supposed to do with the set value?

     Andrew
Vadim Fedorenko Aug. 14, 2024, 8:44 a.m. UTC | #2
On 13/08/2024 21:05, Andrew Lunn wrote:
> On Tue, Aug 13, 2024 at 12:55:59PM +0000, Maciek Machnikowski wrote:
>> This patch series implements handling of timex esterror field
>> by ptp devices.
>>
>> Esterror field can be used to return or set the estimated error
>> of the clock. This is useful for devices containing a hardware
>> clock that is controlled and synchronized internally (such as
>> a time card) or when the synchronization is pushed to the embedded
>> CPU of a DPU.
> 
> How can you set the estimated error of a clock? Isn't it a properties
> of the hardware, and maybe the network link? A 10BaseT/Half duplex
> link is going to have a bigger error than a 1000BaseT link because the
> bits take longer on the wire etc.

AFAIU, it's in the spec of the hardware, but it can change depending on
the environment, like temperature. The link speed doesn't matter here,
this property can be used to calculate possible drift of the clock in
the micro holdover mode (between sync points).

> What is the device supposed to do with the set value?

It can be used to report the value back to user-space to calculate the
boundaries of "true time" returned by the hardware.

Thanks!
Andrew Lunn Aug. 14, 2024, 1:08 p.m. UTC | #3
On Wed, Aug 14, 2024 at 09:44:29AM +0100, Vadim Fedorenko wrote:
> On 13/08/2024 21:05, Andrew Lunn wrote:
> > On Tue, Aug 13, 2024 at 12:55:59PM +0000, Maciek Machnikowski wrote:
> > > This patch series implements handling of timex esterror field
> > > by ptp devices.
> > > 
> > > Esterror field can be used to return or set the estimated error
> > > of the clock. This is useful for devices containing a hardware
> > > clock that is controlled and synchronized internally (such as
> > > a time card) or when the synchronization is pushed to the embedded
> > > CPU of a DPU.
> > 
> > How can you set the estimated error of a clock? Isn't it a properties
> > of the hardware, and maybe the network link? A 10BaseT/Half duplex
> > link is going to have a bigger error than a 1000BaseT link because the
> > bits take longer on the wire etc.
> 
> AFAIU, it's in the spec of the hardware, but it can change depending on
> the environment, like temperature. The link speed doesn't matter here,
> this property can be used to calculate possible drift of the clock in
> the micro holdover mode (between sync points).

Is there a clear definition then? Could you reference a standard
indicating what is included and excluded from this?

> > What is the device supposed to do with the set value?
> 
> It can be used to report the value back to user-space to calculate the
> boundaries of "true time" returned by the hardware.

So the driver itself does not know its own error? It has to be told
it, so it can report it back to user space. Then why bother, just put
it directly into the ptp4l configuration file?

Maybe this is all obvious to somebody who knows PTP inside out, but to
me it is not. Please could you put a better explanation and
justification into the commit message. We need PHY driver writers who
have limited idea about PTP can implement these new calls.

	Andrew
Maciek Machnikowski Aug. 14, 2024, 3:08 p.m. UTC | #4
On 14/08/2024 15:08, Andrew Lunn wrote:
> On Wed, Aug 14, 2024 at 09:44:29AM +0100, Vadim Fedorenko wrote:
>> On 13/08/2024 21:05, Andrew Lunn wrote:
>>> On Tue, Aug 13, 2024 at 12:55:59PM +0000, Maciek Machnikowski wrote:
>>>> This patch series implements handling of timex esterror field
>>>> by ptp devices.
>>>>
>>>> Esterror field can be used to return or set the estimated error
>>>> of the clock. This is useful for devices containing a hardware
>>>> clock that is controlled and synchronized internally (such as
>>>> a time card) or when the synchronization is pushed to the embedded
>>>> CPU of a DPU.
>>>
>>> How can you set the estimated error of a clock? Isn't it a properties
>>> of the hardware, and maybe the network link? A 10BaseT/Half duplex
>>> link is going to have a bigger error than a 1000BaseT link because the
>>> bits take longer on the wire etc.
>>
>> AFAIU, it's in the spec of the hardware, but it can change depending on
>> the environment, like temperature. The link speed doesn't matter here,
>> this property can be used to calculate possible drift of the clock in
>> the micro holdover mode (between sync points).
> 
> Is there a clear definition then? Could you reference a standard
> indicating what is included and excluded from this?
> 
The esterror should return the error calculated by the device. There is
no standard defining this, but the simplest implementation can put the
offset calculated by the ptp daemon, or the offset to the nearest PPS in
cases where PPS is used as a source of time


>>> What is the device supposed to do with the set value?
>>
>> It can be used to report the value back to user-space to calculate the
>> boundaries of "true time" returned by the hardware.
> 
> So the driver itself does not know its own error? It has to be told
> it, so it can report it back to user space. Then why bother, just put
> it directly into the ptp4l configuration file?
> 
> Maybe this is all obvious to somebody who knows PTP inside out, but to
> me it is not. Please could you put a better explanation and
> justification into the commit message. We need PHY driver writers who
> have limited idea about PTP can implement these new calls.
>
> Andrew

It's designed to enable devices that either synchronize its own hw clock
to some source of time on a hardware layer (e.g. a Timecard that uses a
PPS signal from the GNSS), or a device in which the PTP is done on a
different function (multifunction NIC, or a DPU) to convey its best
estimate of the error (see above). In case of a multifunction NIC the
ptp daemon running on a function that synchronizes the clock will use
the ADJ_ESTERROR to push the error estimate to the device. The device
will then be able to
a. provide that information to hardware clocks on different functions
b. prevent time-dependent functionalities from acting when a clock
shifts beyond a predefined limit.

Also esterror is not maxerror and is not designed to include all errors,
but the best estimate

Thanks
Maciek
Jakub Kicinski Aug. 15, 2024, 12:41 a.m. UTC | #5
On Tue, 13 Aug 2024 12:55:59 +0000 Maciek Machnikowski wrote:
> This patch series implements handling of timex esterror field
> by ptp devices.
> 
> Esterror field can be used to return or set the estimated error
> of the clock. This is useful for devices containing a hardware
> clock that is controlled and synchronized internally (such as
> a time card) or when the synchronization is pushed to the embedded
> CPU of a DPU.
> 
> Current implementation of ADJ_ESTERROR can enable pushing
> current offset of the clock calculated by a userspace app
> to the device, which can act upon this information by enabling
> or disabling time-related functions when certain boundaries
> are not met (eg. packet launchtime)

Please do CC people who are working on the VM / PTP / vdso thing,
and time people like tglx. And an implementation for a real device
would be nice to establish attention.
Richard Cochran Aug. 15, 2024, 3:33 a.m. UTC | #6
On Wed, Aug 14, 2024 at 03:08:07PM +0200, Andrew Lunn wrote:

> So the driver itself does not know its own error? It has to be told
> it, so it can report it back to user space. Then why bother, just put
> it directly into the ptp4l configuration file?

This way my first reaction as well.

Thanks,
Richard
Richard Cochran Aug. 15, 2024, 3:42 a.m. UTC | #7
On Wed, Aug 14, 2024 at 05:41:47PM -0700, Jakub Kicinski wrote:

> Please do CC people who are working on the VM / PTP / vdso thing,
> and time people like tglx.

and CC John Stultz please

Thanks,
Richard
Richard Cochran Aug. 15, 2024, 3:53 a.m. UTC | #8
On Wed, Aug 14, 2024 at 05:08:24PM +0200, Maciek Machnikowski wrote:

> The esterror should return the error calculated by the device. There is
> no standard defining this, but the simplest implementation can put the
> offset calculated by the ptp daemon, or the offset to the nearest PPS in
> cases where PPS is used as a source of time

So user space produces the number, and other user space consumes it?

Sounds like it should say in user space, shared over some IPC, like
PTP management messages for example.

> the ADJ_ESTERROR to push the error estimate to the device. The device
> will then be able to
> a. provide that information to hardware clocks on different functions

Not really, because there no in-kernel API for one device to query
another.  At least you didn't provide one.  Looks like you mean that
some user space program sets the value, and another program reads it
to share it with other devices?

> b. prevent time-dependent functionalities from acting when a clock
> shifts beyond a predefined limit.

This can be controlled by the user space stack.  For example, "if
estimated error exceeds threshold, disable PPS output".

Thanks,
Richard
Richard Cochran Aug. 15, 2024, 4:16 a.m. UTC | #9
On Wed, Aug 14, 2024 at 08:33:26PM -0700, Richard Cochran wrote:
> On Wed, Aug 14, 2024 at 03:08:07PM +0200, Andrew Lunn wrote:
> 
> > So the driver itself does not know its own error? It has to be told
> > it, so it can report it back to user space. Then why bother, just put
> > it directly into the ptp4l configuration file?
> 
> This way my first reaction as well.

Actually, looking at the NTP code, we have:

void process_adjtimex_modes(const struct __kernel_timex *txc,)
{
	...
	if (txc->modes & ADJ_ESTERROR)
		time_esterror = txc->esterror;
	...
}

So I guess PHCs should also support setting this from user space?

adding CC: Miroslav

At least it would be consistent.


Thanks,
Richard
Richard Cochran Aug. 15, 2024, 4:17 a.m. UTC | #10
On Wed, Aug 14, 2024 at 08:42:57PM -0700, Richard Cochran wrote:
> and CC John Stultz please

and Miroslav.

Thanks,
Richard
Miroslav Lichvar Aug. 15, 2024, 5:24 a.m. UTC | #11
On Wed, Aug 14, 2024 at 09:16:12PM -0700, Richard Cochran wrote:
> Actually, looking at the NTP code, we have:
> 
> void process_adjtimex_modes(const struct __kernel_timex *txc,)
> {
> 	...
> 	if (txc->modes & ADJ_ESTERROR)
> 		time_esterror = txc->esterror;
> 	...
> }
> 
> So I guess PHCs should also support setting this from user space?

Yes, I'd like that very much. It would allow other applications to get
the estimated error of the clock they are using. Maxerror would be
nice too, even if it didn't increase automatically at 500 ppm as the
system clock. IIRC this was proposed before for the cross-timestamping
PHC between VM host and guests, but there wasn't much interest at the
time.
Maciek Machnikowski Aug. 15, 2024, 9:35 a.m. UTC | #12
On 15/08/2024 02:41, Jakub Kicinski wrote:
> On Tue, 13 Aug 2024 12:55:59 +0000 Maciek Machnikowski wrote:
>> This patch series implements handling of timex esterror field
>> by ptp devices.
>>
>> Esterror field can be used to return or set the estimated error
>> of the clock. This is useful for devices containing a hardware
>> clock that is controlled and synchronized internally (such as
>> a time card) or when the synchronization is pushed to the embedded
>> CPU of a DPU.
>>
>> Current implementation of ADJ_ESTERROR can enable pushing
>> current offset of the clock calculated by a userspace app
>> to the device, which can act upon this information by enabling
>> or disabling time-related functions when certain boundaries
>> are not met (eg. packet launchtime)
> 
> Please do CC people who are working on the VM / PTP / vdso thing,
> and time people like tglx. And an implementation for a real device
> would be nice to establish attention.

Noted - will CC in future releases.

AWS planned the implementation in ENA driver - will work with David on
making it part of the patchset once we "upgrade" from an RFC - but
wanted to discuss the approach first.
Maciek Machnikowski Aug. 15, 2024, 9:40 a.m. UTC | #13
On 15/08/2024 05:53, Richard Cochran wrote:
> On Wed, Aug 14, 2024 at 05:08:24PM +0200, Maciek Machnikowski wrote:
> 
>> The esterror should return the error calculated by the device. There is
>> no standard defining this, but the simplest implementation can put the
>> offset calculated by the ptp daemon, or the offset to the nearest PPS in
>> cases where PPS is used as a source of time
> 
> So user space produces the number, and other user space consumes it?
> 
> Sounds like it should say in user space, shared over some IPC, like
> PTP management messages for example.

The user spaces may run on completely isolated platforms in isolated
network with no direct path to communicate that.
I'm well aware of different solutions on the same platform (libpmc, AWS
Nitro or Clock Manager) , but this patchset tries to address different
use case
> 
>> the ADJ_ESTERROR to push the error estimate to the device. The device
>> will then be able to
>> a. provide that information to hardware clocks on different functions
> 
> Not really, because there no in-kernel API for one device to query
> another.  At least you didn't provide one.  Looks like you mean that
> some user space program sets the value, and another program reads it
> to share it with other devices?
> 
>> b. prevent time-dependent functionalities from acting when a clock
>> shifts beyond a predefined limit.
> 
> This can be controlled by the user space stack.  For example, "if
> estimated error exceeds threshold, disable PPS output".
> 
> Thanks,
> Richard
It's doable if everything runs on the same OS, but there are cases where
time sync is part of management and we will want to disable features on
different PFs/VFs as a result of bad time synchronizations. Such
features include launchtime or time-based congestion control which are
part of infrastructure.
Maciek Machnikowski Aug. 15, 2024, 9:41 a.m. UTC | #14
On 15/08/2024 06:16, Richard Cochran wrote:
> On Wed, Aug 14, 2024 at 08:33:26PM -0700, Richard Cochran wrote:
>> On Wed, Aug 14, 2024 at 03:08:07PM +0200, Andrew Lunn wrote:
>>
>>> So the driver itself does not know its own error? It has to be told
>>> it, so it can report it back to user space. Then why bother, just put
>>> it directly into the ptp4l configuration file?
>>
>> This way my first reaction as well.
> 
> Actually, looking at the NTP code, we have:
> 
> void process_adjtimex_modes(const struct __kernel_timex *txc,)
> {
> 	...
> 	if (txc->modes & ADJ_ESTERROR)
> 		time_esterror = txc->esterror;
> 	...
> }
> 
> So I guess PHCs should also support setting this from user space?
> 
> adding CC: Miroslav
> 
> At least it would be consistent.
> 
> 
> Thanks,
> Richard

Yes! It's exactly what inspired this patchset :)
Maciek Machnikowski Aug. 15, 2024, 9:45 a.m. UTC | #15
On 15/08/2024 07:24, Miroslav Lichvar wrote:
> On Wed, Aug 14, 2024 at 09:16:12PM -0700, Richard Cochran wrote:
>> Actually, looking at the NTP code, we have:
>>
>> void process_adjtimex_modes(const struct __kernel_timex *txc,)
>> {
>> 	...
>> 	if (txc->modes & ADJ_ESTERROR)
>> 		time_esterror = txc->esterror;
>> 	...
>> }
>>
>> So I guess PHCs should also support setting this from user space?
> 
> Yes, I'd like that very much. It would allow other applications to get
> the estimated error of the clock they are using. Maxerror would be
> nice too, even if it didn't increase automatically at 500 ppm as the
> system clock. IIRC this was proposed before for the cross-timestamping
> PHC between VM host and guests, but there wasn't much interest at the
> time.
> 
Thanks, I plan to include maxerror later (not sure if it should be in
the same RFC, or a separate one)
Vadim Fedorenko Aug. 15, 2024, 10:29 a.m. UTC | #16
On 15/08/2024 10:35, Maciek Machnikowski wrote:
> > 
> 
> 
> On 15/08/2024 02:41, Jakub Kicinski wrote:
>> On Tue, 13 Aug 2024 12:55:59 +0000 Maciek Machnikowski wrote:
>>> This patch series implements handling of timex esterror field
>>> by ptp devices.
>>>
>>> Esterror field can be used to return or set the estimated error
>>> of the clock. This is useful for devices containing a hardware
>>> clock that is controlled and synchronized internally (such as
>>> a time card) or when the synchronization is pushed to the embedded
>>> CPU of a DPU.
>>>
>>> Current implementation of ADJ_ESTERROR can enable pushing
>>> current offset of the clock calculated by a userspace app
>>> to the device, which can act upon this information by enabling
>>> or disabling time-related functions when certain boundaries
>>> are not met (eg. packet launchtime)
>>
>> Please do CC people who are working on the VM / PTP / vdso thing,
>> and time people like tglx. And an implementation for a real device
>> would be nice to establish attention.
> 
> Noted - will CC in future releases.
> 
> AWS planned the implementation in ENA driver - will work with David on
> making it part of the patchset once we "upgrade" from an RFC - but
> wanted to discuss the approach first.

I can implement the interface for OCP TimeCard too, I think we have HW
information about it already.
David Arinzon Aug. 15, 2024, 11:40 a.m. UTC | #17
> >
> >
> > On 15/08/2024 02:41, Jakub Kicinski wrote:
> >> On Tue, 13 Aug 2024 12:55:59 +0000 Maciek Machnikowski wrote:
> >>> This patch series implements handling of timex esterror field by ptp
> >>> devices.
> >>>
> >>> Esterror field can be used to return or set the estimated error of
> >>> the clock. This is useful for devices containing a hardware clock
> >>> that is controlled and synchronized internally (such as a time card)
> >>> or when the synchronization is pushed to the embedded CPU of a DPU.
> >>>
> >>> Current implementation of ADJ_ESTERROR can enable pushing current
> >>> offset of the clock calculated by a userspace app to the device,
> >>> which can act upon this information by enabling or disabling
> >>> time-related functions when certain boundaries are not met (eg.
> >>> packet launchtime)
> >>
> >> Please do CC people who are working on the VM / PTP / vdso thing, and
> >> time people like tglx. And an implementation for a real device would
> >> be nice to establish attention.
> >
> > Noted - will CC in future releases.
> >
> > AWS planned the implementation in ENA driver - will work with David on
> > making it part of the patchset once we "upgrade" from an RFC - but
> > wanted to discuss the approach first.
> 
> I can implement the interface for OCP TimeCard too, I think we have HW
> information about it already.

CC Amit Bernstein (AWS, ENA driver).
Andrew Lunn Aug. 15, 2024, 2:26 p.m. UTC | #18
On Thu, Aug 15, 2024 at 11:40:28AM +0200, Maciek Machnikowski wrote:
> 
> 
> On 15/08/2024 05:53, Richard Cochran wrote:
> > On Wed, Aug 14, 2024 at 05:08:24PM +0200, Maciek Machnikowski wrote:
> > 
> >> The esterror should return the error calculated by the device. There is
> >> no standard defining this, but the simplest implementation can put the
> >> offset calculated by the ptp daemon, or the offset to the nearest PPS in
> >> cases where PPS is used as a source of time
> > 
> > So user space produces the number, and other user space consumes it?
> > 
> > Sounds like it should say in user space, shared over some IPC, like
> > PTP management messages for example.
> 
> The user spaces may run on completely isolated platforms in isolated
> network with no direct path to communicate that.
> I'm well aware of different solutions on the same platform (libpmc, AWS
> Nitro or Clock Manager) , but this patchset tries to address different
> use case

So this in effect is just a communication mechanism between two user
space processes. The device itself does not know its own error, and
when told about its error, it does nothing. So why add new driver API
calls? It seems like the core should be able to handle this. You then
don't need a details explanation of the API which a PHY driver writer
can understand...

       Andrew
Maciek Machnikowski Aug. 15, 2024, 3 p.m. UTC | #19
On 15/08/2024 16:26, Andrew Lunn wrote:
> On Thu, Aug 15, 2024 at 11:40:28AM +0200, Maciek Machnikowski wrote:
>>
>>
>> On 15/08/2024 05:53, Richard Cochran wrote:
>>> On Wed, Aug 14, 2024 at 05:08:24PM +0200, Maciek Machnikowski wrote:
>>>
>>>> The esterror should return the error calculated by the device. There is
>>>> no standard defining this, but the simplest implementation can put the
>>>> offset calculated by the ptp daemon, or the offset to the nearest PPS in
>>>> cases where PPS is used as a source of time
>>>
>>> So user space produces the number, and other user space consumes it?
>>>
>>> Sounds like it should say in user space, shared over some IPC, like
>>> PTP management messages for example.
>>
>> The user spaces may run on completely isolated platforms in isolated
>> network with no direct path to communicate that.
>> I'm well aware of different solutions on the same platform (libpmc, AWS
>> Nitro or Clock Manager) , but this patchset tries to address different
>> use case
> 
> So this in effect is just a communication mechanism between two user
> space processes. The device itself does not know its own error, and
> when told about its error, it does nothing. So why add new driver API
> calls? It seems like the core should be able to handle this. You then
> don't need a details explanation of the API which a PHY driver writer
> can understand...
> 
>        Andrew

No - it is not the main use case. The easiest one to understand would be
the following:

Think about a Time Card
(https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction).

It is a device that exposes the precise time to the user space using the
PTP subsystem, but it is an autonomous device and the synchronization is
implemented on the hardware layer.
In this case no user space process is now aware of what is the expected
estimated error, because that is only known to the HW and its control loop.
And this information is needed for the aforementioned userspace
processes to calculate error boundaries (time uncertaninty) of a given
clock.

-Maciek
Richard Cochran Aug. 15, 2024, 9:08 p.m. UTC | #20
On Thu, Aug 15, 2024 at 05:00:24PM +0200, Maciek Machnikowski wrote:

> Think about a Time Card
> (https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction).

No, I won't think about that!

You need to present the technical details in the form of patches.

Hand-wavey hints don't cut it.

Thanks,
Richard
Maciek Machnikowski Aug. 15, 2024, 10:06 p.m. UTC | #21
On 15/08/2024 23:08, Richard Cochran wrote:
> On Thu, Aug 15, 2024 at 05:00:24PM +0200, Maciek Machnikowski wrote:
> 
>> Think about a Time Card
>> (https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction).
> 
> No, I won't think about that!
> 
> You need to present the technical details in the form of patches.
> 
> Hand-wavey hints don't cut it.
> 
> Thanks,
> Richard

This implementation addresses 3 use cases:

1. Autonomous devices that synchronize themselves to some external
sources (GNSS, NTP, dedicated time sync networks) and have the ability
to return the estimated error from the HW or FW loop to users

2. Multi function devices that may have a single isolated function
synchronizing the device clock (by means of PTP, or PPS or any other)
and letting other functions access the uncertainty information

3. Create a common interface to read the uncertainty from a device
(currently you can use PMC for PTP, but there is no way of receiving
that information from ts2phc)

#1 and #2 requires an interface to the driver to retrieve the error from
a device
#2 requires an interface to adjust the esterror and push it to the device
#3 can be terminated locally in the ptp class driver using the same
principle as the presented ptp_mock implementation, or something more
sophisticated

Also this is an RFC to help align work on this functionality across
different devices ] and validate if that's the right direction. If it is
- there will be a patch series with real drivers returning uncertainty
information using that interface. If it's not - I'd like to understand
what should I improve in the interface.

HTH
Maciek
Andrew Lunn Aug. 15, 2024, 11:11 p.m. UTC | #22
On Fri, Aug 16, 2024 at 12:06:51AM +0200, Maciek Machnikowski wrote:
> 
> 
> On 15/08/2024 23:08, Richard Cochran wrote:
> > On Thu, Aug 15, 2024 at 05:00:24PM +0200, Maciek Machnikowski wrote:
> > 
> >> Think about a Time Card
> >> (https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction).
> > 
> > No, I won't think about that!
> > 
> > You need to present the technical details in the form of patches.
> > 
> > Hand-wavey hints don't cut it.
> > 
> > Thanks,
> > Richard
> 
> This implementation addresses 3 use cases:
> 
> 1. Autonomous devices that synchronize themselves to some external
> sources (GNSS, NTP, dedicated time sync networks) and have the ability
> to return the estimated error from the HW or FW loop to users

So this contradicts what you said earlier, when you said the device
does not know its own error, it has to be told it.

So what is user space supposed to do with this error? And given that
you said it is undefined what this error includes and excludes, how is
user space supposed to deal with the error in the error? Given how
poorly this is defined, what is user space supposed to do when the
device changes the definition of the error?

The message Richard has always given is that those who care about
errors freeze their kernel and do measurement campaign to determine
what the real error is and then configure user space to deal with
it. Does this error value negate the need for this?

> 2. Multi function devices that may have a single isolated function
> synchronizing the device clock (by means of PTP, or PPS or any other)
> and letting other functions access the uncertainty information

So this is the simple message passing API, which could be implemented
purely in the core? This sounds like it should be a patch of its own,
explaining the use case. 

> 3. Create a common interface to read the uncertainty from a device
> (currently you can use PMC for PTP, but there is no way of receiving
> that information from ts2phc)

That sounds like a problem with ts2phc? Please could you expand on why
the kernel should be involved in feature deficits of user space tools?

> Also this is an RFC to help align work on this functionality across
> different devices ] and validate if that's the right direction. If it is
> - there will be a patch series with real drivers returning uncertainty
> information using that interface. If it's not - I'd like to understand
> what should I improve in the interface.

I think you took the wrong approach. You should first state in detail
the use cases. Then show how you solve each use cases, both the user
and kernel space parts, and include the needed changes to a real
device driver.

	Andrew
Maciek Machnikowski Aug. 16, 2024, 4:13 a.m. UTC | #23
On 16/08/2024 01:11, Andrew Lunn wrote:
> On Fri, Aug 16, 2024 at 12:06:51AM +0200, Maciek Machnikowski wrote:
>>
>>
>> On 15/08/2024 23:08, Richard Cochran wrote:
>>> On Thu, Aug 15, 2024 at 05:00:24PM +0200, Maciek Machnikowski wrote:
>>>
>>>> Think about a Time Card
>>>> (https://opencomputeproject.github.io/Time-Appliance-Project/docs/time-card/introduction).
>>>
>>> No, I won't think about that!
>>>
>>> You need to present the technical details in the form of patches.
>>>
>>> Hand-wavey hints don't cut it.
>>>
>>> Thanks,
>>> Richard
>>
>> This implementation addresses 3 use cases:
>>
>> 1. Autonomous devices that synchronize themselves to some external
>> sources (GNSS, NTP, dedicated time sync networks) and have the ability
>> to return the estimated error from the HW or FW loop to users
> 
> So this contradicts what you said earlier, when you said the device
> does not know its own error, it has to be told it.

No - it’s a different type of device.

> So what is user space supposed to do with this error? And given that
> you said it is undefined what this error includes and excludes, how is
> user space supposed to deal with the error in the error? Given how
> poorly this is defined, what is user space supposed to do when the
> device changes the definition of the error?

Esterror returns the last error to the master clock that the device
synchronizes to.
In the case of PPS - is the last error registered on the top of the second.
In the case of PTP - the last error is calculated based on a transaction.

> The message Richard has always given is that those who care about
> errors freeze their kernel and do measurement campaign to determine
> what the real error is and then configure user space to deal with
> it. Does this error value negate the need for this?

AFIR, this comment was relevant to measuring errors coming from delays
inside the system.

>> 2. Multi function devices that may have a single isolated function
>> synchronizing the device clock (by means of PTP, or PPS or any other)
>> and letting other functions access the uncertainty information
> 
> So this is the simple message passing API, which could be implemented
> purely in the core? This sounds like it should be a patch of its own,
> explaining the use case. 

If functions are isolated then there is no path for passing the messages
other than through the device.
The trusted function that can control the clock will push the last error
and control the clock to synchronize it as best as it can, other
functions will get the time from the clock and additional info to
calculate the uncertainty.

>> 3. Create a common interface to read the uncertainty from a device
>> (currently you can use PMC for PTP, but there is no way of receiving
>> that information from ts2phc)
> 
> That sounds like a problem with ts2phc? Please could you expand on why
> the kernel should be involved in feature deficits of user space tools?

Not really. Why would all userspace processes need to understand what
synchronizes the time currently and talk to the relevant tool?
All it cares is what the time is and the primitives for calculating
error boundaries and understand if the clock is synchronized good enough
for a given application.

>> Also this is an RFC to help align work on this functionality across
>> different devices ] and validate if that's the right direction. If it is
>> - there will be a patch series with real drivers returning uncertainty
>> information using that interface. If it's not - I'd like to understand
>> what should I improve in the interface.
> 
> I think you took the wrong approach. You should first state in detail
> the use cases. Then show how you solve each use cases, both the user
> and kernel space parts, and include the needed changes to a real
> device driver.
> 
> 	Andrew

Also there is one more use case I missed in that summary:
4. Device need to consume the information about the uncertainty to act
upon it. This is the case when you want to allow certain features to
work only if error boundaries requirements are met. For example you
don't want to allow launch time to work when the error of the clock is
huge, as your packets will launch precisely at unprecise time.

The current adjtime() API call without any flags can fill the time of
day and the last frequency correction. At the same time, the API's timex
structure consists of many other helpful information that are not
populated for PTP hardware clocks. This information consist of esterror
field which is there since kernel 0.99.13k. I'm not inventing a new API,
just trying to add hooks to implement existing APIs inside the PTP
subsystem.
Richard Cochran Aug. 17, 2024, 4:29 a.m. UTC | #24
On Fri, Aug 16, 2024 at 12:06:51AM +0200, Maciek Machnikowski wrote:

> Also this is an RFC to help align work on this functionality across
> different devices ] and validate if that's the right direction. If it is
> - there will be a patch series with real drivers returning uncertainty
> information using that interface.

Please post a real world example with those real drivers.  That will
help us understand what you are trying to accomplish.

Thanks,
Richard