
[3/6] dax: add tracepoint infrastructure, PMD tracing

Message ID 1479926662-21718-4-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Ross Zwisler Nov. 23, 2016, 6:44 p.m. UTC
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems.  Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler.  This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.

I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
be wrapped by the parent function tracepoints so the code flow is more
easily understood.  Having entry and exit tracepoints for faults also
allows us to easily see what filesystem functions were called during the
fault.  These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.

For PMD faults we primarily want to understand the faulting address and
whether it fell back to 4k faults.  If it fell back to 4k faults the
tracepoints should let us understand why.

I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.

Here is an example output for these events from a successful PMD fault:

big-2057  [000] ....   136.396855: dax_pmd_fault: shared mapping write
address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
max_pgoff 0x1400

big-2057  [000] ....   136.397943: dax_pmd_fault_done: shared mapping write
address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
max_pgoff 0x1400 NOPAGE

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
---
 fs/dax.c                      | 29 +++++++++++++-------
 include/linux/mm.h            | 14 ++++++++++
 include/trace/events/fs_dax.h | 61 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/fs_dax.h

Comments

Jan Kara Nov. 24, 2016, 9:16 a.m. UTC | #1
On Wed 23-11-16 11:44:19, Ross Zwisler wrote:
> Tracepoints are the standard way to capture debugging and tracing
> information in many parts of the kernel, including the XFS and ext4
> filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> be done in the same way as the filesystem tracing so that developers can
> look at them together and get a coherent idea of what the system is doing.
> 
> I added both an entry and exit tracepoint because future patches will add
> tracepoints to child functions of dax_iomap_pmd_fault() like
> dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> be wrapped by the parent function tracepoints so the code flow is more
> easily understood.  Having entry and exit tracepoints for faults also
> allows us to easily see what filesystem functions were called during the
> fault.  These filesystem functions get executed via iomap_begin() and
> iomap_end() calls, for example, and will have their own tracepoints.
> 
> For PMD faults we primarily want to understand the faulting address and
> whether it fell back to 4k faults.  If it fell back to 4k faults the
> tracepoints should let us understand why.
> 
> I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> to have its own separate tracing header in the same directory at some
> point.
> 
> Here is an example output for these events from a successful PMD fault:
> 
> big-2057  [000] ....   136.396855: dax_pmd_fault: shared mapping write
> address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> max_pgoff 0x1400
> 
> big-2057  [000] ....   136.397943: dax_pmd_fault_done: shared mapping write
> address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> max_pgoff 0x1400 NOPAGE
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>

Looks good. Just one minor comment:

> +	TP_printk("%s mapping %s address %#lx vm_start %#lx vm_end %#lx "
> +		"pgoff %#lx max_pgoff %#lx %s",
> +		__entry->vm_flags & VM_SHARED ? "shared" : "private",
> +		__entry->flags & FAULT_FLAG_WRITE ? "write" : "read",
> +		__entry->address,
> +		__entry->vm_start,
> +		__entry->vm_end,
> +		__entry->pgoff,
> +		__entry->max_pgoff,
> +		__print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE)
> +	)
> +)

I think it may be useful to dump full 'flags', not just FAULT_FLAG_WRITE...
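
Something like the following would do it - FAULT_FLAG_TRACE here is just a
sketch of a flag table analogous to the VM_FAULT_RESULT_TRACE table this
patch adds (nothing with that name exists in the tree):

#define FAULT_FLAG_TRACE \
	{ FAULT_FLAG_WRITE,		"WRITE" }, \
	{ FAULT_FLAG_MKWRITE,		"MKWRITE" }, \
	{ FAULT_FLAG_ALLOW_RETRY,	"ALLOW_RETRY" }, \
	{ FAULT_FLAG_RETRY_NOWAIT,	"RETRY_NOWAIT" }, \
	{ FAULT_FLAG_KILLABLE,		"KILLABLE" }

which the TP_printk() could then decode with
__print_flags(__entry->flags, "|", FAULT_FLAG_TRACE).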

								Honza
Al Viro Nov. 24, 2016, 5:32 p.m. UTC | #2
On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> Tracepoints are the standard way to capture debugging and tracing
> information in many parts of the kernel, including the XFS and ext4
> filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> be done in the same way as the filesystem tracing so that developers can
> look at them together and get a coherent idea of what the system is doing.

	It also has one hell of a potential for becoming a massive nuisance.
Keep in mind that if any userland code becomes dependent on those - that's it,
they have become parts of stable userland ABI and are to be maintained
indefinitely.  Don't expect "tracepoints are special case" to prevent that.

	So treat anything you add in that manner as potential stable ABI
you might have to keep around forever.  It's *not* a glorified debugging
printk.
Dave Chinner Nov. 25, 2016, 2:49 a.m. UTC | #3
On Thu, Nov 24, 2016 at 05:32:20PM +0000, Al Viro wrote:
> On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> > Tracepoints are the standard way to capture debugging and tracing
> > information in many parts of the kernel, including the XFS and ext4
> > filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> > tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> > be done in the same way as the filesystem tracing so that developers can
> > look at them together and get a coherent idea of what the system is doing.
> 
> 	It also has one hell of a potential for becoming a massive nuisance.
> Keep in mind that if any userland code becomes dependent on those - that's it,
> they have become parts of stable userland ABI and are to be maintained
> indefinitely.  Don't expect "tracepoints are special case" to prevent that.

I call bullshit just like I always do when someone spouts this
"tracepoints are stable ABI" garbage.

If we want to provide stable tracepoints, then we need to /create a
stable tracepoint API/ and convert all the tracepoints that /need
to be stable/ to use it. Then developers only need to be careful
about modifying code around the /explicitly stable/ tracepoints and
we avoid retrospectively locking the kernel implementation into a
KABI so tight we can't do anything anymore....

Quite frankly, anyone that wants to stop us from
adding/removing/changing tracepoints or the code that they are
reporting information about "because ABI" can go take a long walk
off a short cliff.  Diagnostic tracepoints are not part of the
stable ABI. End of story.

> 	So treat anything you add in that manner as potential stable ABI
> you might have to keep around forever.  It's *not* a glorified debugging
> printk.

trace_printk() is the glorified debugging printk for tracing, not
trace events.

Cheers,

Dave.
Dave Chinner Nov. 25, 2016, 3 a.m. UTC | #4
On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> Tracepoints are the standard way to capture debugging and tracing
> information in many parts of the kernel, including the XFS and ext4
> filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> be done in the same way as the filesystem tracing so that developers can
> look at them together and get a coherent idea of what the system is doing.
> 
> I added both an entry and exit tracepoint because future patches will add
> tracepoints to child functions of dax_iomap_pmd_fault() like
> dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> be wrapped by the parent function tracepoints so the code flow is more
> easily understood.  Having entry and exit tracepoints for faults also
> allows us to easily see what filesystem functions were called during the
> fault.  These filesystem functions get executed via iomap_begin() and
> iomap_end() calls, for example, and will have their own tracepoints.
> 
> For PMD faults we primarily want to understand the faulting address and
> whether it fell back to 4k faults.  If it fell back to 4k faults the
> tracepoints should let us understand why.
> 
> I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> to have its own separate tracing header in the same directory at some
> point.
> 
> Here is an example output for these events from a successful PMD fault:
> 
> big-2057  [000] ....   136.396855: dax_pmd_fault: shared mapping write
> address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> max_pgoff 0x1400
> 
> big-2057  [000] ....   136.397943: dax_pmd_fault_done: shared mapping write
> address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> max_pgoff 0x1400 NOPAGE

Can we make the output use the same format as most of the filesystem
code? i.e. the output starts with backing device + inode number like
so:

	xfs_ilock:            dev 8:96 ino 0x493 flags ILOCK_EXCL....

This way we can filter the output easily across both dax and
filesystem tracepoints with 'grep "ino 0x493"'...
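
Assuming dev and ino fields get added to the event, the format string
would then start out something like this (illustrative only, just to
match the other fs tracepoints):

	TP_printk("dev %d:%d ino %#lx %s %s address %#lx pgoff %#lx "
		"max_pgoff %#lx %s",
		MAJOR(__entry->dev), MINOR(__entry->dev),
		__entry->ino,
		__entry->vm_flags & VM_SHARED ? "shared" : "private",
		__entry->flags & FAULT_FLAG_WRITE ? "write" : "read",
		__entry->address,
		__entry->pgoff,
		__entry->max_pgoff,
		__print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE)
	)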

Cheers,

Dave.
Al Viro Nov. 25, 2016, 4:14 a.m. UTC | #5
[Linus Cc'd]

On Fri, Nov 25, 2016 at 01:49:18PM +1100, Dave Chinner wrote:
> > they have become parts of stable userland ABI and are to be maintained
> > indefinitely.  Don't expect "tracepoints are special case" to prevent that.
> 
> I call bullshit just like I always do when someone spouts this
> "tracepoints are stable ABI" garbage.

> Quite frankly, anyone that wants to stop us from
> adding/removing/changing tracepoints or the code that they are
> reporting information about "because ABI" can go take a long walk
> off a short cliff.  Diagnostic tracepoints are not part of the
> stable ABI. End of story.

Tell that to Linus.  You had been in the room, IIRC, when that had been
brought up this year in Santa Fe.  "End of story" is not going to be
yours (or mine, for that matter) to declare - Linus is the only one who
can do that.  If he says "if userland code relies upon it, then that
userland code needs to be fixed" - I'm very happy (and everyone involved
can count upon quite a few free drinks from me at the next summit).  If
it's "that userland code really shouldn't have relied upon it, and it's
real unfortunate that it does, but we still get to keep it working" -
too bad, "because ABI" is the reality and we will be the ones to take
that long walk.

What I heard from Linus sounded a lot closer to the second variant.
_IF_ I have misinterpreted that, I'd love to hear that.  Linus?

PS: I'm dead serious about large amounts of booze of choice at LSFS 2017.
Bribery or shared celebration - call it whatever you like; I really,
really want to have tracepoints free from ABIfication concerns.  They
certainly are useful for debugging purposes - no arguments here.
Dave Chinner Nov. 25, 2016, 7:06 a.m. UTC | #6
On Fri, Nov 25, 2016 at 04:14:19AM +0000, Al Viro wrote:
> [Linus Cc'd]
> 
> On Fri, Nov 25, 2016 at 01:49:18PM +1100, Dave Chinner wrote:
> > > they have become parts of stable userland ABI and are to be maintained
> > > indefinitely.  Don't expect "tracepoints are special case" to prevent that.
> > 
> > I call bullshit just like I always do when someone spouts this
> > "tracepoints are stable ABI" garbage.
> 
> > Quite frankly, anyone that wants to stop us from
> > adding/removing/changing tracepoints or the code that they are
> > reporting information about "because ABI" can go take a long walk
> > off a short cliff.  Diagnostic tracepoints are not part of the
> > stable ABI. End of story.
> 
> Tell that to Linus.  You had been in the room, IIRC, when that had been
> brought up this year in Santa Fe.

No, I wasn't at KS or plumbers, so this is all news to me. Believe
me, if I was in the room when this discussion was in progress, you'd
remember it /very clearly/.

> "End of story" is not going to be
> yours (or mine, for that matter) to declare - Linus is the only one who
> can do that.  If he says "if userland code relies upon it, so that
> userland code needs to be fixed" - I'm very happy (and everyone involved
> can count upon quite a few free drinks from me at the next summit).  If
> it's "that userland code really shouldn't have relied upon it, and it's
> real unfortunate that it does, but we still get to keep it working" -
> too bad, "because ABI" is the reality and we will be the ones to take
> that long walk.

When the tracepoint infrastructure was added it was considered a
debugging tool and not stable - it was even exposed through
/sys/kernel/debug! We connected up the ~280 /debug/ tracepoints we
had in XFS at the time with the understanding it was a /diagnostic
tool/. We exposed all sorts of internal details we'd previously been
exposing with tracing through lcrash and kdb (and Irix before that)
so we could diagnose problems quickly on a running kernel.

The scope of tracepoints may have grown since then, but it does not
change the fact that many of the tracepoints that were added years
ago were done under the understanding that it was a mutable
interface and nobody could rely on any specific tracepoint detail
remaining unchanged.

We're still treating them as mutable diagnostic and debugging aids
across the kernel. In XFS, we've now got over *500* unique trace
events and *650* tracepoints; ignoring comments, *4%* of the entire
XFS kernel code base is tracing code.  We expose structure contents,
transaction states, locking algorithms, object life cycles, journal
operations, etc. All the new reverse mapping and shared data extent
code that has been merged in 4.8 and 4.9 has been extensively
exposed by tracepoints - these changes also modified a significant
number of existing tracepoints.

Put simply: every major and most minor pieces of functionality in
XFS are exposed via tracepoints.

Hence if the stable ABI tracepoint rules you've just described are
going to be enforced, it will mean we will not be able to change
anything significant in XFS because almost everything significant we
do involves changing tracepoints in some way. This leaves us with
three unacceptable choices:

	1. stop developing XFS so we can maintain the stable
	tracepoint ABI;

	2. ignore the ABI rules and hope that Linus keeps pulling
	code that obviously ignores the ABI rules; or

	3. screw over our upstream/vanilla kernel users by removing
	the tracepoints from Linus' tree and suck up the pain of
	maintaining an out of tree patch for XFS developers and
	distros so kernel tracepoint ABI rules can be ignored.

Nobody wins if these are the only choices we are being given.

I understand why there is a desire for stable tracepoints, and
that's why I suggested that there should be an in-kernel API to
declare stable tracepoints. That way we can have the best of both
worlds - tracepoints that applications need to be stable can be
declared, reviewed and explicitly marked as stable in full knowledge
of what that implies. The rest of the vast body of tracepoints can
be left as mutable with no stability or existence guarantees so that
developers can continue to treat them in a way that best suits
problem diagnosis without compromising the future development of the
code being traced. If userspace finds some of those tracepoints
useful, then they can be taken through the process of making them
into a maintainable stable form and being marked as such.

We already have distros mounting the tracing subsystem on
/sys/kernel/tracing. Expose all the stable tracepoints there, and
leave all the other tracepoints under /sys/kernel/debug/tracing.
Simple, clear separation between stable and mutable diagnostic
tracepoints for users, combined with a simple, clear in-kernel API
and process for making tracepoints stable....
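
As a very rough sketch of the sort of thing I mean (nothing like this
exists today, the name is made up):

/*
 * Hypothetical: a "stable" event expands to a normal TRACE_EVENT()
 * plus whatever bookkeeping is needed to expose it under
 * /sys/kernel/tracing instead of /sys/kernel/debug/tracing.
 */
#define TRACE_EVENT_STABLE(name, proto, args, tstruct, assign, print)	\
	TRACE_EVENT(name, PARAMS(proto), PARAMS(args),			\
		    PARAMS(tstruct), PARAMS(assign), PARAMS(print))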

Cheers,

Dave.
Al Viro Nov. 25, 2016, 7:37 a.m. UTC | #7
On Fri, Nov 25, 2016 at 06:06:42PM +1100, Dave Chinner wrote:

> > Tell that to Linus.  You had been in the room, IIRC, when that had been
> > brought up this year in Santa Fe.
> 
> No, I wasn't at KS or plumbers, so this is all news to me.

Sorry, thought you had been at KS ;-/  My apologies...

[snip bloody good points I fully agree with]

> I understand why there is a desire for stable tracepoints, and
> that's why I suggested that there should be an in-kernel API to
> declare stable tracepoints. That way we can have the best of both
> worlds - tracepoints that applications need to be stable can be
> declared, reviewed and explicitly marked as stable in full knowledge
> of what that implies. The rest of the vast body of tracepoints can
> be left as mutable with no stability or existence guarantees so that
> developers can continue to treat them in a way that best suits
> problem diagnosis without compromising the future development of the
> code being traced. If userspace finds some of those tracepoints
> useful, then they can be taken through the process of making them
> into a maintainable stable form and being marked as such.

My impression is that nobody (at least kernel-side) wants them to be
a stable ABI, so long as nobody in userland screams about their code
being broken, everything is fine.  As usual, if nobody notices an ABI
change, it hasn't happened.  The question is what happens when somebody
does.

> We already have distros mounting the tracing subsystem on
> /sys/kernel/tracing. Expose all the stable tracepoints there, and
> leave all the other tracepoints under /sys/kernel/debug/tracing.
> Simple, clear separation between stable and mutable diagnostic
> tracepoints for users, combined with a simple, clear in-kernel API
> and process for making tracepoints stable....

Yep.  That kind of separation would be my preference as well - ideally,
with review for stable ones being a lot less casual than for unstable;
AFAICS what happens now is that we have no mechanisms for marking them as
stable or unstable and everything keeps going on the hope that nobody will
cause a mess by creating such a userland dependency.  So far it's been mostly
working, but as the set of tracepoints (and their use) gets wider and wider,
IMO it's only a matter of time until we get seriously screwed that way.

Basically, we are gambling on the next one to be cast in stone by userland
dependency being sane enough to make it possible to maintain it indefinitely
and I don't like the odds.
Linus Torvalds Nov. 25, 2016, 7:51 p.m. UTC | #8
On Thu, Nov 24, 2016 at 11:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> My impression is that nobody (at least kernel-side) wants them to be
> a stable ABI, so long as nobody in userland screams about their code
> being broken, everything is fine.  As usual, if nobody notices an ABI
> change, it hasn't happened.  The question is what happens when somebody
> does.

Right. There is basically _no_ "stable API" for the kernel anywhere,
it's just an issue of "you can't break workflow for normal people".

And if somebody writes his own trace scripts, and some random trace
point goes away (or changes semantics), that's easy: he can just fix
his script. Tracepoints aren't ever going to be stable in that sense.

But when somebody then writes a trace script that is so useful that
distros pick it up, and people start using it and depending on it,
then _that_ trace point may well have become effectively locked in
stone.

That's happened once already with the whole powertop thing. It didn't
get all that widely spread, and the fix was largely to improve on
powertop to the point where it wasn't a problem any more, but we've
clearly had one case of this happening.

But I suspect that something like powertop is fairly unusual. There is
certainly room for similar things in the VFS layer (think "better
vmstat that uses tracepoints"), but I suspect the bulk of tracepoints
are going to be for random debugging (so that developers can say
"please run this script") rather than for an actual user application
kind of situation.

So I don't think we should be _too_ afraid of tracepoints either. When
being too anti-tracepoint is a bigger practical problem than the
possible problems down the line, the balance is wrong.

As long as tracepoints make sense from a higher standpoint (ie not
just random implementation detail of the day), and they aren't
everywhere, they are unlikely to cause much problem.

We do have filesystem code that is just disgusting. As an example:
fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
single function. If you want that, use the function tracer. That seems
to be just debugging code that has been left around for others to
stumble over. I do *not* believe that we should encourage that kind of
"machine gun spray" use of tracepoints.

But tracing actual high-level things like IO and faults? I think that
makes perfect sense, as long as the data that is collected is also the
actual event data, and not so much a random implementation issue of
the day.

             Linus
Mike Marshall Nov. 25, 2016, 8:36 p.m. UTC | #9
> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function.

Hmmm... we have "gossip" statements in Orangefs which can be triggered with
a debugfs knob... lots of functions have a gossip statement at the top...
is that disgusting?

-Mike

On Fri, Nov 25, 2016 at 2:51 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Nov 24, 2016 at 11:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> My impression is that nobody (at least kernel-side) wants them to be
>> a stable ABI, so long as nobody in userland screams about their code
>> being broken, everything is fine.  As usual, if nobody notices an ABI
>> change, it hasn't happened.  The question is what happens when somebody
>> does.
>
> Right. There is basically _no_ "stable API" for the kernel anywhere,
> it's just an issue of "you can't break workflow for normal people".
>
> And if somebody writes his own trace scripts, and some random trace
> point goes away (or changes semantics), that's easy: he can just fix
> his script. Tracepoints aren't ever going to be stable in that sense.
>
> But when somebody then writes a trace script that is so useful that
> distros pick it up, and people start using it and depending on it,
> then _that_ trace point may well have become effectively locked in
> stone.
>
> That's happened once already with the whole powertop thing. It didn't
> get all that widely spread, and the fix was largely to improve on
> powertop to the point where it wasn't a problem any more, but we've
> clearly had one case of this happening.
>
> But I suspect that something like powertop is fairly unusual. There is
> certainly room for similar things in the VFS layer (think "better
> vmstat that uses tracepoints"), but I suspect the bulk of tracepoints
> are going to be for random debugging (so that developers can say
> "please run this script") rather than for an actual user application
> kind of situation.
>
> So I don't think we should be _too_ afraid of tracepoints either. When
> being too anti-tracepoint is a bigger practical problem than the
> possible problems down the line, the balance is wrong.
>
> As long as tracepoints make sense from a higher standpoint (ie not
> just random implementation detail of the day), and they aren't
> everywhere, they are unlikely to cause much problem.
>
> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function. If you want that, use the function tracer. That seems
> to be just debugging code that has been left around for others to
> stumble over. I do *not* believe that we should encourage that kind of
> "machine gun spray" use of tracepoints.
>
> But tracing actual high-level things like IO and faults? I think that
> makes perfect sense, as long as the data that is collected is also the
> actual event data, and not so much a random implementation issue of
> the day.
>
>              Linus
Theodore Ts'o Nov. 25, 2016, 9:48 p.m. UTC | #10
On Fri, Nov 25, 2016 at 11:51:26AM -0800, Linus Torvalds wrote:
> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function. If you want that, use the function tracer. That seems
> to be just debugging code that has been left around for others to
> stumble over. I do *not* believe that we should encourage that kind of
> "machine gun spray" use of tracepoints.

There is a reason why people want to be able to do that, and that's
because kprobes doesn't give you access to the arguments and return
codes to the functions.  Maybe there could be a way to do this more
easily using DWARF information and EBPF magic, perhaps?  It won't help
for inlined functions, of course, but most of the functions where
people want to do this aren't generally functions which are going to
be inlined, but rather things like write_begin, writepages, which are
called via a struct ops table and so will never be inlined to begin
with.

And it *is* handy to be able to do this when you don't know ahead of
time that you might need to debug a production system that is
malfunctioning for some reason.  This is the "S" in RAS (Reliability,
Availability, Serviceability).  This is why it would be nice if there were a
way to be clear that it is intended for debugging purposes only ---
and maybe kprobes with EBPF and DWARF would be the answer.

After all, we need *some* way of saying this can never be considered
stable --- what would we do if some userspace program like powertop
started depending on a function name via ktrace and that function
disappeared --- would the userspace application really be entitled to
demand that we revert the refactoring, because eliminating a function
name that they were depending on via a ktrace point broke them?

						- Ted
Linus Torvalds Nov. 25, 2016, 11:38 p.m. UTC | #11
On Fri, Nov 25, 2016 at 1:48 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> There is a reason why people want to be able to do that, and that's
> because kprobes doesn't give you access to the arguments and return
> codes to the functions.

Honestly, that's simply not a good reason.

What if everybody did this? Do we pollute the whole kernel with this crap? No.

And if not, then what's so special about something like afs that it
would make sense there?

The thing is, with function tracing, you *can* get the return value
and arguments. Sure, you'll probably need to write eBPF and just
attach it to that fentry call point, and yes, if something is inlined
you're just screwed, but Christ, if you do debugging that way you
shouldn't be writing kernel code in the first place.

If you cannot do filesystem debugging without tracing every single
function entry, you are doing something seriously wrong. Add a couple
of relevant and valid trace points to get the initial arguments etc
(and perhaps to turn on the function tracing going down the stack).

> After all, we need *some* way of saying this can never be considered
> stable.

Oh, if you pollute the kernel with random idiotic trace points, not
only are they not going to be considered stable, but after a while people
should stop pulling from you.

I do think we should probably add a few generic VFS level tracepoints
to make it easier for people to catch the arguments they get from the
VFS layer (not every system call - if you're a filesystem person, you
_shouldn't_ care about all the stuff that the VFS layer caches for you
so that you never even have to see it). I do think that Al's "no trace
points what-so-ever" is too strict.
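
Purely as an illustration of what such a generic VFS-level event could
look like (this one is made up, not an actual proposal; it would sit in a
trace header with the usual TRACE_SYSTEM boilerplate):

TRACE_EVENT(vfs_writepages,
	TP_PROTO(struct inode *inode, pgoff_t start, pgoff_t end),
	TP_ARGS(inode, start, end),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(unsigned long, ino)
		__field(pgoff_t, start)
		__field(pgoff_t, end)
	),
	TP_fast_assign(
		__entry->dev = inode->i_sb->s_dev;
		__entry->ino = inode->i_ino;
		__entry->start = start;
		__entry->end = end;
	),
	TP_printk("dev %d:%d ino %#lx start %#lx end %#lx",
		MAJOR(__entry->dev), MINOR(__entry->dev),
		__entry->ino, __entry->start, __entry->end)
);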

But I think a lot of people add complete crap with the "maybe it's
needed some day" kind of mentality.

The tracepoints should have a good _specific_ reason, and they should
make sense. Not be randomly sprinkled "just because".

             Linus
Dave Chinner Nov. 27, 2016, 10:42 p.m. UTC | #12
On Fri, Nov 25, 2016 at 11:51:26AM -0800, Linus Torvalds wrote:
> On Thu, Nov 24, 2016 at 11:37 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > My impression is that nobody (at least kernel-side) wants them to be
> > a stable ABI, so long as nobody in userland screams about their code
> > being broken, everything is fine.  As usual, if nobody notices an ABI
> > change, it hasn't happened.  The question is what happens when somebody
> > does.
> 
> Right. There is basically _no_ "stable API" for the kernel anywhere,
> it's just an issue of "you can't break workflow for normal people".
> 
> And if somebody writes his own trace scripts, and some random trace
> point goes away (or changes semantics), that's easy: he can just fix
> his script. Tracepoints aren't ever going to be stable in that sense.
> 
> But when somebody then writes a trace script that is so useful that
> distros pick it up, and people start using it and depending on it,
> then _that_ trace point may well have become effectively locked in
> stone.

And that's exactly why we need a method of marking tracepoints as
stable. How else are we going to know whether a specific tracepoint
is stable if the kernel code doesn't document that it's stable? And
how are we going to know why it's considered stable if there isn't a
commit message that explains why it was made stable?

> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function. If you want that, use the function tracer. That seems
> to be just debugging code that has been left around for others to
> stumble over. I do *not* believe that we should encourage that kind of
> "machine gun spray" use of tracepoints.

Inappropriate use of tracepoints is a different problem. The issue
here is getting rid of the uncertainty caused by the handwavy
"tracepoints a mutable until someone, somewhere decides to use it in
userspace" policy.

> But tracing actual high-level things like IO and faults? I think that
> makes perfect sense, as long as the data that is collected is also the
> actual event data, and not so much a random implementation issue of
> the day.

IME, a tracepoint that doesn't expose detailed context specific
information isn't really useful for complex problem diagnosis...

Cheers,

Dave.
Linus Torvalds Nov. 28, 2016, 12:58 a.m. UTC | #13
On Sun, Nov 27, 2016 at 2:42 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> And that's exactly why we need a method of marking tracepoints as
> stable. How else are we going to know whether a specific tracepoint
> is stable if the kernel code doesn't document that it's stable?

You are living in some unrealistic dream-world where you think you can
get the right tracepoint on the first try.

So there is no way in hell I would ever mark any tracepoint "stable"
until it has had a fair amount of use, and there are useful tools that
actually make use of it, and it has shown itself to be the right
trace-point.

And once that actually happens, what's the advantage of marking it
stable? None. It's a catch-22. Before it has uses and has been tested
and found to be good, it's not stable. And after, it's pointless.

So at no point does such a "stable" tracepoint marking make sense. At
most, you end up adding a comment saying "this tracepoint is used by
tools such-and-such".

                   Linus
Al Viro Nov. 28, 2016, 1:45 a.m. UTC | #14
On Sun, Nov 27, 2016 at 04:58:43PM -0800, Linus Torvalds wrote:
> You are living in some unrealistic dream-world where you think you can
> get the right tracepoint on the first try.
> 
> So there is no way in hell I would ever mark any tracepoint "stable"
> until it has had a fair amount of use, and there are useful tools that
> actually make use of it, and it has shown itself to be the right
> trace-point.
> 
> And once that actually happens, what's the advantage of marking it
> stable? None. It's a catch-22. Before it has uses and has been tested
> and found to be good, it's not stable. And after, it's pointless.
> 
> So at no point does such a "stable" tracepoint marking make sense. At
> most, you end up adding a comment saying "this tracepoint is used by
> tools such-and-such".

I can't speak for Dave, but I suspect that it's more about "this, this and
that tracepoints are purely internal and we can and will change them whenever
we bloody feel like that; stick your fingers in those and they _will_ get
crushed".

Incidentally, take a look at
        trace_ocfs2_file_aio_read(inode, filp, filp->f_path.dentry,
                        (unsigned long long)OCFS2_I(inode)->ip_blkno,
                        filp->f_path.dentry->d_name.len,
                        filp->f_path.dentry->d_name.name,
                        to->nr_segs);   /* GRRRRR */
Note that there is nothing whatsoever protecting the use of ->d_name in
there (not that poking in iov_iter guts was a good idea).  Besides, suppose
something *did* grab a hold of that one a while ago.  What would we have
to do to avoid stepping on its toes every time when somebody call ocfs2
->splice_read(), which has recently started to go through ->read_iter()
calls?  Prepend something like if (!(to->type & ITER_PIPE)) to it?

I'm very tempted to just go and remove it, along with its analogues.
If nothing else, the use of ->d_name *is* racy, and while it might be
tolerable for occasional debugging, for anything in heavier use it's
asking for trouble...
Jan Kara Nov. 28, 2016, 8:33 a.m. UTC | #15
On Fri 25-11-16 16:48:40, Ted Tso wrote:
> On Fri, Nov 25, 2016 at 11:51:26AM -0800, Linus Torvalds wrote:
> > We do have filesystem code that is just disgusting. As an example:
> > fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> > single function. If you want that, use the function tracer. That seems
> > to be just debugging code that has been left around for others to
> > stumble over. I do *not* believe that we should encourage that kind of
> > "machine gun spray" use of tracepoints.
> 
> There is a reason why people want to be able to do that, and that's
> because kprobes doesn't give you access to the arguments and return
> codes to the functions.  Maybe there could be a way to do this more
> easily using DWARF information and EBPF magic, perhaps?  It won't help
> for inlined functions, of course, but most of the functions where
> people want to do this aren't generally functions which are going to
> be inlined, but rather things like write_begin, writepages, which are
> called via a struct ops table and so will never be inlined to begin
> with.

Actually, you can print register & stack contents from a kprobe and you can
get a function return value from a kretprobe (see
Documentation/trace/kprobetrace.txt). Since the calling convention is fixed
(arg 1 in RDI, arg 2 in RSI...) you can relatively easily dump function
arguments on entry and dump the return value on return for an arbitrary
function of your choice. I have already debugged issues like that several
times (in the VFS, actually, because of missing trace points ;)). You can
even create a kprobe to dump register contents in the middle of a function
(although there it takes more effort reading the disassembly to see what
you are interested in).

								Honza
Dave Chinner Nov. 28, 2016, 9:09 a.m. UTC | #16
On Sun, Nov 27, 2016 at 04:58:43PM -0800, Linus Torvalds wrote:
> On Sun, Nov 27, 2016 at 2:42 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > And that's exactly why we need a method of marking tracepoints as
> > stable. How else are we going to know whether a specific tracepoint
> > is stable if the kernel code doesn't document that it's stable?
> 
> You are living in some unrealistic dream-world where you think you can
> get the right tracepoint on the first try.

No, I'm under no illusions that we'd get stable tracepoints right
the first go. I don't care about how we stabilise stable
tracepoints, because nothing I maintain will use stable tracepoints.
However, I will point out that we have /already solved these ABI
development problems/.

$ ls Documentation/ABI/
obsolete  README  removed  stable  testing
$

Expand this to include stable tracepoints, not just sysfs files.
New stable stuff gets classified as "testing", meaning it is supposed
to be stable but may change before being declared officially stable.
"stable" is obvious, as are "obsolete" and "removed".


> So there is no way in hell I would ever mark any tracepoint "stable"
> until it has had a fair amount of use, and there are useful tools that
> actually make use of it, and it has shown itself to be the right
> trace-point.
> 
> And once that actually happens, what's the advantage of marking it
> stable? None. It's a catch-22. Before it has uses and has been tested
> and found to be good, it's not stable. And after, it's pointless.

And that "catch-22" is *precisely the problem we need to solve*.

Pointing out that there's a catch-22 doesn't help anyone - all the
developers that are telling you that they really need a way to mark
stable tracepoints already understand this catch-22 and they want a
way to avoid it.  Being able to either say "this is stable and we'll
support it forever" or "this will never be stable so use at your own
risk" is a simple way of avoiding the catch-22. If an unstable
tracepoint is useful to applications *and it can be implemented in a
maintainable, stable form* then it can go through the process of
being made stable and documented in Documentation/ABI/stable.

Problem is solved, catch-22 is gone.

All we want is some method of making a clear, unambiguous statement
about the nature of a specific tracepoint and a process for
transitioning a tracepoint to a stable, maintainable form. We do it
for other ABI interfaces, so why can't we do this for tracepoints?

Cheers,

Dave.
Ross Zwisler Nov. 28, 2016, 10:46 p.m. UTC | #17
On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote:
> On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> > Tracepoints are the standard way to capture debugging and tracing
> > information in many parts of the kernel, including the XFS and ext4
> > filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> > tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> > be done in the same way as the filesystem tracing so that developers can
> > look at them together and get a coherent idea of what the system is doing.
> > 
> > I added both an entry and exit tracepoint because future patches will add
> > tracepoints to child functions of dax_iomap_pmd_fault() like
> > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> > be wrapped by the parent function tracepoints so the code flow is more
> > easily understood.  Having entry and exit tracepoints for faults also
> > allows us to easily see what filesystem functions were called during the
> > fault.  These filesystem functions get executed via iomap_begin() and
> > iomap_end() calls, for example, and will have their own tracepoints.
> > 
> > For PMD faults we primarily want to understand the faulting address and
> > whether it fell back to 4k faults.  If it fell back to 4k faults the
> > tracepoints should let us understand why.
> > 
> > I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> > to have its own separate tracing header in the same directory at some
> > point.
> > 
> > Here is an example output for these events from a successful PMD fault:
> > 
> > big-2057  [000] ....   136.396855: dax_pmd_fault: shared mapping write
> > address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> > max_pgoff 0x1400
> > 
> > big-2057  [000] ....   136.397943: dax_pmd_fault_done: shared mapping write
> > address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> > max_pgoff 0x1400 NOPAGE
> 
> Can we make the output use the same format as most of the filesystem
> code? i.e. the output starts with backing device + inode number like
> so:
> 
> 	xfs_ilock:            dev 8:96 ino 0x493 flags ILOCK_EXCL....
> 
> This way we can filter the output easily across both dax and
> filesystem tracepoints with 'grep "ino 0x493"'...

I think I can include the inode number, which I have via mapping->host.  Am I
correct in assuming "struct inode.i_ino" will always be the same as
"struct xfs_inode.i_ino"?

Unfortunately I don't have access to the major/minor (the dev_t) until I call
iomap_begin().  Currently we call iomap_begin() only after we've done most of
our sanity checking that would cause us to fall back to PTEs.

I don't think we want to reorder things so that we start with an iomap_begin()
- that would mean that we would begin by asking the filesystem for a block
allocation, when in many cases we would then do an alignment check or
something similar and then fall back to PTEs.

I'll add "ino" to the output so it looks something like this:

big-2057  [000] ....   136.397943: dax_pmd_fault_done: ino 0x493 shared
mapping write address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff
0x200 max_pgoff 0x1400 NOPAGE
Dave Chinner Nov. 29, 2016, 2:02 a.m. UTC | #18
On Mon, Nov 28, 2016 at 03:46:51PM -0700, Ross Zwisler wrote:
> On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote:
> > On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> > > Tracepoints are the standard way to capture debugging and tracing
> > > information in many parts of the kernel, including the XFS and ext4
> > > filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> > > tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> > > be done in the same way as the filesystem tracing so that developers can
> > > look at them together and get a coherent idea of what the system is doing.
> > > 
> > > I added both an entry and exit tracepoint because future patches will add
> > > tracepoints to child functions of dax_iomap_pmd_fault() like
> > > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> > > be wrapped by the parent function tracepoints so the code flow is more
> > > easily understood.  Having entry and exit tracepoints for faults also
> > > allows us to easily see what filesystem functions were called during the
> > > fault.  These filesystem functions get executed via iomap_begin() and
> > > iomap_end() calls, for example, and will have their own tracepoints.
> > > 
> > > For PMD faults we primarily want to understand the faulting address and
> > > whether it fell back to 4k faults.  If it fell back to 4k faults the
> > > tracepoints should let us understand why.
> > > 
> > > I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> > > to have its own separate tracing header in the same directory at some
> > > point.
> > > 
> > > Here is an example output for these events from a successful PMD fault:
> > > 
> > > big-2057  [000] ....   136.396855: dax_pmd_fault: shared mapping write
> > > address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> > > max_pgoff 0x1400
> > > 
> > > big-2057  [000] ....   136.397943: dax_pmd_fault_done: shared mapping write
> > > address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
> > > max_pgoff 0x1400 NOPAGE
> > 
> > Can we make the output use the same format as most of the filesystem
> > code? i.e. the output starts with backing device + inode number like
> > so:
> > 
> > 	xfs_ilock:            dev 8:96 ino 0x493 flags ILOCK_EXCL....
> > 
> > This way we can filter the output easily across both dax and
> > filesystem tracepoints with 'grep "ino 0x493"'...
> 
> I think I can include the inode number, which I have via mapping->host.  Am I
> correct in assuming "struct inode.i_ino" will always be the same as
> "struct xfs_inode.i_ino"?

Yes - just use inode.i_ino.

> Unfortunately I don't have access to the major/minor (the dev_t) until I call
> iomap_begin(). 

In general, filesystem tracing uses inode->i_sb->s_dev as the
identifier. NFS, gfs2, XFS, ext4 and others all use this.
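
In the TP_fast_assign that would look roughly like this (illustrative
only, using whatever dev/ino fields you end up adding to the event):

	__entry->dev = file_inode(vma->vm_file)->i_sb->s_dev;
	__entry->ino = file_inode(vma->vm_file)->i_ino;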

Cheers,

Dave.
Mike Marshall March 8, 2017, 10:05 p.m. UTC | #19
This is a reply to a thread from back in Nov 2016...

Linus> The thing is, with function tracing, you *can* get the return value
Linus> and arguments. Sure, you'll probably need to write eBPF and just
Linus> attach it to that fentry call point, and yes, if something is inlined
Linus> you're just screwed, but Christ, if you do debugging that way you
Linus> shouldn't be writing kernel code in the first place.

I've been thinking about this thread ever since then... lately I've been
reading about eBPF and looking at Brendan Gregg's bcc tool, and
reading about kprobes and I made some test tracepoints too...

Orangefs has a function:

orangefs_create(struct inode *dir,
                        struct dentry *dentry,
                        umode_t mode,
                        bool exclusive)

I found it was easy to use a kprobe to get a function's return code:

echo 'r:orangefs_create_retprobe orangefs_create $retval' >>
/sys/kernel/debug/tracing/kprobe_events

It was also easy to use a kprobe to print the arguments to a function,
but since arguments are often pointers, maybe the pointer value isn't
really what you want to see; rather, you might wish you could "see into"
a structure that you have a pointer to...

Here's a kprobe for looking at the arguments to a function. The funny
di, si, dx, stack stuff has to do with a particular architecture's
parameter-passing mechanism as it relates to registers and the stack...
I mention that because the example in kprobetrace.txt seems to be for a
32-bit computer and I'm (like most of us?) on a 64-bit computer.

echo 'p:orangefs_create_probe orangefs_create dir=%di dentry=%si
mode=%dx exclusive=+4($stack)' >
/sys/kernel/debug/tracing/kprobe_events

I wrote a bcc program to extract the name of an object from the dentry struct
passed into orangefs_create. The bcc program is kind of verbose, but is
mostly "templatey", here it is... is this the kind of thing Linus was
talking about?

Either way, I think it is kind of cool <g>...

#!/usr/bin/python
#
# Brendan Gregg's mdflush used as a template.
#

from __future__ import print_function
from bcc import BPF
from time import strftime
import ctypes as ct

# load BPF program
b = BPF(text="""
#include <uapi/linux/limits.h>
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/genhd.h>
#include <linux/dcache.h>

struct data_t {
    u64 pid;
    char comm[TASK_COMM_LEN];
    char objname[NAME_MAX];
};
BPF_PERF_OUTPUT(events);

int kprobe__orangefs_create(struct pt_regs *ctx,
                            void *dir,
                            struct dentry *dentry)
{
    struct data_t data = {};
    u32 pid = bpf_get_current_pid_tgid();
    data.pid = pid;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    bpf_probe_read(&data.objname,
                   sizeof(data.objname),
                   (void *)dentry->d_name.name);
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
""")

# event data
TASK_COMM_LEN = 16  # linux/sched.h
NAME_MAX = 255  # uapi/linux/limits.h
class Data(ct.Structure):
    _fields_ = [
        ("pid", ct.c_ulonglong),
        ("comm", ct.c_char * TASK_COMM_LEN),
        ("objname", ct.c_char * NAME_MAX)
    ]

# header
print("orangefs creates... Hit Ctrl-C to end.")
print("%-8s %-6s %-16s %s" % ("TIME", "PID", "COMM", "OBJNAME"))

# process event
# print_event is the callback registered via open_perf_buffer below...
# cpu refers to a set of per-cpu ring buffers that will receive the event data.
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    print("%-8s %-6d %-16s %s" % (strftime("%H:%M:%S"), event.pid, event.comm,
        event.objname))

# read events
b["events"].open_perf_buffer(print_event)
while 1:
    b.kprobe_poll()




-Mike

On Mon, Nov 28, 2016 at 9:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 28, 2016 at 03:46:51PM -0700, Ross Zwisler wrote:
>> On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote:
>> > On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
>> > > Tracepoints are the standard way to capture debugging and tracing
>> > > information in many parts of the kernel, including the XFS and ext4
>> > > filesystems.  Create a tracepoint header for FS DAX and add the first DAX
>> > > tracepoints to the PMD fault handler.  This allows the tracing for DAX to
>> > > be done in the same way as the filesystem tracing so that developers can
>> > > look at them together and get a coherent idea of what the system is doing.
>> > >
>> > > I added both an entry and exit tracepoint because future patches will add
>> > > tracepoints to child functions of dax_iomap_pmd_fault() like
>> > > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
>> > > be wrapped by the parent function tracepoints so the code flow is more
>> > > easily understood.  Having entry and exit tracepoints for faults also
>> > > allows us to easily see what filesystem functions were called during the
>> > > fault.  These filesystem functions get executed via iomap_begin() and
>> > > iomap_end() calls, for example, and will have their own tracepoints.
>> > >
>> > > For PMD faults we primarily want to understand the faulting address and
>> > > whether it fell back to 4k faults.  If it fell back to 4k faults the
>> > > tracepoints should let us understand why.
>> > >
>> > > I named the new tracepoint header file "fs_dax.h" to allow for device DAX
>> > > to have its own separate tracing header in the same directory at some
>> > > point.
>> > >
>> > > Here is an example output for these events from a successful PMD fault:
>> > >
>> > > big-2057  [000] ....   136.396855: dax_pmd_fault: shared mapping write
>> > > address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
>> > > max_pgoff 0x1400
>> > >
>> > > big-2057  [000] ....   136.397943: dax_pmd_fault_done: shared mapping write
>> > > address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200
>> > > max_pgoff 0x1400 NOPAGE
>> >
>> > Can we make the output use the same format as most of the filesystem
>> > code? i.e. the output starts with backing device + inode number like
>> > so:
>> >
>> >     xfs_ilock:            dev 8:96 ino 0x493 flags ILOCK_EXCL....
>> >
>> > This way we can filter the output easily across both dax and
>> > filesystem tracepoints with 'grep "ino 0x493"'...
>>
>> I think I can include the inode number, which I have via mapping->host.  Am I
>> correct in assuming "struct inode.i_ino" will always be the same as
>> "struct xfs_inode.i_ino"?
>
> Yes - just use inode.i_ino.
>
>> Unfortunately I don't have access to the major/minor (the dev_t) until I call
>> iomap_begin().
>
> In general, filesystem tracing uses inode->i_sb->s_dev as the
> identifier. NFS, gfs2, XFS, ext4 and others all use this.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/dax.c b/fs/dax.c
index cc8a069..1aa7616 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -35,6 +35,9 @@ 
 #include <linux/iomap.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs_dax.h>
+
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -1310,6 +1313,16 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	loff_t pos;
 	int error;
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
+
+	trace_dax_pmd_fault(vma, address, flags, pgoff, max_pgoff, 0);
+
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
 		goto fallback;
@@ -1320,16 +1333,10 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
 		goto fallback;
 
-	/*
-	 * Check whether offset isn't beyond end of file now. Caller is
-	 * supposed to hold locks serializing us with truncate / punch hole so
-	 * this is a reliable test.
-	 */
-	pgoff = linear_page_index(vma, pmd_addr);
-	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
-
-	if (pgoff > max_pgoff)
-		return VM_FAULT_SIGBUS;
+	if (pgoff > max_pgoff) {
+		result = VM_FAULT_SIGBUS;
+		goto out;
+	}
 
 	/* If the PMD would extend beyond the file size */
 	if ((pgoff | PG_PMD_COLOUR) > max_pgoff)
@@ -1400,6 +1407,8 @@  int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		split_huge_pmd(vma, pmd, address);
 		count_vm_event(THP_FAULT_FALLBACK);
 	}
+out:
+	trace_dax_pmd_fault_done(vma, address, flags, pgoff, max_pgoff, result);
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5f52c0..e373917 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1107,6 +1107,20 @@  static inline void clear_page_pfmemalloc(struct page *page)
 			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
 			 VM_FAULT_FALLBACK)
 
+#define VM_FAULT_RESULT_TRACE \
+	{ VM_FAULT_OOM,			"OOM" }, \
+	{ VM_FAULT_SIGBUS,		"SIGBUS" }, \
+	{ VM_FAULT_MAJOR,		"MAJOR" }, \
+	{ VM_FAULT_WRITE,		"WRITE" }, \
+	{ VM_FAULT_HWPOISON,		"HWPOISON" }, \
+	{ VM_FAULT_HWPOISON_LARGE,	"HWPOISON_LARGE" }, \
+	{ VM_FAULT_SIGSEGV,		"SIGSEGV" }, \
+	{ VM_FAULT_NOPAGE,		"NOPAGE" }, \
+	{ VM_FAULT_LOCKED,		"LOCKED" }, \
+	{ VM_FAULT_RETRY,		"RETRY" }, \
+	{ VM_FAULT_FALLBACK,		"FALLBACK" }, \
+	{ VM_FAULT_DONE_COW,		"DONE_COW" }
+
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
 #define VM_FAULT_GET_HINDEX(x) (((x) >> 12) & 0xf)
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
new file mode 100644
index 0000000..f9ed4eb
--- /dev/null
+++ b/include/trace/events/fs_dax.h
@@ -0,0 +1,61 @@ 
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fs_dax
+
+#if !defined(_TRACE_FS_DAX_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FS_DAX_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(dax_pmd_fault_class,
+	TP_PROTO(struct vm_area_struct *vma, unsigned long address,
+		unsigned int flags, pgoff_t pgoff, pgoff_t max_pgoff,
+		int result),
+	TP_ARGS(vma, address, flags, pgoff, max_pgoff, result),
+	TP_STRUCT__entry(
+		__field(unsigned long, vm_start)
+		__field(unsigned long, vm_end)
+		__field(unsigned long, vm_flags)
+		__field(unsigned long, address)
+		__field(unsigned int, flags)
+		__field(pgoff_t, pgoff)
+		__field(pgoff_t, max_pgoff)
+		__field(int, result)
+	),
+	TP_fast_assign(
+		__entry->vm_start = vma->vm_start;
+		__entry->vm_end = vma->vm_end;
+		__entry->vm_flags = vma->vm_flags;
+		__entry->address = address;
+		__entry->flags = flags;
+		__entry->pgoff = pgoff;
+		__entry->max_pgoff = max_pgoff;
+		__entry->result = result;
+	),
+	TP_printk("%s mapping %s address %#lx vm_start %#lx vm_end %#lx "
+		"pgoff %#lx max_pgoff %#lx %s",
+		__entry->vm_flags & VM_SHARED ? "shared" : "private",
+		__entry->flags & FAULT_FLAG_WRITE ? "write" : "read",
+		__entry->address,
+		__entry->vm_start,
+		__entry->vm_end,
+		__entry->pgoff,
+		__entry->max_pgoff,
+		__print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE)
+	)
+)
+
+#define DEFINE_PMD_FAULT_EVENT(name) \
+DEFINE_EVENT(dax_pmd_fault_class, name, \
+	TP_PROTO(struct vm_area_struct *vma, unsigned long address, \
+		unsigned int flags, pgoff_t pgoff, pgoff_t max_pgoff, \
+		int result), \
+	TP_ARGS(vma, address, flags, pgoff, max_pgoff, result))
+
+DEFINE_PMD_FAULT_EVENT(dax_pmd_fault);
+DEFINE_PMD_FAULT_EVENT(dax_pmd_fault_done);
+
+
+#endif /* _TRACE_FS_DAX_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>