
[rfc] larger batches for crc32c

Message ID 20161028031747.68472ac7@roar.ozlabs.ibm.com (mailing list archive)
State New, archived

Commit Message

Nicholas Piggin Oct. 27, 2016, 4:17 p.m. UTC
Hi guys,

We're seeing crc32c_le show up in xfs log checksumming on a MySQL benchmark
on powerpc. I could reproduce similar overheads with dbench as well.

1.11%  mysqld           [kernel.vmlinux]            [k] __crc32c_le
        |
        ---__crc32c_le
           |          
            --1.11%--chksum_update
                      |          
                       --1.11%--crypto_shash_update
                                 crc32c
                                 xlog_cksum
                                 xlog_sync
                                 _xfs_log_force_lsn
                                 xfs_file_fsync
                                 vfs_fsync_range
                                 do_fsync
                                 sys_fsync
                                 system_call
                                 0x17738
                                 0x17704
                                 os_file_flush_func
                                 fil_flush

As a rule, it helps the crc implementation if it can operate on as large a
chunk as possible (alignment, startup overhead, etc). So I did a quick hack
at getting XFS checksumming to feed crc32c() with larger chunks, by setting
the existing crc to 0 before running over the entire buffer. Together with
some small work on the powerpc crc implementation, crc drops below 0.1%.

I don't know if something like this would be acceptable? It's not pretty,
but I didn't see an easier way.

Comments

L A Walsh Nov. 2, 2016, 2:18 a.m. UTC | #1
Snipping in various places:

Dave Chinner wrote:
> directory operations:
> 	lookups = 79418,  creates = 460612
> 	removes = 460544, getdents = 5685160
>
> SO, we have 80k directory lookups to instantiate an inode, but 460k
> creates and removes, and 5.6M readdir calls. That's a lot of readdir
> calls....
>   
>> trans 0 3491960 0
>>     
> 3.5 million asynchronous transaction commits. That's a lot, so far I
> can account for about 1.5M of them (creates, removes and a handful
> of data extents). Oh, a couple of million data writes - I bet the
> rest are timestamp/i_version updates.
>   
> [...]
> log force sleeps = 143932               (fsync waits!)
> ...
> Identifying what that is will give you some idea
> of how to reduce the fsync load and hence potentially decrease the
> log and CRC overhead that it is causing....
>   
----
    Back when the free-inode performance enhancement was introduced,
it was required that crc'ing of metadata also be enabled.  Have
these options been separated yet?  If so, save yourself some reading
and just get the "thanks!", and ignore the rest of this, but if
not... please read on, and please re-consider...

    I asked for a separation of these options on the basis that
crc'ing of the file system was not something I wanted on my files due
to performance and data security and recovery considerations.  I was
told that the crc'ing of the data wouldn't be noticeable (something I
was skeptical of), and that the options wouldn't be offered
separately.

    My inner-cynic thought that this was because many, if not most
users wouldn't need the crc and it would only be another source of
problems -- one that would disable data-access if the crc counts
couldn't be verified. 

    I really, still, don't want the crc's on my disks, I don't see
that they would provide any positive benefit in my usage -- nor in
many uses of xfs -- which ISN'T to say I can't see the need and/or
desire to have them for many classes of xfs users -- just not one
that I belong to at this point.  I was more worried about performance
than anything else (always trying to eke the most out of a fixed budget!).

    I see the speedup from the free-inode tree in saving the
time during the real-time-search for free space, but only saw the crc
calculations as time-wasting overhead for the customer not needing
such integrity guarantees. 

    Is the free-space inode feature anything more than something to
offset the speed lost by crc calculations?  Wouldn't xfs be more
flexible if it could be tuned for performance OR integrity (or
a mix of both)?  Even if someone only wants to support
the combination in some official release, allowing those options
to be selected separately would allow better testing, as well as
isolation of features for specific workloads and benchmarks.

L. Walsh


--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Nov. 3, 2016, 8:29 a.m. UTC | #2
Hi Linda,

On Tue, Nov 01, 2016 at 07:18:32PM -0700, L.A. Walsh wrote:
> Snipping in various places:
> 
> Dave Chinner wrote:
> >directory operations:
> >	lookups = 79418,  creates = 460612
> >	removes = 460544, getdents = 5685160
> >
> >SO, we have 80k directory lookups to instantiate an inode, but 460k
> >creates and removes, and 5.6M readdir calls. That's a lot of readdir
> >calls....
> >>trans 0 3491960 0
> >3.5 million asynchronous transaction commits. That's a lot, so far I
> >can account for about 1.5M of them (creates, removes and a handful
> >of data extents). Oh, a couple of million data writes - I bet the
> >rest are timestamp/i_version updates.
> >[...]
> >log force sleeps = 143932               (fsync waits!)
> >...
> >Identifying what that is will give you some idea
> >of how to reduce the fsync load and hence potentially decrease the
> >log and CRC overhead that it is causing....
> ----
>    Back when the free-inode performance enhancement was introduced,
> it was required that crc'ing of metadata also be enabled.  Have
> these options been separated yet?

No, and they won't be. The answer has not changed.

.....

>    I really, still, don't want the crc's on my disks, I don't see
> that they would provide any positive benefit in my usage -- nor in
> many uses of xfs -- which ISN'T to say I can't see the need and/or
> desire to have them for many classes of xfs users -- just not one
> that I belong to at this point.  I was more worried about performance
> than anything else (always trying to eke the most out of a fixed budget!).

You may not want CRC's, but they /aren't for you/.  The benefits of
CRCs are realised when things go wrong, not when things go right.
IOWs, CRCs are most useful for the people supporting XFS as they
provide certainty about where errors and corruptions have
originated.

As most users never have things go wrong, all they think is "CRCs
are unnecessary overhead". It's just like backups - how many people
don't make backups because they cost money right now and there's no
tangible benefit until something goes wrong which almost never
happens?

>    I see the speedup from the free-inode tree in saving the
> time during the real-time-search for free space, but only saw the crc
> calculations as time-wasting overhead for the customer not needing
> such integrity guarantees.

Exactly my point. Humans are terrible at risk assessment and
mitigation because most people are unaware of the unconscious
cognitive biases that affect this sort of decision making.

>    Is the free-space inode feature anything more than something to
> offset the speed lost by crc calculations?

That's not what the free inode btree does. It actually slows down
allocation on an empty filesystem and trades off that increase in
"empty filesystem" overhead for significantly better aged filesystem
inode allocation performance. i.e. the finobt provides more
deterministic inode allocation overhead, not "faster" allocation.

Let me demonstrate with some numbers on empty filesystem create
rate:

			create rate	sys CPU time	write rate
			(files/s)	(seconds)	  (MB/s)
crc = 0, finobt = 0:	238943		  2629		 ~200
crc = 1, finobt = 0:	231582		  2711		  ~40
crc = 1, finobt = 1:	232563		  2766		  ~40
*hacked* crc disable:   231435		  2789		  ~40

Std deviation of the file create rate is about +/-10%, so the
differences between the create rate results (3%) are not
significant. Yes, there's a slight decrease in performance with CRCs
enabled, but that's because of the increased btree footprint  due to
the increased btree header size in the v5 format. Hence lookups,
modifications, etc take slightly longer and cost more CPU.

We can see that the system CPU time increased by 3.1% with the
"addition of CRCs".  The CPU usage increases by a further 2% with
the addition of the free inode btree, which should give you an idea
of how much CPU time even a small btree consumes. The allocated
inode btree is huge in comparison to the finobt in this workload,
which is why even a small change in header size (when CRCs are
enabled) makes a large difference in CPU usage.

To verify that CRC has no significant impact on inode allocation,
let's look at the actual CPU being used by the CRC calculations in
this workload:

  0.28%  [kernel]  [k] crc32c_pcl_intel_update

Only a small proportion of the entire increase in CPU consumption
comes from "turning on CRCs". Indeed, the "*hacked* CRC
disable" results are from skipping CRC calculations in the code
altogether and returning "verify ok" without calculating them. The
create rate is identical to the crc=1,finobt=1 numbers and the CPU
usage is /slightly higher/ than when CRCs are enabled.

IOWs, for most workloads CRCs have no impact on filesystem
performance.  Yes, there are cases where there is some impact (as
Nick has pointed out) but these are situations where minor
optimisations can make a big difference, such as Nick's suggestion
to zero before write rather than do two partial calcs. This is
not a severe issue - it's just the normal process of iterative
improvement.

Cheers,

Dave.
L A Walsh Nov. 3, 2016, 4:04 p.m. UTC | #3
Dave Chinner wrote:
> 
> As most users never have things go wrong, all they think is "CRCs
> are unnecessary overhead". It's just like backups - how many people
> don't make backups because they cost money right now and there's no
> tangible benefit until something goes wrong which almost never
> happens?
----
	But it's not like backups.  You can't run a util
program upon discovering bad CRC's that will fix the file system
because the file system is no longer usable.  That means you
have to restore from backup.  Thus, for those keeping
backups, there is no benefit, as they'll have to restore
from backups in either case.

> Exactly my point. Humans are terrible at risk assessment and
> mitigation because most people are unaware of the unconcious
> cognitive biases that affect this sort of decision making.
---
	My risk is near 0 since my file systems are monitored
by a raid controller with read patrols made over the data on
a periodic basis.  I'll assert that the chance of data randomly
going corrupt is much higher because there is a lot more data than
metadata.  On top of that, because I keep backups, my risk is,
at worst, the same without crc's as with them.

>  It actually slows down
> allocation on an empty filesystem and trades off that increase in
> "empty filesystem" overhead for significantly better aged filesystem
> inode allocation performance.
----
	Ok, let's see, ages of my file systems:
4 from 2009 (7 years)
1 from 2013 (3 years)
9 from 2014 (2 years)
---
	I don't think I have any empty or new filesystems
(FWIW, I store the creation time in the UUID).

> i.e. the finobt provides more
> deterministic inode allocation overhead, not "faster" allocation.
> 
> Let me demonstrate with some numbers on empty filesystem create
> rate:
> 
> 			create rate	sys CPU time	write rate
> 			(files/s)	(seconds)	  (MB/s)
> crc = 0, finobt = 0:	238943		  2629		 ~200
> crc = 1, finobt = 0:	231582		  2711		  ~40
> crc = 1, finobt = 1:	232563		  2766		  ~40
> *hacked* crc disable:   231435	  2789		  ~40


> We can see that the system CPU time increased by 3.1% with the
> "addition of CRCs".  The CPU usage increases by a further 2% with
> the addition of the free inode btree,
---
	On an empty file system or older ones that are >50%
used?  It's *nice* to be able to run benchmarks, but not allowing
crc to be disabled disables that possibility -- and that's
sorta the point.  In order to prove your point, you created a
benchmark with crc's disabled.  But the thing about benchmarks
is making it so others can reproduce your results.  That's
the problem.  If I could do the same benchmarks and get
similar results, I'd give up on finobt as not being worth it.

	But I'm not able to run such tests on my workload
and/or filesystems.  The common advice about performance numbers
and how they are affected by options is to do benchmarks
on your own systems with your own workload and see if the option
helps.  That's what I want to do.  Why deny that?



> which should give you an idea
> of how much CPU time even a small btree consumes.
---
	In a non-real-world case on empty file systems. How
does it work in the real world on file systems like mine?
I know the MB/s isn't close, w/my max sustained I/O rates
being about 1GB/s (all using direct i/o -- rate drops
significantly if I use kernel buffering).  Even not
pre-allocating and defragmenting the test file will noticeably
affect I/O rates.

	Showing the result on an empty file
system is when finobt would have the *least* effect, since
it is when the kernel has to search for space that things
slow down, but if the free space is pre-allocated in a 
dedicated b-tree, then the kernel doesn't have to search --
which would be a much bigger difference than on an empty
file system.


> The allocated
> inode btree is huge in comparison to the finobt in this workload,
> which is why even a small change in header size (when CRCs are
> enabled) makes a large difference in CPU usage.
> 
> To verify that CRC has no significant impact on inode allocation,
> let's look at the actual CPU being used by the CRC calculations in
> this workload:
> 
>   0.28%  [kernel]  [k] crc32c_pcl_intel_update
---
	And how much is spent searching for free space?
On multi-gig files it can reduce I/O rates by 30% or more.

> 
> Only a small proportion of the entire increase in CPU consumption
> that comes from "turning on CRCS". Indeed, the "*hacked* CRC
> disable" results are from skipping CRC calculations in the code
> altogether and returning "verify ok" without calculating them. The
> create rate is identical to the crc=1,finobt=1 numbers and the CPU
> usage is /slightly higher/ than when CRCs are enabled.
> 
> IOWs, for most workloads CRCs have no impact on filesystem
> performance.  
---
	Too bad no one can test the effect on their
own workloads, though if not doing crc's takes more CPU, then
it sounds like an algorithm problem: crc calculations don't
take "negative time", and a benchmark showing they do indicates
something else is causing the slowdown.

> Cheers,
> Dave.
----
Sigh... and Cheers to you too! ;-)
Linda

Eric Sandeen Nov. 3, 2016, 6:15 p.m. UTC | #4
On 11/3/16 11:04 AM, L.A. Walsh wrote:
> 
> 
> Dave Chinner wrote:
>>
>> As most users never have things go wrong, all they think is "CRCs
>> are unnecessary overhead". It's just like backups - how many people
>> don't make backups because they cost money right now and there's no
>> tangible benefit until something goes wrong which almost never
>> happens?
> ----
>     But it's not like backups.  You can't run a util
> program upon discovering bad CRC's that will fix the file system
> because the file system is no longer usable.

Sure you can - xfs_repair knows what to do in such a case.

It's not guaranteed that you get /all/ your data back, in every
corrupted fs situation, but that's not because of CRCs - a bad CRC
is just the messenger about filesystem corruption.

> That means you
> have to restore from backup.  Thus, for those keeping
> backups, there is no benefit, as they'll have to restore
> from backups in either case. 

Again, as Dave said, the format changes implemented for CRCs help
us support XFS.  It's not something /directly/ useful to the user,
other than possibly stopping a corruption before it gets worse, by
early detection.

If you have a corruption which is so bad that you have to go to
backups, that would be true with or without CRCs.

-Eric
Dave Chinner Nov. 3, 2016, 11 p.m. UTC | #5
On Thu, Nov 03, 2016 at 09:04:42AM -0700, L.A. Walsh wrote:
> 
> 
> Dave Chinner wrote:
> >
> >As most users never have things go wrong, all they think is "CRCs
> >are unnecessary overhead". It's just like backups - how many people
> >don't make backups because they cost money right now and there's no
> >tangible benefit until something goes wrong which almost never
> >happens?
> ----
> 	But it's not like backups.  You can't run a util
> program upon discovering bad CRC's that will fix the file system
> because the file system is no longer usable.

xfs_repair will fix it, just like it will fix the same corruption
on non-CRC filesystems.

> >Exactly my point. Humans are terrible at risk assessment and
> >mitigation because most people are unaware of the unconscious
> >cognitive biases that affect this sort of decision making.
> ---
> 	My risk is near 0 since my file systems are monitored
> by a raid controller with read patrols made over the data on
> a periodic basis.

If I had a dollar for every time someone said "hardware raid
protects me" I'd have retired years ago.

Media scrubbing does not protect against misdirected writes,
corruptions to/from the storage, memory errors, software bugs, bad
compilers (yes, we've already had XFS CRCs find a compiler bug),
etc.

> I'll assert that the chance of data randomly
> going corrupt is much higher because there is a lot more data than
> metadata.  On top of that, because I keep backups, my risk is,
> at worst, the same without crc's as with them.

The /scale of disaster/ for metadata corruption is far higher than
for file data - a single bit error can trash the entire filesystem.

You may not care about this, but plenty of other XFS users do.

> i.e. the finobt provides more
> >deterministic inode allocation overhead, not "faster" allocation.
> >
> >Let me demonstrate with some numbers on empty filesystem create
> >rate:
> >
> >			create rate	sys CPU time	write rate
> >			(files/s)	(seconds)	  (MB/s)
> >crc = 0, finobt = 0:	238943		  2629		 ~200
> >crc = 1, finobt = 0:	231582		  2711		  ~40
> >crc = 1, finobt = 1:	232563		  2766		  ~40
> >*hacked* crc disable:   231435	  2789		  ~40
> 
> 
> >We can see that the system CPU time increased by 3.1% with the
> >"addition of CRCs".  The CPU usage increases by a further 2% with
> >the addition of the free inode btree,
> ---
> 	On an empty file system or older ones that are >50%
> used?
>
> It's *nice* to be able to run benchmarks, but not allowing
> crc to be disabled disables that possibility -- and that's
> sorta the point. 

If you want to reproduce the above numbers, the script is below.
You don't need the "CRC disable" hack to test whether CRCs have
overhead or not, CPU profiles are sufficient for that. But, really,
I don't care about whether you can reproduce these tests, because
microbenchmarks don't matter to production systems.

That is, you haven't provided any numbers to back up your
assertions that CRCs have excessive overhead for /your workload/.
For me to care about what you are saying, you need to demonstrate a
performance degradation between v4 and v5 filesystem formats for
/your workloads/.

I can't do this for you. I don't know what your workload is, nor
what hardware you use.  *Give me numbers* that I can work with -
something we can measure and fix. You need to do the work to show
there's an issue - I can't do that for you, and no amount of
demanding that I do will change that.

> >IOWs, for most workloads CRCs have no impact on filesystem
> >performance.
> ---
> 	Too bad no one can test the effect on their
> own workloads, though if not doing crc's takes more CPU, then
> it sounds like an algorithm problem: crc calculations don't
> take "negative time", and a benchmark showing they do indicates
> something else is causing the slowdown.

I'm guessing that you aren't aware of how memory access patterns
affect performance on modern CPUs. i.e. sequential memory access to
a structure is much faster than random memory access because the
hardware prefetchers detect the sequential accesses and minimise
cache miss latency.

e.g. for a typical 4k btree block, doing a binary search on a cache
cold block requires 10-12 cache misses for a complete search. However,
if we run a CRC check on it, we take a couple of cache misses before
the hardware prefetcher kicks in, then it's just CPU time to run the
calc. Then the binary search doesn't have a cache miss at all. Hence
if the CRC calc is fast enough (and for h/w accel it is fast enough)
adding CRCs will make the code run faster....

This is actually a well known behaviour of modern CPUs - for
years we've been using memset() to initialise structures when it's
technically not necessary because it's the fastest way to prime the
CPU caches for upcoming accesses to that structure.

Cheers,

Dave.
Nicholas Piggin Nov. 4, 2016, 2:28 a.m. UTC | #6
On Fri, 4 Nov 2016 11:12:48 +1100
Dave Chinner <david@fromorbit.com> wrote:

> On Fri, Oct 28, 2016 at 03:17:47AM +1100, Nicholas Piggin wrote:
> > Hi guys,
> > 
> > We're seeing crc32c_le show up in xfs log checksumming on a MySQL benchmark
> > on powerpc. I could reproduce similar overheads with dbench as well.
> > 
> > 1.11%  mysqld           [kernel.vmlinux]            [k] __crc32c_le
> >         |
> >         ---__crc32c_le
> >            |          
> >             --1.11%--chksum_update
> >                       |          
> >                        --1.11%--crypto_shash_update
> >                                  crc32c
> >                                  xlog_cksum
> >                                  xlog_sync
> >                                  _xfs_log_force_lsn
> >                                  xfs_file_fsync
> >                                  vfs_fsync_range
> >                                  do_fsync
> >                                  sys_fsync
> >                                  system_call
> >                                  0x17738
> >                                  0x17704
> >                                  os_file_flush_func
> >                                  fil_flush
> > 
> > As a rule, it helps the crc implementation if it can operate on as large a
> > chunk as possible (alignment, startup overhead, etc). So I did a quick hack
> > at getting XFS checksumming to feed crc32c() with larger chunks, by setting
> > the existing crc to 0 before running over the entire buffer. Together with
> > some small work on the powerpc crc implementation, crc drops below 0.1%.
> > 
> > I don't know if something like this would be acceptable? It's not pretty,
> > but I didn't see an easier way.  
> 
> Here's an alternative, slightly cleaner patch that optimises the CRC
> update side but leaves the verify side as it is. I've not yet
> decided exactly what is cleanest for the xlog_cksum() call in log
> recovery, but that won't change the performance of the code.  Can
> you give this a run through, Nick?

Hi Dave,

Yeah, sorry for the slow response, I've been getting a more realistic
MySQL benchmark setup working (what I had reproduced appeared to
be the same overhead, but I wanted to get something better to retest
with). So your patch comes at a good time (and thanks for working on
it). I'll see if I can get something running and have results for you
by next week.

Thanks,
Nick
L A Walsh Nov. 4, 2016, 6:56 a.m. UTC | #7
Dave Chinner wrote:
> If I had a dollar for every time someone said "hardware raid
> protects me" I'd have retired years ago.
----
	It's not a sole line of protection, but it does a fair
job of *detection*.  

> Media scrubbing does not protect against misdirected writes,
> corruptions to/from the storage, memory errors, software bugs, bad
> compilers (yes, we've already had XFS CRCs find a compiler bug),
> etc.
---
	Crc's don't _protect_ against such things either -- they
can allow detection, but for protection ... I make do with
xfsdump.  As for crc's -- I already had the feature
prevent me from creating and using new partitions where it
didn't like me setting the disk UUID when creating a
new partition -- something that I heard
would be fixed -- but something that prevented me from testing
the new finobt feature I was interested in.

I'm sure I'm not alone in wanting to test in my environment, but 
_both_ with and without the crc option.  

	I seem to remember, recently, that it took other kernel
developers to disable xfs panicking the entire system on problem
detection, with the idea of only disabling what was broken.

	Along the same lines, if I am using crc's and badness is
detected, I'd want to be able to keep the disk online, though
in "read-only" mode to allow me to explore what was wrong and find
the best solution. 
 
> You may not care about this, but plenty of other XFS users do.
---
	That's a crock -- I never said I didn't care about having
it -- just that I wanted to be able to run finobt with crc's being
disabled.  I noted after my last response, that you said your crc
testing only jumped around the tests to compare crc's -- which
sounded like they were still being generated but not checked -- no 
wonder the no-crc option was the slowest of tested options.
 
> I can't do this for you. I don't know what your workload is, nor
> what hardware you use.  *Give me numbers* that I can work with -
> something we can measure and fix. You need to do the work to show
> there's an issue - I can't do that for you, and no amount of
> demanding that I do will change that.

I don't want to bother others with my testing, I just want the
platform to be open enough for me to quietly do my own testing
and go forward from there.  I may be ignorant, but I'm also
picky about things I care about -- thus I prefer to do _some_
things myself, thank-you.

I realize that closed platforms, where the undesirable
is bundled with the desirable and where end-users have no choice,
are the wave of the future.  I realize that in many cases, large
corporations are pushing these changes.

Microsoft's last "open-update" that allowed selecting what 
updates you wanted is history and now you are forced to take
everything or nothing.  Such offerings are not made for the benefit
of the users -- it's not something they want.  But it doesn't matter,
it's where large companies are pushing things so consumers will be
easier to control.  I wouldn't be surprised if this feature bundling
was being pushed by project sponsors and that developers had little
choice and I may not either, as I don't have the smallest fraction
of the time I would need to add in options or changes to the
SW I use, just to hold the "status quo", let alone make improvements. 

All I can usually do is ask and later, "re-ask" -- since policies
in effect at one point in time may not be present later.

Cheers!
-l



Eric Sandeen Nov. 4, 2016, 5:37 p.m. UTC | #8
On 11/4/16 1:56 AM, L.A. Walsh wrote:
> 
> 
> Dave Chinner wrote:
>> If I had a dollar for every time someone said "hardware raid
>> protects me" I'd have retired years ago.
> ----
>     It's not a sole line of protection, but it does a fair
> job of *detection*. 
>> Media scrubbing does not protect against misdirected writes,
>> corruptions to/from the storage, memory errors, software bugs, bad
>> compilers (yes, we've already had XFS CRCs find a compiler bug),
>> etc.
> ---
>     Crc's don't _protect_ against such things either -- they
> can allow detection, but for protection ... I make do with
> xfsdump.  As for crc's -- I already had the feature
> prevent me from creating and using new partitions where it
> didn't like me setting the disk UUID when creating a
> new partition -- something that I heard
> would be fixed -- but something that prevented me from testing
> the new finobt feature I was interested in.

That has been fixed for over a year, now, FWIW.

v4.3 kernel / v4.2.0 xfsprogs

> I'm sure I'm not alone in wanting to test in my environment, but _both_ with and without the crc option. 
>     I seem to remember, recently, that it took other kernel
> developers to disable xfs panicking the entire system on problem
> detection, with the idea of only disabling what was broken.

Sorry, what?

>     Along the same lines, if I am using crc's and badness is
> detected, I'd want to be able to keep the disk online, though
> in "read-only" mode to allow me to explore what was wrong and find
> the best solution.
>> You may not care about this, but plenty of other XFS users do.
> ---
>     That's a crock -- I never said I didn't care about having
> it -- just that I wanted to be able to run finobt with crc's being
> disabled. 

If you want an untestable a la carte mishmash of every option, there
are other filesystems which will do that for you.  They have lots of
weird broken corner cases as a result.

The xfs development crew has limited resources, and part of the decision
to /not/ make infinite combinations of features available was so that
we can get reliable coverage in testing and produce a robust result.

But we already went through all this on a similar thread 1 year ago...

> I noted after my last response, that you said your crc
> testing only jumped around the tests to compare crc's -- which
> sounded like they were still being generated but not checked -- no wonder the no-crc option was the slowest of tested options.
> 
>> I can't do this for you. I don't know what your workload is, nor
>> what hardware you use.  *Give me numbers* that I can work with -
>> something we can measure and fix. You need to do the work to show
>> there's an issue - I can't do that for you, and no amount of
>> demanding that I do will change that.
> 
> I don't want to bother others with my testing, I just want the
> platform to be open enough for me to quietly do my own testing
> and go forward from there.  I may be ignorant, but I'm also
> picky about things I care about -- thus I prefer to do _some_
> things myself, thank-you.
> 
> I realize that closed-platforms where the undesirable is bundled with the desirable and where end-users have no choice
> is the wave of the future.  I realize that in many cases, large
> corporations are pushing these changes.

Oh come on, now.

Open source doesn't mean everyone gets a pet pony in their favorite
color.  Failing to offer said ponies to each user doesn't make 
this a "closed platform" and I resent the assertion on behalf of
Dave and other xfs developers.  We're performing a herculean task
with a skeleton crew.  Dave, in particular, has performed tirelessly
for /years/ helping to create what is now arguably the highest-performing,
most robust, most scalable filesystem available on Linux.

A large part of XFS's success has been due to wise design and development
decisions, /including/ the decision to keep the test matrix under control.

And I'm not sure what to make of your mention of "large corporations"
but I can assure you that decisions like this are made within the xfs
developer community, without outside influence.

-Eric

> -l

Patch

diff --git a/fs/xfs/libxfs/xfs_cksum.h b/fs/xfs/libxfs/xfs_cksum.h
index fad1676..88f4431 100644
--- a/fs/xfs/libxfs/xfs_cksum.h
+++ b/fs/xfs/libxfs/xfs_cksum.h
@@ -9,20 +9,9 @@ 
  * cksum_offset parameter.
  */
 static inline __uint32_t
-xfs_start_cksum(char *buffer, size_t length, unsigned long cksum_offset)
+xfs_start_cksum(void *buffer, size_t length)
 {
-	__uint32_t zero = 0;
-	__uint32_t crc;
-
-	/* Calculate CRC up to the checksum. */
-	crc = crc32c(XFS_CRC_SEED, buffer, cksum_offset);
-
-	/* Skip checksum field */
-	crc = crc32c(crc, &zero, sizeof(__u32));
-
-	/* Calculate the rest of the CRC. */
-	return crc32c(crc, &buffer[cksum_offset + sizeof(__be32)],
-		      length - (cksum_offset + sizeof(__be32)));
+	return crc32c(XFS_CRC_SEED, buffer, length);
 }
 
 /*
@@ -42,22 +31,29 @@  xfs_end_cksum(__uint32_t crc)
  * Helper to generate the checksum for a buffer.
  */
 static inline void
-xfs_update_cksum(char *buffer, size_t length, unsigned long cksum_offset)
+xfs_update_cksum(void *buffer, size_t length, unsigned long cksum_offset)
 {
-	__uint32_t crc = xfs_start_cksum(buffer, length, cksum_offset);
+	__le32 *crcp = buffer + cksum_offset;
 
-	*(__le32 *)(buffer + cksum_offset) = xfs_end_cksum(crc);
+	*crcp = 0; /* set to 0 before calculating checksum */
+	*crcp = xfs_end_cksum(xfs_start_cksum(buffer, length));
 }
 
 /*
  * Helper to verify the checksum for a buffer.
  */
 static inline int
-xfs_verify_cksum(char *buffer, size_t length, unsigned long cksum_offset)
+xfs_verify_cksum(void *buffer, size_t length, unsigned long cksum_offset)
 {
-	__uint32_t crc = xfs_start_cksum(buffer, length, cksum_offset);
+	__le32 *crcp = buffer + cksum_offset;
+	__le32 crc = *crcp;
+	__le32 new_crc;
+
+	*crcp = 0;
+	new_crc = xfs_end_cksum(xfs_start_cksum(buffer, length));
+	*crcp = crc; /* restore original CRC in place */
 
-	return *(__le32 *)(buffer + cksum_offset) == xfs_end_cksum(crc);
+	return new_crc == crc;
 }
 
 #endif /* _XFS_CKSUM_H */
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 8de9a3a..efd8daa 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -419,15 +419,11 @@  xfs_dinode_calc_crc(
 	struct xfs_mount	*mp,
 	struct xfs_dinode	*dip)
 {
-	__uint32_t		crc;
-
 	if (dip->di_version < 3)
 		return;
 
 	ASSERT(xfs_sb_version_hascrc(&mp->m_sb));
-	crc = xfs_start_cksum((char *)dip, mp->m_sb.sb_inodesize,
-			      XFS_DINODE_CRC_OFF);
-	dip->di_crc = xfs_end_cksum(crc);
+	xfs_update_cksum(dip, mp->m_sb.sb_inodesize, XFS_DINODE_CRC_OFF);
 }
 
 /*
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 3b74fa0..4e242f0 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1658,7 +1658,7 @@  xlog_pack_data(
  * This is a little more complicated than it should be because the various
  * headers and the actual data are non-contiguous.
  */
-__le32
+void
 xlog_cksum(
 	struct xlog		*log,
 	struct xlog_rec_header	*rhead,
@@ -1667,10 +1667,9 @@  xlog_cksum(
 {
 	__uint32_t		crc;
 
-	/* first generate the crc for the record header ... */
-	crc = xfs_start_cksum((char *)rhead,
-			      sizeof(struct xlog_rec_header),
-			      offsetof(struct xlog_rec_header, h_crc));
+	/* first generate the crc for the record header with 0 crc... */
+	rhead->h_crc = 0;
+	crc = xfs_start_cksum(rhead, sizeof(struct xlog_rec_header));
 
 	/* ... then for additional cycle data for v2 logs ... */
 	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb)) {
@@ -1691,7 +1690,7 @@  xlog_cksum(
 	/* ... and finally for the payload */
 	crc = crc32c(crc, dp, size);
 
-	return xfs_end_cksum(crc);
+	rhead->h_crc = xfs_end_cksum(crc);
 }
 
 /*
@@ -1840,8 +1839,7 @@  xlog_sync(
 	}
 
 	/* calculcate the checksum */
-	iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header,
-					    iclog->ic_datap, size);
+	xlog_cksum(log, &iclog->ic_header, iclog->ic_datap, size);
 #ifdef DEBUG
 	/*
 	 * Intentionally corrupt the log record CRC based on the error injection
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2b6eec5..18ba385 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -432,7 +432,7 @@  xlog_recover_finish(
 extern int
 xlog_recover_cancel(struct xlog *);
 
-extern __le32	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
+extern void	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
 			    char *dp, int size);
 
 extern kmem_zone_t *xfs_log_ticket_zone;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 9b3d7c7..3f62d9a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -5114,8 +5114,13 @@  xlog_recover_process(
 {
 	int			error;
 	__le32			crc;
+	__le32			old_crc;
 
-	crc = xlog_cksum(log, rhead, dp, be32_to_cpu(rhead->h_len));
+	old_crc = rhead->h_crc;
+	rhead->h_crc = 0;
+	xlog_cksum(log, rhead, dp, be32_to_cpu(rhead->h_len));
+	crc = rhead->h_crc;
+	rhead->h_crc = old_crc;
 
 	/*
 	 * Nothing else to do if this is a CRC verification pass. Just return