
[00/53] Get rid of UTF-8 chars that can be mapped as ASCII

Message ID cover.1620641727.git.mchehab+huawei@kernel.org (mailing list archive)

Message

Mauro Carvalho Chehab May 10, 2021, 10:26 a.m. UTC
There are several UTF-8 characters in the kernel's documentation.

Several of them are due to the process of converting files from
DocBook, LaTeX, HTML and Markdown. They were probably introduced
by the conversion tools used at that time.

Other UTF-8 characters were added over time, but they're easily
replaceable by ASCII chars.

As Linux developers are all around the globe, and not everybody has UTF-8
as their default charset, it is better to use UTF-8 only in cases where it is
really needed.

The first 3 patches in this series were written manually, in order to solve
a few special cases.

The remaining patches in the series address such cases in *.rst files and
inside Documentation/ABI, using this perl map table to do the
charset conversion:

my %char_map = (
	0x2010 => '-',		# HYPHEN
	0xad   => '-',		# SOFT HYPHEN
	0x2013 => '-',		# EN DASH
	0x2014 => '-',		# EM DASH

	0x2018 => "'",		# LEFT SINGLE QUOTATION MARK
	0x2019 => "'",		# RIGHT SINGLE QUOTATION MARK
	0xb4   => "'",		# ACUTE ACCENT

	0x201c => '"',		# LEFT DOUBLE QUOTATION MARK
	0x201d => '"',		# RIGHT DOUBLE QUOTATION MARK

	0x2212 => '-',		# MINUS SIGN
	0x2217 => '*',		# ASTERISK OPERATOR
	0xd7   => 'x',		# MULTIPLICATION SIGN

	0xbb   => '>',		# RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

	0xa0   => ' ',		# NO-BREAK SPACE
	0xfeff => '',		# ZERO WIDTH NO-BREAK SPACE
);
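
As a rough illustration of how such a table can be applied, here is a minimal Python sketch. The series itself used a Perl script that is not shown in this message, so this is not that script; the names `CHAR_MAP` and `to_ascii` are made up for the example. It relies on `str.translate`, which accepts a dict keyed by code point:

```python
# Hypothetical Python rendering of the Perl %char_map above.
# Keys are Unicode code points, values are the ASCII replacements.
CHAR_MAP = {
    0x2010: "-",   # HYPHEN
    0x00AD: "-",   # SOFT HYPHEN
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",   # ACUTE ACCENT
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2212: "-",   # MINUS SIGN
    0x2217: "*",   # ASTERISK OPERATOR
    0x00D7: "x",   # MULTIPLICATION SIGN
    0x00BB: ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x00A0: " ",   # NO-BREAK SPACE
    0xFEFF: "",    # ZERO WIDTH NO-BREAK SPACE (dropped)
}


def to_ascii(text: str) -> str:
    """Replace every mapped code point; leave all other characters alone."""
    return text.translate(CHAR_MAP)


if __name__ == "__main__":
    sample = "\u201cCPU\u00a00\u201d \u2013 3\u00d74 \u00b5s"
    # The MICRO SIGN is not in the map, so it survives the conversion.
    print(to_ascii(sample))
```

Characters that are not in the table pass through unchanged, which is how a conversion like this can coexist with the list of kept characters.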

After the conversion, these UTF-8 chars will be kept:

	- U+00a9 ('©'): COPYRIGHT SIGN
	- U+00ac ('¬'): NOT SIGN		# only at Documentation/powerpc/transactional_memory.rst
	- U+00ae ('®'): REGISTERED SIGN
	- U+00b0 ('°'): DEGREE SIGN
	- U+00b1 ('±'): PLUS-MINUS SIGN
	- U+00b2 ('²'): SUPERSCRIPT TWO
	- U+00b5 ('µ'): MICRO SIGN
	- U+00b7 ('·'): MIDDLE DOT		# See below
	- U+00bd ('½'): VULGAR FRACTION ONE HALF
	- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
	- U+00df ('ß'): LATIN SMALL LETTER SHARP S
	- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
	- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
	- U+00e6 ('æ'): LATIN SMALL LETTER AE
	- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
	- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
	- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
	- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
	- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
	- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
	- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
	- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
	- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
	- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
	- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
	- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
	- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
	- U+03bc ('μ'): GREEK SMALL LETTER MU
	- U+2026 ('…'): HORIZONTAL ELLIPSIS
	- U+2122 ('™'): TRADE MARK SIGN
	- U+2191 ('↑'): UPWARDS ARROW
	- U+2192 ('→'): RIGHTWARDS ARROW
	- U+2193 ('↓'): DOWNWARDS ARROW
	- U+2264 ('≤'): LESS-THAN OR EQUAL TO
	- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
	- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
	- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
	- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
	- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
	- U+2b0d ('⬍'): UP DOWN BLACK ARROW
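
A list in this format can be produced with a few lines of Python. This is only a sketch of one way to do it, not the tool used for the series; the function name `non_ascii_report` is hypothetical, and the character names come from the standard `unicodedata` module:

```python
import unicodedata


def non_ascii_report(text: str) -> list:
    """Return one 'U+xxxx (char): NAME' line per distinct non-ASCII character."""
    found = sorted({ch for ch in text if ord(ch) > 0x7F})
    return [
        f"U+{ord(ch):04x} ('{ch}'): {unicodedata.name(ch, '<unnamed>')}"
        for ch in found
    ]


if __name__ == "__main__":
    for line in non_ascii_report("5 \u00b5s \u2192 10 \u00b5s"):
        print(line)
```

Running it over the text of every file under Documentation/ before and after a conversion would show which characters a given map removes and which remain.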

PS.: maintainers were bcc'd on patch 00/53, in order to reduce the
risk of patch 00 being rejected by list servers.

-

For U+00b7 ('·'): MIDDLE DOT, I opted to keep it in a few places:

- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt

  As this file will some day be converted to yaml, at which point the
  MIDDLE DOT will be removed, I guess it is not worth touching it.

- Documentation/scheduler/sched-deadline.rst

  There, it is used in a math expression, so it is better to keep it.

- Documentation/devicetree/bindings/media/video-interface-devices.yaml

  There, it is part of ASCII artwork.

- translations/zh_CN

  I prefer not touching it, as it might have some special meaning in Simplified Chinese.

Mauro Carvalho Chehab (53):
  docs: cdrom-standard.rst: get rid of uneeded UTF-8 chars
  docs: ABI: remove a meaningless UTF-8 character
  docs: ABI: remove some spurious characters
  docs: index.rst: avoid using UTF-8 chars
  docs: hwmon: avoid using UTF-8 chars
  docs: admin-guide: avoid using UTF-8 chars
  docs: admin-guide: media: ipu3.rst: avoid using UTF-8 chars
  docs: admin-guide: sysctl: kernel.rst: avoid using UTF-8 chars
  docs: admin-guide: perf: imx-ddr.rst: avoid using UTF-8 chars
  docs: admin-guide: pm: avoid using UTF-8 chars
  docs: trace: coresight: coresight-etm4x-reference.rst: avoid using
    UTF-8 chars
  docs: driver-api: avoid using UTF-8 chars
  docs: driver-api: fpga: avoid using UTF-8 chars
  docs: driver-api: iio: avoid using UTF-8 chars
  docs: driver-api: thermal: avoid using UTF-8 chars
  docs: driver-api: media: drivers: avoid using UTF-8 chars
  docs: driver-api: firmware: other_interfaces.rst: avoid using UTF-8
    chars
  docs: driver-api: nvdimm: btt.rst: avoid using UTF-8 chars
  docs: fault-injection: nvme-fault-injection.rst: avoid using UTF-8
    chars
  docs: usb: avoid using UTF-8 chars
  docs: process: avoid using UTF-8 chars
  docs: block: data-integrity.rst: avoid using UTF-8 chars
  docs: userspace-api: media: fdl-appendix.rst: avoid using UTF-8 chars
  docs: userspace-api: media: v4l: avoid using UTF-8 chars
  docs: userspace-api: media: dvb: avoid using UTF-8 chars
  docs: vm: zswap.rst: avoid using UTF-8 chars
  docs: filesystems: f2fs.rst: avoid using UTF-8 chars
  docs: filesystems: ext4: avoid using UTF-8 chars
  docs: kernel-hacking: avoid using UTF-8 chars
  docs: hid: avoid using UTF-8 chars
  docs: security: tpm: avoid using UTF-8 chars
  docs: security: keys: trusted-encrypted.rst: avoid using UTF-8 chars
  docs: riscv: vm-layout.rst: avoid using UTF-8 chars
  docs: networking: scaling.rst: avoid using UTF-8 chars
  docs: networking: devlink: devlink-dpipe.rst: avoid using UTF-8 chars
  docs: networking: device_drivers: avoid using UTF-8 chars
  docs: x86: avoid using UTF-8 chars
  docs: scheduler: sched-deadline.rst: avoid using UTF-8 chars
  docs: dev-tools: testing-overview.rst: avoid using UTF-8 chars
  docs: power: powercap: powercap.rst: avoid using UTF-8 chars
  docs: ABI: avoid using UTF-8 chars
  docs: doc-guide: contributing.rst: avoid using UTF-8 chars
  docs: PCI: acpi-info.rst: avoid using UTF-8 chars
  docs: gpu: avoid using UTF-8 chars
  docs: sound: kernel-api: writing-an-alsa-driver.rst: avoid using UTF-8
    chars
  docs: arm64: arm-acpi.rst: avoid using UTF-8 chars
  docs: infiniband: tag_matching.rst: avoid using UTF-8 chars
  docs: timers: no_hz.rst: avoid using UTF-8 chars
  docs: misc-devices: ibmvmc.rst: avoid using UTF-8 chars
  docs: firmware-guide: acpi: lpit.rst: avoid using UTF-8 chars
  docs: firmware-guide: acpi: dsd: graph.rst: avoid using UTF-8 chars
  docs: virt: kvm: avoid using UTF-8 chars
  docs: RCU: avoid using UTF-8 chars

 .../obsolete/sysfs-kernel-fadump_registered   |   2 +-
 .../obsolete/sysfs-kernel-fadump_release_mem  |   2 +-
 ...sfs-class-chromeos-driver-cros-ec-lightbar |   2 +-
 .../ABI/testing/sysfs-class-net-cdc_ncm       |   2 +-
 .../ABI/testing/sysfs-devices-platform-ipmi   |   2 +-
 .../testing/sysfs-devices-platform-trackpoint |   2 +-
 Documentation/ABI/testing/sysfs-devices-soc   |   4 +-
 Documentation/ABI/testing/sysfs-module        |   4 +-
 Documentation/PCI/acpi-info.rst               |  26 +-
 .../Data-Structures/Data-Structures.rst       |  52 ++--
 .../Expedited-Grace-Periods.rst               |  40 +--
 .../Tree-RCU-Memory-Ordering.rst              |  10 +-
 .../RCU/Design/Requirements/Requirements.rst  | 126 ++++-----
 Documentation/admin-guide/index.rst           |   2 +-
 Documentation/admin-guide/media/ipu3.rst      |   2 +-
 Documentation/admin-guide/module-signing.rst  |   4 +-
 Documentation/admin-guide/perf/imx-ddr.rst    |   2 +-
 Documentation/admin-guide/pm/intel_idle.rst   |   4 +-
 Documentation/admin-guide/pm/intel_pstate.rst |   4 +-
 Documentation/admin-guide/ras.rst             |  94 +++----
 .../admin-guide/reporting-issues.rst          |  12 +-
 Documentation/admin-guide/sysctl/kernel.rst   |   2 +-
 Documentation/arm64/arm-acpi.rst              |   8 +-
 Documentation/block/data-integrity.rst        |   2 +-
 Documentation/cdrom/cdrom-standard.rst        |  30 +--
 Documentation/dev-tools/testing-overview.rst  |   4 +-
 Documentation/doc-guide/contributing.rst      |   2 +-
 .../driver-api/firmware/other_interfaces.rst  |   2 +-
 Documentation/driver-api/fpga/fpga-bridge.rst |  10 +-
 Documentation/driver-api/fpga/fpga-mgr.rst    |  12 +-
 .../driver-api/fpga/fpga-programming.rst      |   8 +-
 Documentation/driver-api/fpga/fpga-region.rst |  20 +-
 Documentation/driver-api/iio/buffers.rst      |   8 +-
 Documentation/driver-api/iio/hw-consumer.rst  |  10 +-
 .../driver-api/iio/triggered-buffers.rst      |   6 +-
 Documentation/driver-api/iio/triggers.rst     |  10 +-
 Documentation/driver-api/index.rst            |   2 +-
 Documentation/driver-api/ioctl.rst            |   8 +-
 .../media/drivers/sh_mobile_ceu_camera.rst    |   8 +-
 .../driver-api/media/drivers/vidtv.rst        |   4 +-
 .../driver-api/media/drivers/zoran.rst        |   2 +-
 Documentation/driver-api/nvdimm/btt.rst       |   2 +-
 .../driver-api/thermal/cpu-idle-cooling.rst   |  14 +-
 .../driver-api/thermal/intel_powerclamp.rst   |   6 +-
 .../thermal/x86_pkg_temperature_thermal.rst   |   2 +-
 .../fault-injection/nvme-fault-injection.rst  |   2 +-
 Documentation/filesystems/ext4/attributes.rst |  20 +-
 Documentation/filesystems/ext4/bigalloc.rst   |   6 +-
 Documentation/filesystems/ext4/blockgroup.rst |   8 +-
 Documentation/filesystems/ext4/blocks.rst     |   2 +-
 Documentation/filesystems/ext4/directory.rst  |  16 +-
 Documentation/filesystems/ext4/eainode.rst    |   2 +-
 Documentation/filesystems/ext4/inlinedata.rst |   6 +-
 Documentation/filesystems/ext4/inodes.rst     |   6 +-
 Documentation/filesystems/ext4/journal.rst    |   8 +-
 Documentation/filesystems/ext4/mmp.rst        |   2 +-
 .../filesystems/ext4/special_inodes.rst       |   4 +-
 Documentation/filesystems/ext4/super.rst      |  10 +-
 Documentation/filesystems/f2fs.rst            |   6 +-
 .../firmware-guide/acpi/dsd/graph.rst         |   2 +-
 Documentation/firmware-guide/acpi/lpit.rst    |   2 +-
 Documentation/gpu/i915.rst                    |   2 +-
 Documentation/gpu/komeda-kms.rst              |   2 +-
 Documentation/hid/hid-sensor.rst              |  70 ++---
 Documentation/hid/intel-ish-hid.rst           | 246 +++++++++---------
 Documentation/hwmon/ir36021.rst               |   2 +-
 Documentation/hwmon/ltc2992.rst               |   2 +-
 Documentation/hwmon/pm6764tr.rst              |   2 +-
 Documentation/hwmon/tmp103.rst                |   4 +-
 Documentation/index.rst                       |   4 +-
 Documentation/infiniband/tag_matching.rst     |   8 +-
 Documentation/kernel-hacking/hacking.rst      |   2 +-
 Documentation/kernel-hacking/locking.rst      |   2 +-
 Documentation/misc-devices/ibmvmc.rst         |   8 +-
 .../device_drivers/ethernet/intel/i40e.rst    |  12 +-
 .../device_drivers/ethernet/intel/iavf.rst    |   6 +-
 .../device_drivers/ethernet/netronome/nfp.rst |  12 +-
 .../networking/devlink/devlink-dpipe.rst      |   2 +-
 Documentation/networking/scaling.rst          |  18 +-
 Documentation/power/powercap/powercap.rst     | 210 +++++++--------
 Documentation/process/code-of-conduct.rst     |   2 +-
 .../process/kernel-enforcement-statement.rst  |   2 +-
 Documentation/riscv/vm-layout.rst             |   2 +-
 Documentation/scheduler/sched-deadline.rst    |   4 +-
 .../security/keys/trusted-encrypted.rst       |   4 +-
 Documentation/security/tpm/tpm_event_log.rst  |   2 +-
 Documentation/security/tpm/xen-tpmfront.rst   |   2 +-
 .../kernel-api/writing-an-alsa-driver.rst     |  68 ++---
 Documentation/timers/no_hz.rst                |   2 +-
 .../coresight/coresight-etm4x-reference.rst   |  16 +-
 Documentation/usb/ehci.rst                    |   2 +-
 Documentation/usb/gadget_printer.rst          |   2 +-
 Documentation/usb/mass-storage.rst            |  36 +--
 Documentation/usb/mtouchusb.rst               |   2 +-
 Documentation/usb/usb-serial.rst              |   2 +-
 .../media/dvb/audio-set-bypass-mode.rst       |   2 +-
 .../userspace-api/media/dvb/audio.rst         |   2 +-
 .../userspace-api/media/dvb/dmx-fopen.rst     |   2 +-
 .../userspace-api/media/dvb/dmx-fread.rst     |   2 +-
 .../media/dvb/dmx-set-filter.rst              |   2 +-
 .../userspace-api/media/dvb/intro.rst         |   6 +-
 .../userspace-api/media/dvb/video.rst         |   2 +-
 .../userspace-api/media/fdl-appendix.rst      |  64 ++---
 .../userspace-api/media/v4l/biblio.rst        |   8 +-
 .../userspace-api/media/v4l/crop.rst          |  16 +-
 .../userspace-api/media/v4l/dev-decoder.rst   |   6 +-
 .../userspace-api/media/v4l/diff-v4l.rst      |   2 +-
 .../userspace-api/media/v4l/open.rst          |   2 +-
 .../media/v4l/vidioc-cropcap.rst              |   4 +-
 Documentation/virt/kvm/api.rst                |  28 +-
 .../virt/kvm/running-nested-guests.rst        |  12 +-
 Documentation/vm/zswap.rst                    |   4 +-
 Documentation/x86/resctrl.rst                 |   2 +-
 Documentation/x86/sgx.rst                     |   4 +-
 114 files changed, 807 insertions(+), 807 deletions(-)

Comments

Thorsten Leemhuis May 10, 2021, 10:52 a.m. UTC | #1
On 10.05.21 12:26, Mauro Carvalho Chehab wrote:
>
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset, better to use UTF-8 only on cases where it is really
> needed.
> […]
> The remaining patches on series address such cases on *.rst files and 
> inside the Documentation/ABI, using this perl map table in order to do the
> charset conversion:
> 
> my %char_map = (
> […]
> 	0x2013 => '-',		# EN DASH
> 	0x2014 => '-',		# EM DASH

I might be performing bike shedding here, but wouldn't it be better to
replace those two with "--", as explained in
https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens

For EM DASH there seems to be even "---", but I'd say that is a bit too
much.

Or do you fear the extra work, as some lines might then break the
80-character limit?

Ciao, Thorsten
David Woodhouse May 10, 2021, 10:54 a.m. UTC | #2
On Mon, 2021-05-10 at 12:26 +0200, Mauro Carvalho Chehab wrote:
> There are several UTF-8 characters at the Kernel's documentation.
> 
> Several of them were due to the process of converting files from
> DocBook, LaTeX, HTML and Markdown. They were probably introduced
> by the conversion tools used on that time.
> 
> Other UTF-8 characters were added along the time, but they're easily
> replaceable by ASCII chars.
> 
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset, better to use UTF-8 only on cases where it is really
> needed.

No, that is absolutely the wrong approach.

If someone has a local setup which makes bogus assumptions about text
encodings, that is their own mistake.

We don't do them any favours by trying to *hide* it in the common case
so that they don't notice it for longer.

There really isn't much excuse for such brokenness, this far into the
21st century.

Even *before* UTF-8 came along in the final decade of the last
millennium, it was important to know which character set a given piece
of text was encoded in.

In fact it was even *more* important back then, we couldn't just assume
UTF-8 everywhere like we can in modern times.

Git can already do things like CRLF conversion on checking files out to
match local conventions; if you want to teach it to do character set
conversions too then I suppose that might be useful to a few developers
who've fallen through a time warp and still need it. But nobody's ever
bothered before because it just isn't necessary these days.

Please *don't* attempt to address this anachronistic and esoteric
"requirement" by dragging the kernel source back in time by three
decades.
Mauro Carvalho Chehab May 10, 2021, 11:19 a.m. UTC | #3
Em Mon, 10 May 2021 12:52:44 +0200
Thorsten Leemhuis <linux@leemhuis.info> escreveu:

> On 10.05.21 12:26, Mauro Carvalho Chehab wrote:
> >
> > As Linux developers are all around the globe, and not everybody has UTF-8
> > as their default charset, better to use UTF-8 only on cases where it is really
> > needed.
> > […]
> > The remaining patches on series address such cases on *.rst files and 
> > inside the Documentation/ABI, using this perl map table in order to do the
> > charset conversion:
> > 
> > my %char_map = (
> > […]
> > 	0x2013 => '-',		# EN DASH
> > 	0x2014 => '-',		# EM DASH  


> I might be performing bike shedding here, but wouldn't it be better to
> replace those two with "--", as explained in
> https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens
> 
> For EM DASH there seems to be even "---", but I'd say that is a bit too
> much.

Yeah, we can do, instead:

 	0x2013 => '--',		# EN DASH
 	0x2014 => '---',	# EM DASH  

I was actually in doubt about those ;-)

Btw, when producing HTML documentation,  Sphinx should convert:
	-- into EN DASH
and:
	--- into EM DASH

So, the resulting html will be identical.
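
The direction of that conversion can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Sphinx's actual code: the real smart-quotes processing is more involved (for instance, it has to leave literal blocks alone), and the function name `dashes_to_unicode` is made up here:

```python
def dashes_to_unicode(text: str) -> str:
    """Turn '---' into EM DASH and '--' into EN DASH.

    The longer sequence is handled first; otherwise '---' would be
    turned into an EN DASH followed by a stray hyphen.
    """
    return text.replace("---", "\u2014").replace("--", "\u2013")


if __name__ == "__main__":
    print(dashes_to_unicode("lines 37--39 --- see below"))
```

A single "-" is left untouched, which is why the choice between "-" and "--" in the .rst source changes the rendered output.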

> Or do you fear the extra work as some lines then might break the
> 80-character limit then?

No, I suspect that the line size won't be an issue. Some care should
be taken when EN DASH and EM DASH are used inside tables.

Thanks,
Mauro
Mauro Carvalho Chehab May 10, 2021, 11:55 a.m. UTC | #4
Hi David,

Em Mon, 10 May 2021 11:54:02 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Mon, 2021-05-10 at 12:26 +0200, Mauro Carvalho Chehab wrote:
> > There are several UTF-8 characters at the Kernel's documentation.
> > 
> > Several of them were due to the process of converting files from
> > DocBook, LaTeX, HTML and Markdown. They were probably introduced
> > by the conversion tools used on that time.
> > 
> > Other UTF-8 characters were added along the time, but they're easily
> > replaceable by ASCII chars.
> > 
> > As Linux developers are all around the globe, and not everybody has UTF-8
> > as their default charset, better to use UTF-8 only on cases where it is really
> > needed.  
> 
> No, that is absolutely the wrong approach.
> 
> If someone has a local setup which makes bogus assumptions about text
> encodings, that is their own mistake.
> 
> We don't do them any favours by trying to *hide* it in the common case
> so that they don't notice it for longer.
> 
> There really isn't much excuse for such brokenness, this far into the
> 21st century.
> 
> Even *before* UTF-8 came along in the final decade of the last
> millennium, it was important to know which character set a given piece
> of text was encoded in.
> 
> In fact it was even *more* important back then, we couldn't just assume
> UTF-8 everywhere like we can in modern times.
> 
> Git can already do things like CRLF conversion on checking files out to
> match local conventions; if you want to teach it to do character set
> conversions too then I suppose that might be useful to a few developers
> who've fallen through a time warp and still need it. But nobody's ever
> bothered before because it just isn't necessary these days.
> 
> Please *don't* attempt to address this anachronistic and esoteric
> "requirement" by dragging the kernel source back in time by three
> decades.

No. The idea is not to go back three decades.

The goal is just to avoid using UTF-8 where it is not needed. See, the vast
majority of UTF-8 chars are kept:

	- Non-ASCII Latin and Greek chars;
	- Box drawings;
	- arrows;
	- most symbols.

There, it makes perfect sense to keep using UTF-8.

We should keep using UTF-8 in the kernel. This is something that shouldn't
be changed.

---

This patch series does the conversion only where using ASCII makes
more sense than using UTF-8.

See, a number of converted documents ended up with weird characters
like the ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
character doesn't do any good.

Others use NO-BREAK SPACE (U+00A0) instead of 0x20. Harmless, until
someone tries to use grep[1].

[1] try to run:

    $ git grep "CPU 0 has been" Documentation/RCU/

    it will return nothing with current upstream.

    But it will work fine after the series is applied:

    $ git grep "CPU 0 has been" Documentation/RCU/
      Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
      Documentation/RCU/Design/Data-Structures/Data-Structures.rst:|    notices that CPU 0 has been in dyntick idle mode, which qualifies  |
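
    The effect can be reproduced without grep. The Python sketch below
    assumes, purely for illustration, that the NO-BREAK SPACE sits between
    "CPU" and "0"; where it actually sits in the upstream file is not
    stated in this message:

```python
# Assumed sample line: a NO-BREAK SPACE (U+00A0) between "CPU" and "0".
nbsp_line = "| #. CPU\u00a00 has been in dyntick-idle mode for quite some time. |"
needle = "CPU 0 has been"  # typed with an ordinary space (0x20)


def contains(line: str, pattern: str) -> bool:
    """Plain substring search, comparable to a fixed-string grep."""
    return pattern in line


if __name__ == "__main__":
    # The two spaces look identical but are different code points.
    print(contains(nbsp_line, needle))
    # After mapping U+00A0 to 0x20 the same search succeeds.
    print(contains(nbsp_line.replace("\u00a0", " "), needle))
```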

The main point of this series is to replace just the occurrences
where ASCII represents the symbol equally well, i.e. it is limited
to these chars:

	- U+2010 ('‐'): HYPHEN
	- U+00ad ('­'): SOFT HYPHEN
	- U+2013 ('–'): EN DASH
	- U+2014 ('—'): EM DASH

	- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
	- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
	- U+00b4 ('´'): ACUTE ACCENT

	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

	- U+00d7 ('×'): MULTIPLICATION SIGN
	- U+2212 ('−'): MINUS SIGN

	- U+2217 ('∗'): ASTERISK OPERATOR
	  (this one used as a pointer reference like "*foo" on C code
	   example inside a document converted from LaTeX)

	- U+00bb ('»'): RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
	  (this one also used wrongly on an ABI file, meaning '>')

	- U+00a0 (' '): NO-BREAK SPACE
	- U+feff (''): ZERO WIDTH NO-BREAK SPACE

Using the above symbols will just trick tools like grep for no good
reason.

Thanks,
Mauro
Mauro Carvalho Chehab May 10, 2021, 12:27 p.m. UTC | #5
Em Mon, 10 May 2021 13:19:50 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:

> Em Mon, 10 May 2021 12:52:44 +0200
> Thorsten Leemhuis <linux@leemhuis.info> escreveu:
> 
> > On 10.05.21 12:26, Mauro Carvalho Chehab wrote:  
> > >
> > > As Linux developers are all around the globe, and not everybody has UTF-8
> > > as their default charset, better to use UTF-8 only on cases where it is really
> > > needed.
> > > […]
> > > The remaining patches on series address such cases on *.rst files and 
> > > inside the Documentation/ABI, using this perl map table in order to do the
> > > charset conversion:
> > > 
> > > my %char_map = (
> > > […]
> > > 	0x2013 => '-',		# EN DASH
> > > 	0x2014 => '-',		# EM DASH    
> 
> 
> > I might be performing bike shedding here, but wouldn't it be better to
> > replace those two with "--", as explained in
> > https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens
> > 
> > For EM DASH there seems to be even "---", but I'd say that is a bit too
> > much.  
> 
> Yeah, we can do, instead:
> 
>  	0x2013 => '--',		# EN DASH
>  	0x2014 => '---',	# EM DASH  
> 
> I was actually in doubt about those ;-)

As a quick test, I changed my script to use "--" and "---" for
EN/EM DASH chars.

The diff below compares both versions.

There are a couple of places where the result is mathematically wrong,
like this one:

	-operation over a temperature range of -40°C to +125°C.
	+operation over a temperature range of --40°C to +125°C.

In others, it is just a matter of personal taste. My personal opinion
is that, in most cases, a single "-" would be better.
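
One way to avoid the "--40°C" problem would be to special-case a dash that starts a number. The Python sketch below is only a hypothetical guard, not something the script in this series does; it assumes the original text used EN DASH for the minus sign, and the names `_NEGATIVE_NUMBER` and `replace_en_dash` are made up:

```python
import re

# An EN DASH that begins a number: not preceded by a word character
# and immediately followed by a digit, as in "of \u201340\u00b0C".
_NEGATIVE_NUMBER = re.compile(r"(?<!\w)\u2013(?=\d)")


def replace_en_dash(text: str) -> str:
    """Map EN DASH to '-' when it acts as a minus sign, else to '--'."""
    text = _NEGATIVE_NUMBER.sub("-", text)
    return text.replace("\u2013", "--")


if __name__ == "__main__":
    print(replace_en_dash("a temperature range of \u201340\u00b0C to +125\u00b0C"))
    print(replace_en_dash("lines 37\u201339"))
```

Ranges such as "37–39" still become "37--39", because there the dash is preceded by a digit. A heuristic like this would still need manual review, for example inside tables.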

Thanks,
Mauro

diff --git a/Documentation/ABI/testing/sysfs-class-net-cdc_ncm b/Documentation/ABI/testing/sysfs-class-net-cdc_ncm
index 41a1eef0d0e7..469325255887 100644
--- a/Documentation/ABI/testing/sysfs-class-net-cdc_ncm
+++ b/Documentation/ABI/testing/sysfs-class-net-cdc_ncm
@@ -93,7 +93,7 @@ Contact:	Bjørn Mork <bjorn@mork.no>
 Description:
 		- Bit 0: 16-bit NTB supported (set to 1)
 		- Bit 1: 32-bit NTB supported
-		- Bits 2 - 15: reserved (reset to zero; must be ignored by host)
+		- Bits 2 -- 15: reserved (reset to zero; must be ignored by host)
 
 What:		/sys/class/net/<iface>/cdc_ncm/dwNtbInMaxSize
 Date:		May 2014
diff --git a/Documentation/PCI/acpi-info.rst b/Documentation/PCI/acpi-info.rst
index 9b4b04039982..7a75f1f6e73c 100644
--- a/Documentation/PCI/acpi-info.rst
+++ b/Documentation/PCI/acpi-info.rst
@@ -140,8 +140,8 @@ address always corresponds to bus 0, even if the bus range below the bridge
     Extended Address Space Descriptor (.4)
       General Flags: Bit [0] Consumer/Producer:
 
-        * 1 - This device consumes this resource
-        * 0 - This device produces and consumes this resource
+        * 1 -- This device consumes this resource
+        * 0 -- This device produces and consumes this resource
 
 [5] ACPI 6.2, sec 19.6.43:
     ResourceUsage specifies whether the Memory range is consumed by
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
index d76c6bfdc659..34a12b12df51 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -215,7 +215,7 @@ newly arrived RCU callbacks against future grace periods:
    43 }
 
 But the only part of ``rcu_prepare_for_idle()`` that really matters for
-this discussion are lines 37-39. We will therefore abbreviate this
+this discussion are lines 37--39. We will therefore abbreviate this
 function as follows:
 
 .. kernel-figure:: rcu_node-lock.svg
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index a3493b34f3dd..a42dc3cf26bd 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -2354,8 +2354,8 @@ which in practice also means that RCU must have an aggressive
 stress-test suite. This stress-test suite is called ``rcutorture``.
 
 Although the need for ``rcutorture`` was no surprise, the current
-immense popularity of the Linux kernel is posing interesting-and perhaps
-unprecedented-validation challenges. To see this, keep in mind that
+immense popularity of the Linux kernel is posing interesting---and perhaps
+unprecedented---validation challenges. To see this, keep in mind that
 there are well over one billion instances of the Linux kernel running
 today, given Android smartphones, Linux-powered televisions, and
 servers. This number can be expected to increase sharply with the advent
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index b1692643718d..1a6dbda71ad6 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -3,7 +3,7 @@ The Linux kernel user's and administrator's guide
 
 The following is a collection of user-oriented documents that have been
 added to the kernel over time.  There is, as yet, little overall order or
-organization here - this material was not written to be a single, coherent
+organization here --- this material was not written to be a single, coherent
 document!  With luck things will improve quickly over time.
 
 This initial section contains overall information, including the README
diff --git a/Documentation/admin-guide/module-signing.rst b/Documentation/admin-guide/module-signing.rst
index bd1d2fef78e8..0d185ba8b8b5 100644
--- a/Documentation/admin-guide/module-signing.rst
+++ b/Documentation/admin-guide/module-signing.rst
@@ -100,8 +100,8 @@ This has a number of options available:
      ``certs/signing_key.pem`` will disable the autogeneration of signing keys
      and allow the kernel modules to be signed with a key of your choosing.
      The string provided should identify a file containing both a private key
-     and its corresponding X.509 certificate in PEM form, or - on systems where
-     the OpenSSL ENGINE_pkcs11 is functional - a PKCS#11 URI as defined by
+     and its corresponding X.509 certificate in PEM form, or --- on systems where
+     the OpenSSL ENGINE_pkcs11 is functional --- a PKCS#11 URI as defined by
      RFC7512. In the latter case, the PKCS#11 URI should reference both a
      certificate and a private key.
 
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst
index 00445adf8708..66c2c62c1cd4 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/ras.rst
@@ -40,10 +40,10 @@ it causes data loss or system downtime.
 
 Among the monitoring measures, the most usual ones include:
 
-* CPU - detect errors at instruction execution and at L1/L2/L3 caches;
-* Memory - add error correction logic (ECC) to detect and correct errors;
-* I/O - add CRC checksums for transferred data;
-* Storage - RAID, journal file systems, checksums,
+* CPU -- detect errors at instruction execution and at L1/L2/L3 caches;
+* Memory -- add error correction logic (ECC) to detect and correct errors;
+* I/O -- add CRC checksums for transferred data;
+* Storage -- RAID, journal file systems, checksums,
   Self-Monitoring, Analysis and Reporting Technology (SMART).
 
 By monitoring the number of occurrences of error detections, it is possible
diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst
index f691930e13c0..af699015d266 100644
--- a/Documentation/admin-guide/reporting-issues.rst
+++ b/Documentation/admin-guide/reporting-issues.rst
@@ -824,7 +824,7 @@ and look a little lower at the table. At its top you'll see a line starting with
 mainline, which most of the time will point to a pre-release with a version
 number like '5.8-rc2'. If that's the case, you'll want to use this mainline
 kernel for testing, as that where all fixes have to be applied first. Do not let
-that 'rc' scare you, these 'development kernels' are pretty reliable - and you
+that 'rc' scare you, these 'development kernels' are pretty reliable --- and you
 made a backup, as you were instructed above, didn't you?
 
 In about two out of every nine to ten weeks, mainline might point you to a
@@ -866,7 +866,7 @@ How to obtain a fresh Linux kernel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 **Using a pre-compiled kernel**: This is often the quickest, easiest, and safest
-way for testing - especially is you are unfamiliar with the Linux kernel. The
+way for testing --- especially is you are unfamiliar with the Linux kernel. The
 problem: most of those shipped by distributors or add-on repositories are build
 from modified Linux sources. They are thus not vanilla and therefore often
 unsuitable for testing and issue reporting: the changes might cause the issue
@@ -1345,7 +1345,7 @@ about it to a chatroom or forum you normally hang out.
 
 **Be patient**: If you are really lucky you might get a reply to your report
 within a few hours. But most of the time it will take longer, as maintainers
-are scattered around the globe and thus might be in a different time zone - one
+are scattered around the globe and thus might be in a different time zone -- one
 where they already enjoy their night away from keyboard.
 
 In general, kernel developers will take one to five business days to respond to
@@ -1388,7 +1388,7 @@ Here are your duties in case you got replies to your report:
 
 **Check who you deal with**: Most of the time it will be the maintainer or a
 developer of the particular code area that will respond to your report. But as
-issues are normally reported in public it could be anyone that's replying -
+issues are normally reported in public it could be anyone that's replying ---
 including people that want to help, but in the end might guide you totally off
 track with their questions or requests. That rarely happens, but it's one of
 many reasons why it's wise to quickly run an internet search to see who you're
@@ -1716,7 +1716,7 @@ Maybe their test hardware broke, got replaced by something more fancy, or is so
 old that it's something you don't find much outside of computer museums
 anymore. Sometimes developer stops caring for their code and Linux at all, as
 something different in their life became way more important. In some cases
-nobody is willing to take over the job as maintainer - and nobody can be forced
+nobody is willing to take over the job as maintainer -- and nobody can be forced
 to, as contributing to the Linux kernel is done on a voluntary basis. Abandoned
 drivers nevertheless remain in the kernel: they are still useful for people and
 removing would be a regression.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 743a7c70fd83..639dd58518ca 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1285,7 +1285,7 @@ The soft lockup detector monitors CPUs for threads that are hogging the CPUs
 without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
 from running. The mechanism depends on the CPUs ability to respond to timer
 interrupts which are needed for the 'watchdog/N' threads to be woken up by
-the watchdog timer function, otherwise the NMI watchdog - if enabled - can
+the watchdog timer function, otherwise the NMI watchdog --- if enabled --- can
 detect a hard lockup condition.
 
 
diff --git a/Documentation/dev-tools/testing-overview.rst b/Documentation/dev-tools/testing-overview.rst
index 8adffc26a2ec..381c571eb52c 100644
--- a/Documentation/dev-tools/testing-overview.rst
+++ b/Documentation/dev-tools/testing-overview.rst
@@ -18,8 +18,8 @@ frameworks. These both provide infrastructure to help make running tests and
 groups of tests easier, as well as providing helpers to aid in writing new
 tests.
 
-If you're looking to verify the behaviour of the Kernel - particularly specific
-parts of the kernel - then you'll want to use KUnit or kselftest.
+If you're looking to verify the behaviour of the Kernel --- particularly specific
+parts of the kernel --- then you'll want to use KUnit or kselftest.
 
 
 The Difference Between KUnit and kselftest
diff --git a/Documentation/doc-guide/contributing.rst b/Documentation/doc-guide/contributing.rst
index c2d709467c68..ac5c9f1d2311 100644
--- a/Documentation/doc-guide/contributing.rst
+++ b/Documentation/doc-guide/contributing.rst
@@ -76,7 +76,7 @@ comments that look like this::
 
 The problem is the missing "*", which confuses the build system's
 simplistic idea of what C comment blocks look like.  This problem had been
-present since that comment was added in 2016 - a full four years.  Fixing
+present since that comment was added in 2016 --- a full four years.  Fixing
 it was a matter of adding the missing asterisks.  A quick look at the
 history for that file showed what the normal format for subject lines is,
 and ``scripts/get_maintainer.pl`` told me who should receive it.  The
diff --git a/Documentation/driver-api/fpga/fpga-bridge.rst b/Documentation/driver-api/fpga/fpga-bridge.rst
index 8d650b4e2ce6..1d6e910c27df 100644
--- a/Documentation/driver-api/fpga/fpga-bridge.rst
+++ b/Documentation/driver-api/fpga/fpga-bridge.rst
@@ -4,11 +4,11 @@ FPGA Bridge
 API to implement a new FPGA bridge
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-* struct fpga_bridge - The FPGA Bridge structure
-* struct fpga_bridge_ops - Low level Bridge driver ops
-* devm_fpga_bridge_create() - Allocate and init a bridge struct
-* fpga_bridge_register() - Register a bridge
-* fpga_bridge_unregister() - Unregister a bridge
+* struct fpga_bridge --- The FPGA Bridge structure
+* struct fpga_bridge_ops --- Low level Bridge driver ops
+* devm_fpga_bridge_create() --- Allocate and init a bridge struct
+* fpga_bridge_register() --- Register a bridge
+* fpga_bridge_unregister() --- Unregister a bridge
 
 .. kernel-doc:: include/linux/fpga/fpga-bridge.h
    :functions: fpga_bridge
diff --git a/Documentation/driver-api/fpga/fpga-mgr.rst b/Documentation/driver-api/fpga/fpga-mgr.rst
index 4d926b452cb3..272161361c6a 100644
--- a/Documentation/driver-api/fpga/fpga-mgr.rst
+++ b/Documentation/driver-api/fpga/fpga-mgr.rst
@@ -101,12 +101,12 @@ in state.
 API for implementing a new FPGA Manager driver
 ----------------------------------------------
 
-* ``fpga_mgr_states`` -  Values for :c:expr:`fpga_manager->state`.
-* struct fpga_manager -  the FPGA manager struct
-* struct fpga_manager_ops -  Low level FPGA manager driver ops
-* devm_fpga_mgr_create() -  Allocate and init a manager struct
-* fpga_mgr_register() -  Register an FPGA manager
-* fpga_mgr_unregister() -  Unregister an FPGA manager
+* ``fpga_mgr_states`` ---  Values for :c:expr:`fpga_manager->state`.
+* struct fpga_manager ---  the FPGA manager struct
+* struct fpga_manager_ops ---  Low level FPGA manager driver ops
+* devm_fpga_mgr_create() ---  Allocate and init a manager struct
+* fpga_mgr_register() ---  Register an FPGA manager
+* fpga_mgr_unregister() ---  Unregister an FPGA manager
 
 .. kernel-doc:: include/linux/fpga/fpga-mgr.h
    :functions: fpga_mgr_states
diff --git a/Documentation/driver-api/fpga/fpga-programming.rst b/Documentation/driver-api/fpga/fpga-programming.rst
index fb4da4240e96..adc725855bad 100644
--- a/Documentation/driver-api/fpga/fpga-programming.rst
+++ b/Documentation/driver-api/fpga/fpga-programming.rst
@@ -84,10 +84,10 @@ will generate that list.  Here's some sample code of what to do next::
 API for programming an FPGA
 ---------------------------
 
-* fpga_region_program_fpga() -  Program an FPGA
-* fpga_image_info() -  Specifies what FPGA image to program
-* fpga_image_info_alloc() -  Allocate an FPGA image info struct
-* fpga_image_info_free() -  Free an FPGA image info struct
+* fpga_region_program_fpga() ---  Program an FPGA
+* fpga_image_info() ---  Specifies what FPGA image to program
+* fpga_image_info_alloc() ---  Allocate an FPGA image info struct
+* fpga_image_info_free() ---  Free an FPGA image info struct
 
 .. kernel-doc:: drivers/fpga/fpga-region.c
    :functions: fpga_region_program_fpga
diff --git a/Documentation/driver-api/fpga/fpga-region.rst b/Documentation/driver-api/fpga/fpga-region.rst
index 2636a27c11b2..6c0c2541de04 100644
--- a/Documentation/driver-api/fpga/fpga-region.rst
+++ b/Documentation/driver-api/fpga/fpga-region.rst
@@ -45,19 +45,19 @@ An example of usage can be seen in the probe function of [#f2]_.
 API to add a new FPGA region
 ----------------------------
 
-* struct fpga_region - The FPGA region struct
-* devm_fpga_region_create() - Allocate and init a region struct
-* fpga_region_register() -  Register an FPGA region
-* fpga_region_unregister() -  Unregister an FPGA region
+* struct fpga_region --- The FPGA region struct
+* devm_fpga_region_create() --- Allocate and init a region struct
+* fpga_region_register() ---  Register an FPGA region
+* fpga_region_unregister() ---  Unregister an FPGA region
 
 The FPGA region's probe function will need to get a reference to the FPGA
 Manager it will be using to do the programming.  This usually would happen
 during the region's probe function.
 
-* fpga_mgr_get() - Get a reference to an FPGA manager, raise ref count
-* of_fpga_mgr_get() -  Get a reference to an FPGA manager, raise ref count,
+* fpga_mgr_get() --- Get a reference to an FPGA manager, raise ref count
+* of_fpga_mgr_get() ---  Get a reference to an FPGA manager, raise ref count,
   given a device node.
-* fpga_mgr_put() - Put an FPGA manager
+* fpga_mgr_put() --- Put an FPGA manager
 
 The FPGA region will need to specify which bridges to control while programming
 the FPGA.  The region driver can build a list of bridges during probe time
@@ -66,11 +66,11 @@ the list of bridges to program just before programming
 (:c:expr:`fpga_region->get_bridges`).  The FPGA bridge framework supplies the
 following APIs to handle building or tearing down that list.
 
-* fpga_bridge_get_to_list() - Get a ref of an FPGA bridge, add it to a
+* fpga_bridge_get_to_list() --- Get a ref of an FPGA bridge, add it to a
   list
-* of_fpga_bridge_get_to_list() - Get a ref of an FPGA bridge, add it to a
+* of_fpga_bridge_get_to_list() --- Get a ref of an FPGA bridge, add it to a
   list, given a device node
-* fpga_bridges_put() - Given a list of bridges, put them
+* fpga_bridges_put() --- Given a list of bridges, put them
 
 .. kernel-doc:: include/linux/fpga/fpga-region.h
    :functions: fpga_region
diff --git a/Documentation/driver-api/iio/buffers.rst b/Documentation/driver-api/iio/buffers.rst
index 24569ff0cf79..906dfc10b7ef 100644
--- a/Documentation/driver-api/iio/buffers.rst
+++ b/Documentation/driver-api/iio/buffers.rst
@@ -2,11 +2,11 @@
 Buffers
 =======
 
-* struct iio_buffer - general buffer structure
-* :c:func:`iio_validate_scan_mask_onehot` - Validates that exactly one channel
+* struct iio_buffer --- general buffer structure
+* :c:func:`iio_validate_scan_mask_onehot` --- Validates that exactly one channel
   is selected
-* :c:func:`iio_buffer_get` - Grab a reference to the buffer
-* :c:func:`iio_buffer_put` - Release the reference to the buffer
+* :c:func:`iio_buffer_get` --- Grab a reference to the buffer
+* :c:func:`iio_buffer_put` --- Release the reference to the buffer
 
 The Industrial I/O core offers a way for continuous data capture based on a
 trigger source. Multiple data channels can be read at once from
diff --git a/Documentation/driver-api/iio/hw-consumer.rst b/Documentation/driver-api/iio/hw-consumer.rst
index 75986358fc02..06969fde2086 100644
--- a/Documentation/driver-api/iio/hw-consumer.rst
+++ b/Documentation/driver-api/iio/hw-consumer.rst
@@ -8,11 +8,11 @@ software buffer for data. The implementation can be found under
 :file:`drivers/iio/buffer/hw-consumer.c`
 
 
-* struct iio_hw_consumer - Hardware consumer structure
-* :c:func:`iio_hw_consumer_alloc` - Allocate IIO hardware consumer
-* :c:func:`iio_hw_consumer_free` - Free IIO hardware consumer
-* :c:func:`iio_hw_consumer_enable` - Enable IIO hardware consumer
-* :c:func:`iio_hw_consumer_disable` - Disable IIO hardware consumer
+* struct iio_hw_consumer --- Hardware consumer structure
+* :c:func:`iio_hw_consumer_alloc` --- Allocate IIO hardware consumer
+* :c:func:`iio_hw_consumer_free` --- Free IIO hardware consumer
+* :c:func:`iio_hw_consumer_enable` --- Enable IIO hardware consumer
+* :c:func:`iio_hw_consumer_disable` --- Disable IIO hardware consumer
 
 
 HW consumer setup
diff --git a/Documentation/driver-api/iio/triggered-buffers.rst b/Documentation/driver-api/iio/triggered-buffers.rst
index 7c37b2afa1ad..49831ff466c5 100644
--- a/Documentation/driver-api/iio/triggered-buffers.rst
+++ b/Documentation/driver-api/iio/triggered-buffers.rst
@@ -7,10 +7,10 @@ Now that we know what buffers and triggers are let's see how they work together.
 IIO triggered buffer setup
 ==========================
 
-* :c:func:`iio_triggered_buffer_setup` - Setup triggered buffer and pollfunc
-* :c:func:`iio_triggered_buffer_cleanup` - Free resources allocated by
+* :c:func:`iio_triggered_buffer_setup` --- Setup triggered buffer and pollfunc
+* :c:func:`iio_triggered_buffer_cleanup` --- Free resources allocated by
   :c:func:`iio_triggered_buffer_setup`
-* struct iio_buffer_setup_ops - buffer setup related callbacks
+* struct iio_buffer_setup_ops --- buffer setup related callbacks
 
 A typical triggered buffer setup looks like this::
 
diff --git a/Documentation/driver-api/iio/triggers.rst b/Documentation/driver-api/iio/triggers.rst
index a5d1fc15747c..5b3d475bc871 100644
--- a/Documentation/driver-api/iio/triggers.rst
+++ b/Documentation/driver-api/iio/triggers.rst
@@ -2,11 +2,11 @@
 Triggers
 ========
 
-* struct iio_trigger - industrial I/O trigger device
-* :c:func:`devm_iio_trigger_alloc` - Resource-managed iio_trigger_alloc
-* :c:func:`devm_iio_trigger_register` - Resource-managed iio_trigger_register
+* struct iio_trigger --- industrial I/O trigger device
+* :c:func:`devm_iio_trigger_alloc` --- Resource-managed iio_trigger_alloc
+* :c:func:`devm_iio_trigger_register` --- Resource-managed iio_trigger_register
   iio_trigger_unregister
-* :c:func:`iio_trigger_validate_own_device` - Check if a trigger and IIO
+* :c:func:`iio_trigger_validate_own_device` --- Check if a trigger and IIO
   device belong to the same device
 
 In many situations it is useful for a driver to be able to capture data based
@@ -63,7 +63,7 @@ Let's see a simple example of how to setup a trigger to be used by a driver::
 IIO trigger ops
 ===============
 
-* struct iio_trigger_ops - operations structure for an iio_trigger.
+* struct iio_trigger_ops --- operations structure for an iio_trigger.
 
 Notice that a trigger has a set of operations attached:
 
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 29eb9230b7a9..e07e0d39c7f0 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -4,7 +4,7 @@ The Linux driver implementer's API guide
 
 The kernel offers a wide variety of interfaces to support the development
 of device drivers.  This document is an only somewhat organized collection
-of some of those interfaces - it will hopefully get better over time!  The
+of some of those interfaces --- it will hopefully get better over time!  The
 available subsections can be seen below.
 
 .. class:: toc-title
diff --git a/Documentation/driver-api/media/drivers/vidtv.rst b/Documentation/driver-api/media/drivers/vidtv.rst
index abb454302ac5..c3821d82df17 100644
--- a/Documentation/driver-api/media/drivers/vidtv.rst
+++ b/Documentation/driver-api/media/drivers/vidtv.rst
@@ -458,8 +458,8 @@ Add a way to test video
 
 Currently, vidtv can only encode PCM audio. It would be great to implement
 a barebones version of MPEG-2 video encoding so we can also test video. The
-first place to look into is *ISO 13818-2: Information technology - Generic
-coding of moving pictures and associated audio information - Part 2: Video*,
+first place to look into is *ISO 13818-2: Information technology --- Generic
+coding of moving pictures and associated audio information --- Part 2: Video*,
 which covers the encoding of compressed video in MPEG Transport Streams.
 
 This might optionally use the Video4Linux2 Test Pattern Generator, v4l2-tpg,
diff --git a/Documentation/driver-api/nvdimm/btt.rst b/Documentation/driver-api/nvdimm/btt.rst
index dd91a495e02e..1d2d9cd40def 100644
--- a/Documentation/driver-api/nvdimm/btt.rst
+++ b/Documentation/driver-api/nvdimm/btt.rst
@@ -91,7 +91,7 @@ Bit      Description
 	   0  0	  Initial state. Reads return zeroes; Premap = Postmap
 	   0  1	  Zero state: Reads return zeroes
 	   1  0	  Error state: Reads fail; Writes clear 'E' bit
-	   1  1	  Normal Block - has valid postmap
+	   1  1	  Normal Block -- has valid postmap
 	   == ==  ====================================================
 
 29 - 0	 Mappings to internal 'postmap' blocks
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index 19d2cf477fc3..9b0e9abf8f88 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -42,7 +42,7 @@ areas on disk for fast writing, we divide  the log into segments and use a
 segment cleaner to compress the live information from heavily fragmented
 segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
 implementation of a log-structured file system", ACM Trans. Computer Systems
-10, 1, 26-52.
+10, 1, 26--52.
 
 Wandering Tree Problem
 ----------------------
diff --git a/Documentation/hwmon/tmp103.rst b/Documentation/hwmon/tmp103.rst
index b3ef81475cf8..051282bd88b7 100644
--- a/Documentation/hwmon/tmp103.rst
+++ b/Documentation/hwmon/tmp103.rst
@@ -21,10 +21,10 @@ Description
 The TMP103 is a digital output temperature sensor in a four-ball
 wafer chip-scale package (WCSP). The TMP103 is capable of reading
 temperatures to a resolution of 1°C. The TMP103 is specified for
-operation over a temperature range of -40°C to +125°C.
+operation over a temperature range of --40°C to +125°C.
 
 Resolution: 8 Bits
-Accuracy: ±1°C Typ (-10°C to +100°C)
+Accuracy: ±1°C Typ (--10°C to +100°C)
 
 The driver provides the common sysfs-interface for temperatures (see
 Documentation/hwmon/sysfs-interface.rst under Temperatures).
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 11cd806ea3a4..7ae88aa57d98 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -30,7 +30,7 @@ tree, as well as links to the full license text.
 User-oriented documentation
 ---------------------------
 
-The following manuals are written for *users* of the kernel - those who are
+The following manuals are written for *users* of the kernel --- those who are
 trying to get it to work optimally on a given system.
 
 .. toctree::
@@ -90,7 +90,7 @@ Kernel API documentation
 These books get into the details of how specific kernel subsystems work
 from the point of view of a kernel developer.  Much of the information here
 is taken directly from the kernel source, with supplemental material added
-as needed (or at least as we managed to add it - probably *not* all that is
+as needed (or at least as we managed to add it --- probably *not* all that is
 needed).
 
 .. toctree::
diff --git a/Documentation/infiniband/tag_matching.rst b/Documentation/infiniband/tag_matching.rst
index b89528a31d10..2c26f76e43f9 100644
--- a/Documentation/infiniband/tag_matching.rst
+++ b/Documentation/infiniband/tag_matching.rst
@@ -8,8 +8,8 @@ match the following source and destination parameters:
 
 *	Communicator
 *	User tag - wild card may be specified by the receiver
-*	Source rank - wild car may be specified by the receiver
-*	Destination rank - wild
+*	Source rank -- wild car may be specified by the receiver
+*	Destination rank -- wild
 
 The ordering rules require that when more than one pair of send and receive
 message envelopes may match, the pair that includes the earliest posted-send
diff --git a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index 64024c77c9ca..e3e52b0e6b5e 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -173,7 +173,7 @@ Director rule is added from ethtool (Sideband filter), ATR is turned off by the
 driver. To re-enable ATR, the sideband can be disabled with the ethtool -K
 option. For example::
 
-  ethtool -K [adapter] ntuple [off|on]
+  ethtool --K [adapter] ntuple [off|on]
 
 If sideband is re-enabled after ATR is re-enabled, ATR remains enabled until a
 TCP-IP flow is added. When all TCP-IP sideband rules are deleted, ATR is
@@ -688,7 +688,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
 Totals must be equal or less than port speed.
 
 For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar -n DEV [interval] [number of samples]
+monitoring tools such as ifstat or sar --n DEV [interval] [number of samples]
 
 2. Enable HW TC offload on interface::
 
diff --git a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
index 25e98494b385..44d2f85738b1 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -179,7 +179,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
 Totals must be equal or less than port speed.
 
 For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar -n DEV [interval] [number of samples]
+monitoring tools such as ifstat or sar --n DEV [interval] [number of samples]
 
 NOTE:
   Setting up channels via ethtool (ethtool -L) is not supported when the
diff --git a/Documentation/riscv/vm-layout.rst b/Documentation/riscv/vm-layout.rst
index 545f8ab51f1a..05615b3021bb 100644
--- a/Documentation/riscv/vm-layout.rst
+++ b/Documentation/riscv/vm-layout.rst
@@ -22,7 +22,7 @@ RISC-V Linux Kernel 64bit
 =========================
 
 The RISC-V privileged architecture document states that the 64bit addresses
-"must have bits 63-48 all equal to bit 47, or else a page-fault exception will
+"must have bits 63--48 all equal to bit 47, or else a page-fault exception will
 occur.": that splits the virtual address space into 2 halves separated by a very
 big hole, the lower half is where the userspace resides, the upper half is where
 the RISC-V Linux Kernel resides.
diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
index 0ff353ecf24e..b261ec2ab2ef 100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -515,7 +515,7 @@ Deadline Task Scheduling
       pp 760-768, 2005.
   10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
        Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
-       vol. 25, no. 2-3, pp. 187-205, 2003.
+       vol. 25, no. 2--3, pp. 187--205, 2003.
   11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
        Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
        http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
diff --git a/Documentation/userspace-api/media/v4l/biblio.rst b/Documentation/userspace-api/media/v4l/biblio.rst
index 6e07b78bd39d..7b8e6738ff9e 100644
--- a/Documentation/userspace-api/media/v4l/biblio.rst
+++ b/Documentation/userspace-api/media/v4l/biblio.rst
@@ -51,7 +51,7 @@ ISO 13818-1
 ===========
 
 
-:title:     ITU-T Rec. H.222.0 | ISO/IEC 13818-1 "Information technology - Generic coding of moving pictures and associated audio information: Systems"
+:title:     ITU-T Rec. H.222.0 | ISO/IEC 13818-1 "Information technology --- Generic coding of moving pictures and associated audio information: Systems"
 
 :author:    International Telecommunication Union (http://www.itu.ch), International Organisation for Standardisation (http://www.iso.ch)
 
@@ -61,7 +61,7 @@ ISO 13818-2
 ===========
 
 
-:title:     ITU-T Rec. H.262 | ISO/IEC 13818-2 "Information technology - Generic coding of moving pictures and associated audio information: Video"
+:title:     ITU-T Rec. H.262 | ISO/IEC 13818-2 "Information technology --- Generic coding of moving pictures and associated audio information: Video"
 
 :author:    International Telecommunication Union (http://www.itu.ch), International Organisation for Standardisation (http://www.iso.ch)
 
@@ -150,7 +150,7 @@ ITU-T.81
 ========
 
 
-:title:     ITU-T Recommendation T.81 "Information Technology - Digital Compression and Coding of Continous-Tone Still Images - Requirements and Guidelines"
+:title:     ITU-T Recommendation T.81 "Information Technology --- Digital Compression and Coding of Continous-Tone Still Images --- Requirements and Guidelines"
 
 :author:    International Telecommunication Union (http://www.itu.int)
 
@@ -310,7 +310,7 @@ ISO 12232:2006
 ==============
 
 
-:title:     Photography - Digital still cameras - Determination of exposure index, ISO speed ratings, standard output sensitivity, and recommended exposure index
+:title:     Photography --- Digital still cameras --- Determination of exposure index, ISO speed ratings, standard output sensitivity, and recommended exposure index
 
 :author:    International Organization for Standardization (http://www.iso.org)
 
diff --git a/Documentation/virt/kvm/running-nested-guests.rst b/Documentation/virt/kvm/running-nested-guests.rst
index e9dff3fea055..8b83b86560da 100644
--- a/Documentation/virt/kvm/running-nested-guests.rst
+++ b/Documentation/virt/kvm/running-nested-guests.rst
@@ -26,12 +26,12 @@ this document is built on this example)::
 
 Terminology:
 
-- L0 - level-0; the bare metal host, running KVM
+- L0 -- level-0; the bare metal host, running KVM
 
-- L1 - level-1 guest; a VM running on L0; also called the "guest
+- L1 -- level-1 guest; a VM running on L0; also called the "guest
   hypervisor", as it itself is capable of running KVM.
 
-- L2 - level-2 guest; a VM running on L1, this is the "nested guest"
+- L2 -- level-2 guest; a VM running on L1, this is the "nested guest"
 
 .. note:: The above diagram is modelled after the x86 architecture;
           s390x, ppc64 and other architectures are likely to have
@@ -39,7 +39,7 @@ Terminology:
 
           For example, s390x always has an LPAR (LogicalPARtition)
           hypervisor running on bare metal, adding another layer and
-          resulting in at least four levels in a nested setup - L0 (bare
+          resulting in at least four levels in a nested setup --- L0 (bare
           metal, running the LPAR hypervisor), L1 (host hypervisor), L2
           (guest hypervisor), L3 (nested guest).
 
@@ -167,11 +167,11 @@ Enabling "nested" (s390x)
     $ modprobe kvm nested=1
 
 .. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
-          with the ``nested`` paramter - i.e. to be able to enable
+          with the ``nested`` paramter --- i.e. to be able to enable
           ``nested``, the ``hpage`` parameter *must* be disabled.
 
 2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
-   feature - with QEMU, this can be done by using "host passthrough"
+   feature --- with QEMU, this can be done by using "host passthrough"
    (via the command-line ``-cpu host``).
 
 3. Now the KVM module can be loaded in the L1 (guest hypervisor)::
Edward Cree May 10, 2021, 1:16 p.m. UTC | #6
On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> The main point on this series is to replace just the occurrences
> where ASCII represents the symbol equally well

> 	- U+2014 ('—'): EM DASH
Em dash is not the same thing as hyphen-minus, and the latter does not
 serve 'equally well'.  People use em dashes because — even in
 monospace fonts — they make text easier to read and comprehend, when
 used correctly.
I accept that some of the other distinctions — like en dashes — are
 needlessly pedantic (though I don't doubt there is someone out there
 who will gladly defend them with the same fervour with which I argue
 for the em dash) and I wouldn't take the trouble to use them myself;
 but I think there is a reasonable assumption that when someone goes
 to the effort of using a Unicode punctuation mark that is semantic
 (rather than merely typographical), they probably had a reason for
 doing so.

> 	- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> 	- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> 	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> 	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
(These are purely typographic, I have no problem with dumping them.)

> 	- U+00d7 ('×'): MULTIPLICATION SIGN
Presumably this is appearing in mathematical formulae, in which case
 changing it to 'x' loses semantic information.

> Using the above symbols will just trick tools like grep for no good
> reason.
NBSP, sure.  That one's probably an artefact of some document format
 conversion somewhere along the line, anyway.
But what kinds of things with × or — in are going to be grept for?

If there are em dashes lying around that semantically _should_ be
 hyphen-minus (one of your patches I've seen, for instance, fixes an
 *en* dash moonlighting as the option character in an `ethtool`
 command line), then sure, convert them.
But any time someone is using a Unicode character to *express
 semantics*, even if you happen to think the semantic distinction
 involved is a pedantic or unimportant one, I think you need an
 explicit grep case to justify ASCIIfying it.

-ed
Mauro Carvalho Chehab May 10, 2021, 1:38 p.m. UTC | #7
On Mon, 10 May 2021 14:16:16 +0100
Edward Cree <ecree.xilinx@gmail.com> wrote:

> On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> > The main point on this series is to replace just the occurrences
> > where ASCII represents the symbol equally well  
> 
> > 	- U+2014 ('—'): EM DASH  
> Em dash is not the same thing as hyphen-minus, and the latter does not
>  serve 'equally well'.  People use em dashes because — even in
>  monospace fonts — they make text easier to read and comprehend, when
>  used correctly.

True, but if you look at the diff, in several places, IMHO, a single
hyphen would make more sense. Maybe those places came from a converted
doc.

> I accept that some of the other distinctions — like en dashes — are
>  needlessly pedantic (though I don't doubt there is someone out there
>  who will gladly defend them with the same fervour with which I argue
>  for the em dash) and I wouldn't take the trouble to use them myself;
>  but I think there is a reasonable assumption that when someone goes
>  to the effort of using a Unicode punctuation mark that is semantic
>  (rather than merely typographical), they probably had a reason for
>  doing so.
> 
> > 	- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> > 	- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> > 	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> > 	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK  
> (These are purely typographic, I have no problem with dumping them.)
> 
> > 	- U+00d7 ('×'): MULTIPLICATION SIGN  
> Presumably this is appearing in mathematical formulae, in which case
>  changing it to 'x' loses semantic information.
> 
> > Using the above symbols will just trick tools like grep for no good
> > reason.  
> NBSP, sure.  That one's probably an artefact of some document format
>  conversion somewhere along the line, anyway.
> But what kinds of things with × or — in are going to be grept for?

Actually, in almost all places, those aren't used inside math formulae;
instead, they describe video resolutions:

	$ git grep × Documentation/
	Documentation/devicetree/bindings/display/panel/asus,z00t-tm5p5-nt35596.yaml:title: ASUS Z00T TM5P5 NT35596 5.5" 1080×1920 LCD Panel
	Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml:        # LG ACX467AKM-7 4.95" 1080×1920 LCD Panel
	Documentation/devicetree/bindings/sound/tlv320adcx140.yaml:      1 - Mic bias is set to VREF × 1.096
	Documentation/userspace-api/media/v4l/crop.rst:of 16 × 16 pixels. The source cropping rectangle is set to defaults,
	Documentation/userspace-api/media/v4l/crop.rst:which are also the upper limit in this example, of 640 × 400 pixels at
	Documentation/userspace-api/media/v4l/crop.rst:offset 0, 0. An application requests an image size of 300 × 225 pixels,
	Documentation/userspace-api/media/v4l/crop.rst:The driver sets the image size to the closest possible values 304 × 224,
	Documentation/userspace-api/media/v4l/crop.rst:is 608 × 224 (224 × 2:1 would exceed the limit 400). The offset 0, 0 is
	Documentation/userspace-api/media/v4l/crop.rst:rectangle of 608 × 456 pixels. The present scaling factors limit
	Documentation/userspace-api/media/v4l/crop.rst:cropping to 640 × 384, so the driver returns the cropping size 608 × 384
	Documentation/userspace-api/media/v4l/crop.rst:and adjusts the image size to closest possible 304 × 192.
	Documentation/userspace-api/media/v4l/diff-v4l.rst:size bitmap of 1024 × 625 bits. Struct :c:type:`v4l2_window`
	Documentation/userspace-api/media/v4l/vidioc-cropcap.rst:       Assuming pixel aspect 1/1 this could be for example a 640 × 480
	Documentation/userspace-api/media/v4l/vidioc-cropcap.rst:       rectangle for NTSC, a 768 × 576 rectangle for PAL and SECAM
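
To make the grep concern concrete, here is a minimal sketch (not part of the
series). It assumes Python's `re` module as a stand-in for `git grep -E`, and
a resolution-style pattern written with an ASCII 'x'; the sample strings are
modeled on documentation lines like the ones listed above. It shows that such
a pattern does not match resolutions written with U+00D7:

```python
import re

# Assumption: a resolution search written with an ASCII 'x', as someone
# typing on a plain keyboard would likely write it.
RESOLUTION = re.compile(r"\b[1-9][0-9]+\s*x\s*[0-9]+\b")

# Illustrative samples modeled on the documentation text.
samples = [
    "a 768 \u00d7 576 rectangle for PAL and SECAM",  # U+00D7 MULTIPLICATION SIGN
    "a 768 x 576 rectangle for PAL and SECAM",       # ASCII 'x'
    "commonly used screen resolutions (800x600, 1024x768)",
]


def find_resolutions(text):
    """Return every substring of text that the ASCII-only pattern matches."""
    return RESOLUTION.findall(text)


if __name__ == "__main__":
    for line in samples:
        print(repr(line), "->", find_resolutions(line))
```

The first sample yields no match, while the other two do, so a reader who
searches with 'x' would silently miss the lines that use the multiplication
sign.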

It is far more likely that, if someone wants to grep, they would be
doing something like this in order to find video resolutions:

	$ git grep -E "\b[1-9][0-9]+\s*x\s*[0-9]+\b" Documentation/
	Documentation/ABI/obsolete/sysfs-driver-hid-roccat-koneplus:Description:        When read the mouse returns a 30x30 pixel image of the
	Documentation/ABI/obsolete/sysfs-driver-hid-roccat-konepure:Description:        When read the mouse returns a 30x30 pixel image of the
	Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7:               Provides access to the binary "24x7 catalog" provided by the
	Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7:               https://raw.githubusercontent.com/jmesmon/catalog-24x7/master/hv-24x7-catalog.h
	Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7:               Exposes the "version" field of the 24x7 catalog. This is also
	Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7:               HCALLs to retrieve hv-24x7 pmu event counter data.
	Documentation/ABI/testing/sysfs-bus-vfio-mdev:          "2 heads, 512M FB, 2560x1600 maximum resolution"
	Documentation/ABI/testing/sysfs-driver-wacom:           of the device. The image is a 64x32 pixel 4-bit gray image. The
	Documentation/ABI/testing/sysfs-driver-wacom:           1024 byte binary is split up into 16x 64 byte chunks. Each 64
	Documentation/ABI/testing/sysfs-driver-wacom:           image has to contain 256 bytes (64x32 px 1 bit colour).
	Documentation/admin-guide/edid.rst:commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
	Documentation/admin-guide/edid.rst:1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
	Documentation/admin-guide/edid.rst:If you want to create your own EDID file, copy the file 1024x768.S,
	Documentation/admin-guide/kernel-parameters.txt:                        edid/1024x768.bin, edid/1280x1024.bin,
	Documentation/admin-guide/kernel-parameters.txt:                        edid/1680x1050.bin, or edid/1920x1080.bin is given
	Documentation/admin-guide/kernel-parameters.txt:                        2 - The VGA Shield is attached (1024x768)
	Documentation/admin-guide/media/dvb_intro.rst:signal encoded at a resolution of 768x576 24-bit color pixels over 25
	Documentation/admin-guide/media/imx.rst:1280x960 input frame to 640x480, and then /2 downscale in both
	Documentation/admin-guide/media/imx.rst:dimensions to 320x240 (assumes ipu1_csi0 is linked to ipu1_csi0_mux):
	Documentation/admin-guide/media/imx.rst:   media-ctl -V "'ipu1_csi0_mux':2[fmt:UYVY2X8/1280x960]"

which won't get the above, due to the usage of the UTF-8 alternative.

In any case, replacing all the above by 'x' seems to be the right thing,
at least to my eyes.
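
For comparison, a search can also be written to accept both spellings. The following is only a sketch in Python (not something from this series): it reuses the regular expression from the git grep command above and lets either ASCII 'x' or U+00D7 sit between the two numbers. The names RESOLUTION and find_resolutions are made up for the illustration.

```python
import re

# Same pattern as the git grep above, but the separator may be either
# ASCII 'x' or U+00D7 MULTIPLICATION SIGN.
RESOLUTION = re.compile(r"\b([1-9][0-9]+)\s*[x\u00d7]\s*([0-9]+)\b")


def find_resolutions(text):
    """Return (width, height) integer pairs found in text."""
    return [(int(w), int(h)) for w, h in RESOLUTION.findall(text)]


sample = "limit of 640 \u00d7 400 pixels; edid/1024x768.bin"
print(find_resolutions(sample))  # [(640, 400), (1024, 768)]
```

This only helps someone who already knows that both characters occur in the tree, which is the grep concern raised above.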

> If there are em dashes lying around that semantically _should_ be
>  hyphen-minus (one of your patches I've seen, for instance, fixes an
>  *en* dash moonlighting as the option character in an `ethtool`
>  command line), then sure, convert them.
> But any time someone is using a Unicode character to *express
>  semantics*, even if you happen to think the semantic distinction
>  involved is a pedantic or unimportant one, I think you need an
>  explicit grep case to justify ASCIIfying it.

Yeah, in the case of hyphen/dash it seems to make sense to double check
it.

Thanks,
Mauro
David Woodhouse May 10, 2021, 1:49 p.m. UTC | #8
On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> This patch series is doing conversion only when using ASCII makes
> more sense than using UTF-8. 
> 
> See, a number of converted documents ended with weird characters
> like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> character doesn't do any good.
> 
> Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> someone tries to use grep[1].

Replacing those makes sense. But replacing emdashes — which are a
distinct character that has no direct replacement in ASCII and which
people do *deliberately* use instead of hyphen-minus — does not.

Perhaps stick to those two, and any cases where an emdash or endash has
been used where U+002D HYPHEN-MINUS *should* have been used.

And please fix your cover letter which made no reference to 'grep', and
only presented a completely bogus argument for the change instead.
Edward Cree May 10, 2021, 1:58 p.m. UTC | #9
On 10/05/2021 14:38, Mauro Carvalho Chehab wrote:
> Em Mon, 10 May 2021 14:16:16 +0100
> Edward Cree <ecree.xilinx@gmail.com> escreveu:
>> But what kinds of things with × or — in are going to be grept for?
> 
> Actually, in almost all places, those aren't used inside math formulae, but
> instead, they describe some video resolutions:
Ehh, those are also proper uses of ×.  It's still a multiplication,
 after all.

> it is way more likely that, if someone wants to grep, they would be
> doing something like this, in order to get video resolutions:
Why would someone be grepping for "all video resolutions mentioned in
 the documentation"?  That seems contrived to me.

-ed
Matthew Wilcox (Oracle) May 10, 2021, 1:59 p.m. UTC | #10
On Mon, May 10, 2021 at 02:16:16PM +0100, Edward Cree wrote:
> On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> > The main point on this series is to replace just the occurrences
> > where ASCII represents the symbol equally well
> 
> > 	- U+2014 ('—'): EM DASH
> Em dash is not the same thing as hyphen-minus, and the latter does not
>  serve 'equally well'.  People use em dashes because — even in
>  monospace fonts — they make text easier to read and comprehend, when
>  used correctly.
> I accept that some of the other distinctions — like en dashes — are
>  needlessly pedantic (though I don't doubt there is someone out there
>  who will gladly defend them with the same fervour with which I argue
>  for the em dash) and I wouldn't take the trouble to use them myself;
>  but I think there is a reasonable assumption that when someone goes
>  to the effort of using a Unicode punctuation mark that is semantic
>  (rather than merely typographical), they probably had a reason for
>  doing so.

I think you're overestimating the amount of care and typographical
knowledge that your average kernel developer has.  Most of these
UTF-8 characters come from latex conversions and really aren't
necessary (and are being used incorrectly).

You seem quite knowledgeable about the various differences.  Perhaps
you'd be willing to write a document for Documentation/doc-guide/
that provides guidance for when to use which kinds of horizontal
line?  https://www.punctuationmatters.com/hyphen-dash-n-dash-and-m-dash/
talks about it in the context of publications, but I think we need
something more suited to our needs for kernel documentation.
Ben Boeckel May 10, 2021, 2 p.m. UTC | #11
On Mon, May 10, 2021 at 13:55:18 +0200, Mauro Carvalho Chehab wrote:
>     $ git grep "CPU 0 has been" Documentation/RCU/
>       Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
>       Documentation/RCU/Design/Data-Structures/Data-Structures.rst:|    notices that CPU 0 has been in dyntick idle mode, which qualifies  |

The kernel documentation uses hard line wraps, so such a naive grep is
going to always fail unless such line wraps are taken into account. Not
saying this isn't an improvement in and of itself, but smarter searching
strategies are likely needed anyways.
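
One possible shape for such a smarter search, as a sketch only (Python, with a hypothetical helper name find_wrapped): treat every space in the phrase as "any run of whitespace", so that a hard line wrap between two words no longer hides a match.

```python
import re


def find_wrapped(phrase, text):
    """Return True if phrase occurs in text, letting any run of whitespace
    (including a line break) stand in for each space in the phrase."""
    words = [re.escape(word) for word in phrase.split()]
    return re.search(r"\s+".join(words), text) is not None


wrapped = "When it notices that CPU 0 has\nbeen in dyntick idle mode"
print(find_wrapped("CPU 0 has been", wrapped))  # True
```

In Python str patterns, \s also matches NO-BREAK SPACE, so the same helper tolerates U+A0 between words. It does not cope with wraps inside ReST tables like the one quoted above, where "|" border characters sit between the two halves of a sentence.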

--Ben
Edward Cree May 10, 2021, 2:33 p.m. UTC | #12
On 10/05/2021 14:59, Matthew Wilcox wrote:
> Most of these
> UTF-8 characters come from latex conversions and really aren't
> necessary (and are being used incorrectly).
I fully agree with fixing those.
The cover-letter, however, gave the impression that that was not the
 main purpose of this series; just, perhaps, a happy side-effect.

> You seem quite knowledgeable about the various differences.  Perhaps
> you'd be willing to write a document for Documentation/doc-guide/
> that provides guidance for when to use which kinds of horizontal
> line?
I have Opinions about the proper usage of punctuation, but I also know
 that other people have differing opinions.  For instance, I place
 spaces around an em dash, which is nonstandard according to most
 style guides.  Really this is an individual enough thing that I'm not
 sure we could have a "kernel style guide" that would be more useful
 than general-purpose guidance like the page you linked.
Moreover, such a guide could make non-native speakers needlessly self-
 conscious about their writing and discourage them from contributing
 documentation at all.  I'm not advocating here for trying to push
 kernel developers towards an eats-shoots-and-leaves level of
 linguistic pedantry; rather, I merely think that existing correct
 usages should be left intact (and therefore, excising incorrect usage
 should only be attempted by someone with both the expertise and time
 to check each case).

But if you really want such a doc I wouldn't mind contributing to it.

-ed
Theodore Ts'o May 10, 2021, 7:22 p.m. UTC | #13
On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8. 
> > 
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> > 
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
> 
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.

I regularly use --- for em-dashes and -- for en-dashes.  Markdown will
automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
hyphens to en-dashes.  It's much, much easier for me to type 2 or 3
hyphens into my text editor of choice than trying to enter the UTF-8
characters.  If we can make sphinx do this translation, maybe that's
the best way of dealing with these two characters?

Cheers,

					- Ted
Adam Borowski May 10, 2021, 9:57 p.m. UTC | #14
On Mon, May 10, 2021 at 12:26:12PM +0200, Mauro Carvalho Chehab wrote:
> There are several UTF-8 characters at the Kernel's documentation.
[...]
> Other UTF-8 characters were added along the time, but they're easily
> replaceable by ASCII chars.
> 
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset

I'm not aware of a distribution that still allows selecting a non-UTF-8
charset in a normal flow in their installer.  And if they haven't purged
support for ancient encodings, that support is thoroughly bitrotten.
Thus, I disagree that this is a legitimate concern.

What _could_ be a legitimate reason is that someone is on a _terminal_
that can't display a wide enough set of glyphs.  Such terminals are:
 • Linux console (because of vgacon limitations; patchsets to improve
   other cons haven't been mainlined)
 • some Windows terminals (putty, old Windows console) that can't borrow
   glyphs from other fonts like fontconfig can

For the former, it's whatever your distribution ships in
/usr/share/consolefonts/ or an equivalent, which is based on historic
ISO-8859 and VT100 traditions.

For the latter, the near-guaranteed character set is WGL4.


Thus, at least two of your choices seem to disagree with the above:
[dropped]
> 	0xd7   => 'x',		# MULTIPLICATION SIGN
[retained]
> 	- U+2b0d ('⬍'): UP DOWN BLACK ARROW

× is present in ISO-8859, VT100, WGL4; I've found no font in
/usr/share/consolefonts/ on my Debian unstable box that lacks this
character.

⬍ is not found in any of the above.  You might want to at least
convert it to ↕ which is at least present in WGL4, and thus likely
to be supported in fonts heeding Windows/Mac/OpenType recommendations.
That still won't make it work on VT.
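
The ISO-8859 part of that claim is easy to check with Python's codecs. This is a sketch with a made-up helper name (in_latin1); it covers only ISO-8859-1 and says nothing about WGL4, VT100 or the console fonts, which are not available as codecs.

```python
def in_latin1(ch):
    """Return True if the character can be encoded in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# U+00D7 MULTIPLICATION SIGN, U+2B0D UP DOWN BLACK ARROW, U+2195 UP DOWN ARROW
for ch in ("\u00d7", "\u2b0d", "\u2195"):
    print(f"U+{ord(ch):04X} in ISO-8859-1: {in_latin1(ch)}")
```

Only U+00D7 passes; both arrows are outside ISO-8859-1.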


Meow!
Mauro Carvalho Chehab May 11, 2021, 9 a.m. UTC | #15
Em Mon, 10 May 2021 15:33:47 +0100
Edward Cree <ecree.xilinx@gmail.com> escreveu:

> On 10/05/2021 14:59, Matthew Wilcox wrote:
> > Most of these
> > UTF-8 characters come from latex conversions and really aren't
> > necessary (and are being used incorrectly).  
> I fully agree with fixing those.
> The cover-letter, however, gave the impression that that was not the
>  main purpose of this series; just, perhaps, a happy side-effect.

Sorry for the mess. The main reason why I wrote this series is because
there are lots of UTF-8 left-over chars from the ReST conversion.
See:
  - https://lore.kernel.org/linux-doc/20210507100435.3095f924@coco.lan/

A large set of the UTF-8 left-over chars were due to my conversion work,
so I feel personally responsible to fix those ;-)

Yet, this series has two positive side effects:

 - it helps people needing to touch the documents using non-utf8 locales[1];
 - it makes it easier to grep for text;

[1] There are still some widely used distros nowadays (LTS ones?) that
    don't set UTF-8 as default. Last time I installed a Debian machine
    I had to explicitly set UTF-8 charset after install as the default
    was ASCII encoding (can't remember if it was Debian 10 or an
    older version).

Unintentionally, I ended up giving emphasis to the non-utf8 case instead
of giving emphasis to the conversion left-overs.

FYI, this patch series originated from a discussion at linux-doc,
reporting that Sphinx breaks when LANG is not set to utf-8[2]. That's
probably why I ended up giving the wrong emphasis in the cover letter.

[2] See https://lore.kernel.org/linux-doc/20210506103913.GE6564@kitsune.suse.cz/
    for the original report. I strongly suspect that the VM set by Michal 
    to build the docs was using a distro that doesn't set UTF-8 as default.

    PS.: 
      I intend to prepare afterwards a separate fix to keep the Sphinx
      logger from crashing during Kernel doc builds when the locale charset
      is not UTF-8, but I'm not too fluent in python. So, I need some
      time to check if there is a way to just avoid python log crashes
      without touching Sphinx code and without needing to trick it to 
      think that the machine's locale is UTF-8.
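
      The underlying Python behaviour can be shown without Sphinx at all.
      The sketch below is not Sphinx code and not the planned fix; the
      helper name encode_for_stream is invented for the illustration. It
      shows that encoding a message for an ASCII-only output raises
      UnicodeEncodeError by default, and that an error handler such as
      "backslashreplace" escapes the character instead of raising.

```python
def encode_for_stream(message, encoding="ascii"):
    """Encode message for a stream with the given charset, escaping
    characters the charset cannot represent instead of raising."""
    return message.encode(encoding, errors="backslashreplace")


try:
    # Default (strict) error handling: U+00D7 is not representable in ASCII.
    "1080\u00d71920".encode("ascii")
except UnicodeEncodeError as err:
    print("strict encoding failed:", err.reason)

print(encode_for_stream("1080\u00d71920"))  # b'1080\\xd71920'
```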

See: while there was just a single document originally stored at the
Kernel tree as a LaTeX document during the time we did the conversion
(cdrom-standard.tex), there are several other documents stored as 
text that seemed to be generated by some tool like LaTeX, whose
original versions were not preserved.

Also, there were other documents using different markdown dialects 
that were converted via pandoc (and/or other similar tools). That's 
not to mention the ones that were converted from DocBook. Such
tools tend to use some logic to use "neat" versions of some ASCII
characters, like what this tool does:

	https://daringfireball.net/projects/smartypants/

(Sphinx itself seemed to use this tool on its early versions)

All tool-converted documents can carry UTF-8 in unexpected places. In
this series, a large number of patches deal with U+A0 (NO-BREAK SPACE)
chars. I can't see why someone writing a plain text document (or a ReST
one) would type a NO-BREAK SPACE instead of a normal white space.

The same applies, to some extent, to curly quotes: usually people just
write ASCII quotes in their documents, and use some tool like LaTeX
or a text editor like libreoffice in order to convert them into
 “utf-8 curly quotes”[3].

[3] Sphinx will do such things at the produced output, doing something 
    similar to what smartypants does, nowadays using this:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

    E. g.:
      - Straight quotes (" and ') turned into "curly" quote characters;
      - dashes (-- and ---) turned into en- and em-dash entities;
      - three consecutive dots (... or . . .) turned into an ellipsis char.
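
    The dash and ellipsis rules in the list above can be approximated in a
    few lines of Python. This is a simplified sketch with a hypothetical
    function name (simple_dashes), not the docutils implementation: it
    ignores quotes, the ". . ." form, and literal blocks, all of which the
    real transform has to handle.

```python
def simple_dashes(text):
    """Approximate the dash and ellipsis rules on plain text."""
    # Order matters: the three-hyphen run must be replaced before the
    # two-hyphen run, or "---" would become an en dash plus a hyphen.
    text = text.replace("---", "\u2014")  # EM DASH
    text = text.replace("--", "\u2013")   # EN DASH
    text = text.replace("...", "\u2026")  # HORIZONTAL ELLIPSIS
    return text


print(simple_dashes("pages 10--20 --- see below..."))
```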

> > You seem quite knowledgeable about the various differences.  Perhaps
> > you'd be willing to write a document for Documentation/doc-guide/
> > that provides guidance for when to use which kinds of horizontal
> > line?
> I have Opinions about the proper usage of punctuation, but I also know  
>  that other people have differing opinions.  For instance, I place
>  spaces around an em dash, which is nonstandard according to most
>  style guides.  Really this is an individual enough thing that I'm not
>  sure we could have a "kernel style guide" that would be more useful
>  than general-purpose guidance like the page you linked.

> Moreover, such a guide could make non-native speakers needlessly self-
>  conscious about their writing and discourage them from contributing
>  documentation at all.

I don't think so. As a matter of fact, as a non-native speaker, I think
this can actually help people willing to write documents.

>  I'm not advocating here for trying to push
>  kernel developers towards an eats-shoots-and-leaves level of
>  linguistic pedantry; rather, I merely think that existing correct
>  usages should be left intact (and therefore, excising incorrect usage
>  should only be attempted by someone with both the expertise and time
>  to check each case).
> 
> But if you really want such a doc I wouldn't mind contributing to it.

IMO, a document like that can be helpful. I can help reviewing it.

Thanks,
Mauro
David Woodhouse May 11, 2021, 9:19 a.m. UTC | #16
On Tue, 2021-05-11 at 11:00 +0200, Mauro Carvalho Chehab wrote:
> Yet, this series has two positive side effects:
> 
>  - it helps people needing to touch the documents using non-utf8 locales[1];
>  - it makes it easier to grep for text;
> 
> [1] There are still some widely used distros nowadays (LTS ones?) that
>     don't set UTF-8 as default. Last time I installed a Debian machine
>     I had to explicitly set UTF-8 charset after install as the default
>     was ASCII encoding (can't remember if it was Debian 10 or an
>     older version).

This whole line of thinking is fundamentally wrong.

A given set of characters in a "text file" are encoded with a specific
character set / encoding. To interpret that file and convert the bytes
back to characters, we need to use the *same* charset.

That charset is a property of the text file, and each text file or
piece of text in a system (like this email, which will contain a
Content-Type: header indicating the charset) might be encoded with a
*different* character set.
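
A short Python illustration of that point, as a sketch only: the same bytes read back correctly when decoded with the charset they were written in, and turn into Mojibake when decoded with another one. cp1252 is picked here purely as an example of a wrong "default".

```python
text = "emdashes \u2014 deliberately"  # contains U+2014 EM DASH
raw = text.encode("utf-8")             # the bytes as stored in the file

right = raw.decode("utf-8")            # same charset: round-trips
wrong = raw.decode("cp1252")           # different charset: Mojibake

print(right == text)  # True
print(wrong == text)  # False
print(ascii(wrong))   # 'emdashes \xe2\u20ac\u201d deliberately'
```

The three bytes of the em dash come back as three unrelated characters, which is exactly the double-conversion damage described further down.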

In the days before you could connect computers together — or before you
could exchange data between computers in different countries, at least
— perhaps it made sense to store 'text' files without explicitly noting
their encoding. And to interpret them using some kind of "default"
character set.

Those days are long gone. You're trying to work around an egregiously
stupid bug, if you're trying to pander to "default" encodings. There
*is* no default encoding that even makes sense, except perhaps UTF-8.
To *speak* of them as you did shows a misunderstanding of how broken
they are. It's *precisely* that kind of half-baked thinking which
always used to lead to stupid assumptions and double conversions and
Mojibake. Before we just standardised on UTF-8 everywhere and it
stopped mattering so much.

Just don't.

Now, you *can* make this work if you really insist on it, even for
systems with EBCDIC as their default encoding. Just make git do the
"convert to local charset" on checkout, precisely the same way as it
does CRLF for Windows systems. But it's stupid and anachronistic, so I
don't really see the point.
Mauro Carvalho Chehab May 11, 2021, 9:25 a.m. UTC | #17
Em Mon, 10 May 2021 14:49:44 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8. 
> > 
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> > 
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].  
> 
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.
> 
> Perhaps stick to those two, and any cases where an emdash or endash has
> been used where U+002D HYPHEN-MINUS *should* have been used.

Ok. I'll rework the series excluding EM/EN DASH chars from it.
I'll then manually apply the changes for EM/EN DASH chars
(probably on a separate series) where it seems to fit. That should
make it easier to discuss such replacements.
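
As a sketch of what the uncontested subset looks like, here is a Python equivalent of the perl map from the cover letter, cut down to the two characters nobody objected to. The names UNCONTESTED and clean are made up for the illustration; this is not the script used for the series.

```python
# Only the two replacements agreed on in this thread:
UNCONTESTED = {
    0xFEFF: None,    # ZERO WIDTH NO-BREAK SPACE: dropped
    0x00A0: " ",     # NO-BREAK SPACE: replaced by a plain space
}


def clean(text):
    """Apply the reduced map; every other character is left untouched."""
    return text.translate(UNCONTESTED)


print(ascii(clean("\ufeffCPU\u00a00 \u2014 idle")))  # 'CPU 0 \u2014 idle'
```

The em dash passes through unchanged, so dash replacements stay a separate, manual decision.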

> And please fix your cover letter which made no reference to 'grep', and
> only presented a completely bogus argument for the change instead.

OK!

Regards,
Mauro
Mauro Carvalho Chehab May 11, 2021, 9:37 a.m. UTC | #18
Em Mon, 10 May 2021 15:22:02 -0400
"Theodore Ts'o" <tytso@mit.edu> escreveu:

> On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> > On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:  
> > > This patch series is doing conversion only when using ASCII makes
> > > more sense than using UTF-8. 
> > > 
> > > See, a number of converted documents ended with weird characters
> > > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > > character doesn't do any good.
> > > 
> > > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > > someone tries to use grep[1].  
> > 
> > Replacing those makes sense. But replacing emdashes — which are a
> > distinct character that has no direct replacement in ASCII and which
> > people do *deliberately* use instead of hyphen-minus — does not.  
> 
> I regularly use --- for em-dashes and -- for en-dashes.  Markdown will
> automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
> hyphens to en-dashes.  It's much, much easier for me to type 2 or 3
> hyphens into my text editor of choice than trying to enter the UTF-8
> characters. 

Yeah, typing those UTF-8 chars is a lot harder than typing -- and ---
on several text editors ;-)

Here, I only type UTF-8 chars for accents (my US-layout keyboards are 
all set to US international, so typing those is easy).

> If we can make sphinx do this translation, maybe that's
> the best way of dealing with these two characters?

Sphinx already does that by default[1], using smartquotes:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

Those are the conversions that are done there:

      - Straight quotes (" and ') turned into "curly" quote characters;
      - dashes (-- and ---) turned into en- and em-dash entities;
      - three consecutive dots (... or . . .) turned into an ellipsis char.

So, we can simply type ASCII single/double quotes, hyphens and dots
to get curly quotes, dashes and ellipses.

[1] There's a way to disable it in conf.py, but in the Kernel this is
    kept at its default: to automatically do such conversions. 
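
    For reference, the relevant Sphinx options look like the fragment
    below. It is illustrative only and is not copied from the Kernel's
    conf.py; the values shown are the documented Sphinx defaults as far
    as I know, so setting them explicitly should change nothing.

```python
# conf.py fragment (illustrative; not taken from the Kernel tree)

# Convert quotes, dashes and ellipses in the generated output.
# Setting this to False disables the conversion.
smartquotes = True

# Which conversions to apply: q = quotes, D = "--"/"---" dashes,
# e = ellipses.
smartquotes_action = "qDe"
```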

Thanks,
Mauro