# Collating element in RE doesn't work



## Seeker (Jan 9, 2014)

From re_format(7):


> Within a bracket expression, a collating element (a character, a multi-
> character sequence that collates as if it were a single character, or a
> collating-sequence name for either) enclosed in `[.` and `.]` stands for
> the sequence of characters of that collating element.  The sequence is a
> ...




```
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[[.chte.]]*'
egrep: Invalid collation character

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[.chte.]+'
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[chte]+'
```

Both commands give the SAME output and the same (colored) matches.

The same goes for an equivalence class:

```
# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[[=chte=]]+'
egrep: Invalid collation character

# echo 'lhhhocate chchccccclulzc dlen' | egrep --color '[=chte=]+'
```

All give the same result! Is this a bug?


----------



## worldi (Jan 11, 2014)

Please note that the word _collating_ must not be taken literally here. It's a language/encoding thing: in some languages combinations of certain letters are part of the alphabet and are treated as single letters (which is important for sorting, etc.). So the current set of these "collating elements" is language specific. It can be changed via the LC_COLLATE environment variable.

re_format(7) contains at least two errors:

1. It fails to mention that the 'ch' example is for Spanish. It requires LC_COLLATE to be set to a specific value (like "es_ES.UTF-8"), and
2. It is outdated because the Spanish alphabet was redefined and 'ch' is not considered a "collating element" anymore.

That said, neither collating elements nor equivalence classes seem to be taken into account for regular expressions:

```
% (export LC_ALL=de_DE.ISO8859-1; echo Motörhead | sed -E 's/[[=o=]]/_/g')
M_törhead
% (export LC_ALL=hu_HU.ISO8859-2; echo ty | grep -E '^[s-u]*$')
%
```
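
As for why `[.chte.]+` and `[chte]+` matched identically in the first post: as far as I can tell, `[.` and `.]` only have a special meaning *inside* a bracket expression. Without the outer brackets, `[.chte.]` is just an ordinary bracket expression listing the characters `.`, `c`, `h`, `t` and `e`. A minimal sketch to check this, assuming a `grep` that supports `-o` and `-E` (the variable names are only for illustration, and the C locale is forced so the result does not depend on which locales are installed):

```shell
# Force the C locale so the result is independent of installed locales.
export LC_ALL=C

sample='lhhhocate chchccccclulzc dlen'

# -o prints each matched substring on its own line.
out_plain=$(printf '%s\n' "$sample" | grep -oE '[chte]+')
out_unwrapped=$(printf '%s\n' "$sample" | grep -oE '[.chte.]+')

# The sample contains no dot, so both patterns match the same substrings.
if [ "$out_plain" = "$out_unwrapped" ]; then
    echo "same matches"
fi

# The only difference: the unwrapped form also matches a literal dot.
dot_only=$(printf '%s\n' 'a.b' | grep -oE '[.chte.]+')
echo "$dot_only"
```

If that reading is right, the form with outer brackets (`[[.chte.]]`) is the only one that actually asks for a collating element, which would explain why it alone is rejected when `chte` is not defined as one in the current locale.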


----------



## Seeker (Jan 18, 2014)

Thank you for the clarification.
I really had no idea that collating had to do with language/encoding.


----------

