Introduction
It all started with a short script that was supposed to combine employees' e-mail addresses, taken from a list of mailing-list users, with their job titles, taken from the personnel department's database. Both lists were exported to UTF-8 text files with Unix line endings.
Contents of mail.txt
;ia@example.com
Contents of buhg.txt
;
;
;
;
To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:
$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: ;
A visual check of the sorted output showed that the order is correct in general, but where a male surname and its female form otherwise coincide, the woman is placed before the man:
$> sort buhg.txt
;
;
;
;
It looks like either a glitch in Unicode sorting or a manifestation of feminism in the sorting algorithm. The first is, of course, more plausible.
For now, let's set join aside and focus on sort. Let's try to solve the problem by scientific trial and error. First, change the locale from en_US to ru_RU. Setting the LC_COLLATE environment variable would be enough for sorting, but let's not do things by halves:
$> LANG=ru_RU.UTF-8 sort buhg.txt
;
;
;
;
.
:
$> iconv -f UTF-8 -t KOI8-R buhg.txt \
| LANG=ru_RU.KOI8-R sort \
| iconv -f KOI8-R -t UTF8
.
Unix sort treats '-' (dash) characters as invisible: the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
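For contrast, here is how the same three strings come out in plain byte order. This is a minimal sketch that assumes only a POSIX shell and sort. In the "C" locale the dash is an ordinary byte (0x2D) that is smaller than any letter, so "a-b" comes first. The commented-out line repeats the experiment in a UTF-8 locale. Whether the dash is ignored there depends on the installed glibc version and locale data, so that output is not shown.

```shell
# Byte-wise ordering: '-' (0x2D) is less than 'a' (0x61), so "a-b" sorts first.
c_order=$(printf '%s\n' a-b aa ac | LC_ALL=C sort | tr '\n' ' ')
echo "C locale: $c_order"

# Locale-aware ordering; requires en_US.UTF-8 to be installed.
# The result may differ between glibc versions.
# printf '%s\n' a-b aa ac | LC_ALL=en_US.UTF-8 sort
```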
Next, try the "C" locale:
$> LANG=C sort buhg.txt
;
;
;
;
- . , - . :
$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result
, . .
, β CP1251:
$> iconv -f UTF-8 -t CP1251 buhg.txt \
| LANG=ru_RU.CP1251 sort \
| iconv -f CP1251 -t UTF8
, , "C", , , . -.
, , , . sort LC_COLLATE .
:
- LANG=ru_RU.CP1251 LANG=C
- sort join
- ,
β 10 Unicode collation algorithm unicode.org. , .
Collation β "" β . ("", "", ""), , .
β . , - , , . Γ P, CP850 ΓΏ Γ.
"" , , . UTF8, UTF16 KOI8-R ( ) , .
, , . , , . , Γ AE. Γ , Z. , Γ , . Ch, H I.
, . : , ? . (¿Te gusta la música?). , , ?
. , , , , . , ( ) . , , , , - .
, :
- ;
- , , (A + , Γ
);
- , , (Ch ) (Γ );
- (, /, , ) ();
- , , ( {β¦ } bash);
- .
, , :
- ( );
- ;
- , ;
- ( x < y , xz < yz);
- , , . , ;
- , , . β , (. );
- / .
, . , , (Beatles, The).
, ( ) .
, . , . ( ), -. IGNORED (0x0) , . , . .
, β "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .
DUCET :
: ; , ; , , .
, . , , .
( ) β 10 β "Unicode Collation Algorithm" (UCA).
. .
UCA , , DUCET. . , , ( 1F000 ). β , β ,2,3β¦ .
DUCET , , , - β "International Components for Unicode" (ICU).
, IBM, , . , , .
;
;
;
;
, ICU . Collation FAQ .
, sort Linux - .
glibc
sort GNU Core Utils , LC_COLLATE :
$> sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules
strcoll, glibc.
wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .
wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).
, , , .
ISO 14651/14652
CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .
iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.
:
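By default, the escape character is \ and the comment character is #. Both can be redefined, which is what this file does: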
escape_char /
comment_char %
<Uxxxx> <Uxxxxxxxx> ( x β ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .
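To find the symbol that corresponds to a particular character, its code point can be printed in this notation. The following is a small sketch, assuming iconv and od are available. The function name char_to_symbol is made up for illustration, and it handles exactly one UTF-8 encoded character.

```shell
# Hypothetical helper: print the <Uxxxx> / <Uxxxxxxxx> name of one UTF-8 character.
char_to_symbol() {
    # Convert to UTF-32BE (4 bytes per character) and dump as upper-case hex.
    hex=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -v -tx1 |
          tr -d ' \n' | tr 'abcdef' 'ABCDEF')
    # Exactly 8 hex digits means exactly one character was given.
    [ "${#hex}" -eq 8 ] || return 1
    case $hex in
        0000*) printf '<U%s>\n' "${hex#0000}" ;;  # fits in four digits
        *)     printf '<U%s>\n' "$hex" ;;         # needs the eight-digit form
    esac
}

char_to_symbol "$(printf '\320\264')"   # CYRILLIC SMALL LETTER DE
```

The sketch only produces the four-digit and eight-digit forms. The file excerpts later in this article also contain a five-digit spelling, <U1D7D1>, which it does not generate.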
LC_COLLATE , , .
. , , . collating-symbol ( ), , , .
900 . , .
LC_COLLATE
collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"
- collating-symbol <OSMANYA> OSMANYA
- collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
- FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
- <U0413> UCS-4
- collating-element <U0413_0301> from "<U0413><U0301>" .
, . -, . "" , "". , . . , , , .
% Symbolic weight assignments
% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE
, .
order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .
:
order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
. :
<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...
, ASCII ( ) . , , , . ( ) :
, ( <CAP> <MIN>), .
LC_COLLATE=C ,
static const uint32_t collseqwc[] =
{
8, 1, 8, 0x0, 0xff,
/* 1st-level table */
6 * sizeof (uint32_t),
/* 2nd-level table */
7 * sizeof (uint32_t),
/* 3rd-level table */
L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',
...
L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};
, , .
, , , CTT . localedef.
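localedef takes a locale definition file (the -i option) and a character map (the -f option) and compiles them into a binary locale.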
Glibc : "" "".
, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .
/usr/lib/locale/locale-archive, , glibc. β , . ru_RU.KOI8-R, ru_RU.koi8r.
, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .
For example, the command
localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC
compiles /usr/share/i18n/locales/ru_RU using the character map /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and adds the result to /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic.
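The renaming seen here (MAC-CYRILLIC becomes maccyrillic, KOI8-R becomes koi8r) can be approximated in a few lines. This is only a sketch of the visible rule: keep the letters and digits of the codeset part and lower-case them. The function name normalize_locale_name is hypothetical. Modifiers such as @euro and the special handling glibc applies to purely numeric codesets are deliberately not covered.

```shell
# Hypothetical helper: approximate how glibc normalizes the codeset part
# of a locale name (keep letters and digits only, lower-case them).
normalize_locale_name() {
    case $1 in
        *.*)
            lang=${1%%.*}
            codeset=$(printf '%s' "${1#*.}" | LC_ALL=C tr -cd 'A-Za-z0-9' |
                      LC_ALL=C tr 'A-Z' 'a-z')
            printf '%s.%s\n' "$lang" "$codeset"
            ;;
        *)  printf '%s\n' "$1" ;;   # no codeset: nothing to normalize
    esac
}

normalize_locale_name ru_RU.MAC-CYRILLIC
normalize_locale_name ru_RU.KOI8-R
```

One possible use is checking whether a locale is already in the archive, for example `locale -a | grep -Fx "$(normalize_locale_name ru_RU.KOI8-R)"`.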
With LANG=en_US.UTF-8, glibc looks for the locale in the following places:
/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/
, .
The installed locales can be listed with locale -a.
, , . , , ASCII.
: localedef.
, , ISO 14652 . reorder-after , . reorder-end. , .
iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE
, LC_IDENTIFICATION ru_MY, , locale-archive.
localedef I18NPATH , :
$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8
POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .
:
$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
;
;
;
;
! !
, , β .
.
sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .
"C" , , . ( , ), . , join , , . :
$> sort -t \; -k 1,1 buhg.txt > buhg.srt
$> sort -t \; -k 1,1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
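The same recipe can be tried on a tiny self-contained sample. The file names and contents below are made up. LC_ALL=C is used only so that the sketch behaves the same on any machine. What matters is that sort and join run under one locale and compare the same key. The key is written as -k 1,1 so that it stops at the end of the first field. A bare -k 1 extends to the end of the line, which would order these two sample lines the other way round.

```shell
# Made-up sample data; the real files are buhg.txt and mail.txt.
tmp=$(mktemp -d)
printf '%s\n' 'smith jr;clerk' 'smith;accountant' > "$tmp/posts.txt"
printf '%s\n' 'smith;s@example.com' 'smith jr;sj@example.com' > "$tmp/mail.txt"

# -k 1,1 limits the sort key to the first field; join -t ';' compares the same field.
LC_ALL=C sort -t ';' -k 1,1 "$tmp/posts.txt" > "$tmp/posts.srt"
LC_ALL=C sort -t ';' -k 1,1 "$tmp/mail.txt"  > "$tmp/mail.srt"
result=$(LC_ALL=C join -t ';' "$tmp/posts.srt" "$tmp/mail.srt")
printf '%s\n' "$result"
rm -rf "$tmp"
```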
CP1251 . , Linux ru_RU.CP1251. , sort , .
, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .
$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8
$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison
, , , LC_COLLATE=C.
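Because a missing locale silently degrades to byte comparison, it can be worth checking which rules sort really applies before trusting its output. The following is a sketch, assuming GNU coreutils (sort --debug is a GNU extension). The helper name sort_rules is made up, and the exact wording of the diagnostic varies between coreutils versions, so the pattern only matches on its stable parts.

```shell
# Hypothetical helper: report which comparison rules GNU sort would use
# for a given LC_COLLATE value. Requires GNU coreutils (sort --debug).
sort_rules() {
    # The diagnostic goes to stderr; feed sort an empty input.
    # An empty LC_ALL lets LC_COLLATE take effect; LC_MESSAGES=C keeps
    # the wording in English so that grep can match it.
    LC_ALL= LC_MESSAGES=C LC_COLLATE=$1 sort --debug </dev/null 2>&1 |
        grep -E 'sorting rules|byte comparison'
}

sort_rules C
```

Calling it with a locale name that is not installed should report byte comparison, just as in the LOCPATH=/tmp experiment.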
, .
, , ls -a , , , , Midnight Commander, , , , .
Unicode Technical Standard #10: Unicode Collation Algorithm
unicode.org
ICU: Unicode implementation from IBM
ICU
ISO 14651
Description of file format with weights ISO 14652
Glibc string comparison discussion