Introduction

It all started with a short script that had to combine employee e-mail addresses, taken from the mailing-list user list, with employee job titles, taken from the HR department's database. Both lists were exported to UTF-8 Unicode text files and saved with Unix line endings.

Contents of mail.txt

 ;ia@example.com

Contents of buhg.txt

 ;
 ;
 ;
 ;

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted:  ;

Inspecting the sort output by eye showed that the order is generally correct, but where male and female surnames coincide, the women come before the men:

$> sort buhg.txt
 ;
 ;
 ;
 ;

This looks like either a Unicode sorting problem or a manifestation of feminism in the sorting algorithm. The former is, of course, more plausible.

For now, let's set join aside and concentrate on sort. Let's try to solve the problem by scientific trial and error. First, change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but let's not skimp and switch the whole locale:

$> LANG=ru_RU.UTF-8 sort buhg.txt
 ;
 ;
 ;
 ;

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

unix sort treats '-' (dash) characters as invisible. In short, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
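The dash behaviour quoted above is easy to contrast with plain byte ordering. A minimal sketch, assuming a POSIX shell and a sort that honours LC_ALL; the three strings are the ones from the quoted report, and the variable name c_order is only illustrative:

```shell
# Sort the three sample strings by raw byte value.
# In the C locale '-' (0x2D) is an ordinary byte and sorts before 'a' (0x61).
c_order=$(printf '%s\n' 'ac' 'aa' 'a-b' | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

In the C locale this prints a-b, aa, ac. Under a UTF-8 locale the same pipeline may put "aa" first, depending on the glibc version and the collation table of the locale.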

: "C" . :

$> LANG=C sort buhg.txt
 ;
 ;
 ;
 ;

- . , - . :

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result
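Since join needs both inputs ordered under the same collation it uses itself, it can help to check the order explicitly before joining. A small sketch, assuming a sort that supports the standard -c check option; the helper name is_sorted_c is hypothetical. LC_ALL is used instead of LANG so that an inherited LC_ALL or LC_COLLATE cannot override the choice:

```shell
# Succeed if the file is already in C-locale (byte) order.
# 'sort -c' only checks the order; its diagnostic for the first
# misplaced line is discarded here.
is_sorted_c() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}

# Possible use with the files from the text:
#   is_sorted_c buhg.srt && is_sorted_c mail.srt &&
#       LC_ALL=C join buhg.srt mail.srt > result
```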

, . .

, — CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

, , "C", , , . -.

, , , . sort LC_COLLATE .

LANG=ru_RU.CP1251 LANG=C
sort join
,

№ 10 Unicode collation algorithm unicode.org. , .

Collation — "" — . ("", "", ""), , .

— . , - , , . Ö P, CP850 ÿ Ü.

"" , , . UTF8, UTF16 KOI8-R ( ) , .

, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.

, . : , ? . (¿Te gusta la música?). , , ?

. , , , , . , ( ) . , , , , - .

, :

;
, , (A + , Å);
, , (Ch ) (Æ );
(, /, , ) ();
, , ( {… } bash);
.

, , :

( );
;
, ;
( x < y , xz < yz);
, , . , ;
, , . — , (. );
/ .

, . , , (Beatles, The).

, ( ) .

, . , . ( ), -. IGNORED (0x0) , . , . .

, — "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .

DUCET :

, , ( ) ;
;
;
.

: ; , ; , , .

, . , , .

( ) № 10 — "Unicode Collation Algorithm" (UCA).

. .

UCA , , DUCET. . , , ( 1F000 ). — , — ,2,3… .

DUCET , , , - — "International Components for Unicode" (ICU).

, IBM, , . , , .

 ;
 ;
 ;
 ;

, ICU . Collation FAQ .

, sort Linux - .

glibc

sort GNU Core Utils , LC_COLLATE :

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

strcoll, glibc.

wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .

wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).

, , , .

ISO 14651/14652

CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .

iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.

, \ , # . , :

escape_char /
comment_char %

<Uxxxx> <Uxxxxxxxx> ( x — ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .

LC_COLLATE , , .

. , , . collating-symbol ( ), , , .

900 . , .

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

collating-symbol <OSMANYA> OSMANYA
collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
<U0413> UCS-4
collating-element <U0413_0301> from "<U0413><U0301>" .

, . -, . "" , "". , . . , , , .

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

, .

order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .

,
, , , , .

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end

. :

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

, ASCII ( ) . , , , . ( ) :

, ( <CAP> <MIN>), .

LC_COLLATE=C ,

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

, , .

, , , CTT . localedef.

localedef ( -i), , ( -f). , .

Glibc : "" "".

, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .

/usr/lib/locale/locale-archive, , glibc. — , . ru_RU.KOI8-R, ru_RU.koi8r.

, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

/usr/share/i18n/locales/ru_RU /usr/share/i18n/charmaps/MAC-CYRILLIC.gz /usr/lib/locale/locale-archive ru_RU.maccyrillic

LANG=en_US.UTF-8 glibc :

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

, .

locale -a.
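A sketch for checking whether a given locale is installed, assuming glibc's locale -a, which prints codesets in normalized form (ru_RU.koi8r rather than ru_RU.KOI8-R). The helper name has_locale is hypothetical, and names with an @modifier are not handled:

```shell
# Succeed if the named locale appears in the output of 'locale -a'.
# The codeset part is normalized the way the listing shows it:
# lower case, non-alphanumeric characters removed (UTF-8 -> utf8).
has_locale() {
    case $1 in
        *.*)
            codeset=$(printf '%s' "${1#*.}" | tr -d -c '[:alnum:]' | tr '[:upper:]' '[:lower:]')
            want="${1%%.*}.$codeset"
            ;;
        *)
            want=$1
            ;;
    esac
    # Accept either the normalized or the literal spelling.
    locale -a 2>/dev/null | grep -qxF -e "$want" -e "$1"
}
```

For example, has_locale ru_RU.KOI8-R looks for ru_RU.koi8r in the list.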

, , . , , ASCII.

: localedef.

, , ISO 14652 . reorder-after , . reorder-end. , .

iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

, LC_IDENTIFICATION ru_MY, , locale-archive.

localedef I18NPATH , :

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
 ;
 ;
 ;
 ;

! !

, , — .

sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .

"C" , , . ( , ), . , join , , . :

$> sort -t \; -k 1,1 buhg.txt > buhg.srt
$> sort -t \; -k 1,1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
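The three commands can be tried end to end on throwaway data. A sketch under stated assumptions: the names, job titles and the second address below are ASCII placeholders invented for illustration (not the original file contents), the sort key is restricted to the first field with -k 1,1, and LC_ALL=C keeps the result independent of the installed locales:

```shell
work=$(mktemp -d)

# Placeholder input: "name;job title" and "name;e-mail".
printf '%s\n' 'petrova;cashier' 'ivanov;accountant' > "$work/buhg.txt"
printf '%s\n' 'petrova;p@example.com' 'ivanov;ia@example.com' > "$work/mail.txt"

# Sort both files on the first ';'-separated field only.
LC_ALL=C sort -t ';' -k 1,1 "$work/buhg.txt" > "$work/buhg.srt"
LC_ALL=C sort -t ';' -k 1,1 "$work/mail.txt" > "$work/mail.srt"

# Join on the first field, using the same separator and the same locale.
joined=$(LC_ALL=C join -t ';' "$work/buhg.srt" "$work/mail.srt")
printf '%s\n' "$joined"

rm -rf "$work"
```

Each output line holds the join field followed by the remaining fields of the first and then the second file, for example ivanov;accountant;ia@example.com.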

CP1251 . , Linux ru_RU.CP1251. , sort , .

, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison
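The same check can be wrapped in a helper that reports which comparison sort will use for a given locale setting. A sketch assuming GNU sort, whose --debug option writes this note to standard error; the exact wording differs between coreutils versions, and the function name sort_rules is hypothetical:

```shell
# Print the diagnostic that 'sort --debug' emits for the given LC_ALL value.
# Standard error is routed into the pipe, the (empty) sorted output is dropped.
sort_rules() {
    printf '' | LC_ALL=$1 sort --debug 2>&1 >/dev/null
}

sort_rules C
```

With C as the argument the message should mention simple byte comparison; with a locale name that cannot be loaded, sort falls back to the same byte comparison, as in the LOCPATH=/tmp run above.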

, , , LC_COLLATE=C.

, .

, , ls -a , , , , Midnight Commander, , , , .

Unicode Technical Standard #10: Unicode Collation Algorithm

Description of the ISO 14652 weights file format

Discussion of string comparison in glibc
