How sort in Linux sorts strings

Introduction


It all started with a short script that was supposed to combine employees' email addresses, taken from the user list of the mailing list, with employees' positions, taken from the HR department's database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.


Contents of mail.txt


 ;ia@example.com

Contents of buhg.txt


 ;
 ;
 ;
 ;

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:


$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted:  ;

Inspecting the sort output by eye showed that the ordering is correct in general, but when a male and a female surname coincide, the women come before the men:


$> sort buhg.txt
 ;
 ;
 ;
 ;

It looks like either a glitch in Unicode sorting or a manifestation of feminism in the sorting algorithm. The first, of course, is more plausible.


For now, let's set join aside and concentrate on sort. Let's try to solve the problem by trial and error (the "scientific poke" method). First, change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but let's not be petty:


$> LANG=ru_RU.UTF-8 sort buhg.txt
 ;
 ;
 ;
 ;

.
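
Which variable actually decides the collation order is easy to get wrong, so here is a minimal sketch of the standard POSIX precedence: LC_ALL overrides LC_COLLATE, which overrides LANG. The function name effective_collate is hypothetical and exists only for illustration. It mirrors the lookup order and does not check whether the named locale is installed.

```shell
# effective_collate LC_ALL LC_COLLATE LANG
# Hypothetical helper: prints the locale name that would be used for
# collation, following the POSIX precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        # Nothing is set: the "C"/POSIX locale is used.
        printf 'C\n'
    fi
}

# Report the value for the current environment.
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

If LC_ALL is already set in the environment, prefixing a command with LANG=... alone will not change how sort orders its input.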


:


$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF-8

.


Unix sort treats '-' (dash) characters as invisible. For example, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
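
The effect is easiest to see next to plain byte ordering. The sketch below sorts the same three strings once in the "C" locale and once in whatever locale the current environment selects. The variable name byte_order is just for illustration, and the second result depends on the installed locale data, so no particular output is claimed for it.

```shell
# Sort by raw byte value (the "C" locale).
# '-' is byte 0x2D, which is smaller than 'a' (0x61), so "a-b" comes first.
byte_order=$(printf '%s\n' 'aa' 'a-b' 'ac' | LC_ALL=C sort)
printf '%s\n' "$byte_order"

# The same input under the locale of the current environment.
# With locale data that ignores the dash, "a-b" lands between "aa" and "ac".
printf '%s\n' 'aa' 'a-b' 'ac' | sort
```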


: "C" . :


$> LANG=C sort buhg.txt
 ;
 ;
 ;
 ;

- . , - . :


$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

, . .


, — CP1251:


$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF-8

, , "C", , , . -.


, , , . sort LC_COLLATE .


:


  • LANG=ru_RU.CP1251 LANG=C
  • sort join
  • ,


№ 10 Unicode collation algorithm unicode.org. , .


Collation — "" — . ("", "", ""), , .


— . , - , , . Ö P, CP850 ÿ Ü.


"" , , . UTF8, UTF16 KOI8-R ( ) , .


, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.


, . : , ? . (¿Te gusta la música?). , , ?


. , , , , . , ( ) . , , , , - .


, :


  • ;
  • , , (A + , Å);
  • , , (Ch ) (Æ );
  • (, /, , ) ();
  • , , ( {… } bash);
  • .

, , :


  • ( );
  • ;
  • , ;
  • ( x < y , xz < yz);
  • , , . , ;
  • , , . — , (. );
  • / .

, . , , (Beatles, The).


, ( ) .


, . , . ( ), -. IGNORED (0x0) , . , . .


, — "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .


DUCET :


  • , , ( ) ;
  • ;
  • ;
  • .

: ; , ; , , .


, . , , .


( ) № 10 — "Unicode Collation Algorithm" (UCA).


. .


UCA , , DUCET. . , , ( 1F000 ). — , — ,2,3… .


DUCET , , , - — "International Components for Unicode" (ICU).


, IBM, , . , , .


 ;
 ;
 ;
 ;

, ICU . Collation FAQ .


, sort Linux - .


glibc


sort GNU Core Utils , LC_COLLATE :


$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

strcoll, glibc.


wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .


wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).


, , , .


ISO 14651/14652


CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .


iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.


:


, \ , # . , :


escape_char /
comment_char %

<Uxxxx> <Uxxxxxxxx> ( x — ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .


LC_COLLATE , , .


. , , . collating-symbol ( ), , , .


900 . , .


LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> OSMANYA
  • collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
  • FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
  • <U0413> UCS-4
  • collating-element <U0413_0301> from "<U0413><U0301>" .

, . -, . "" , "". , . . , , , .


% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

, .


order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .


:


  • ,
  • , , , , .

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end

. :


<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

, ASCII ( ) . , , , . ( ) :






, ( <CAP> <MIN>), .


LC_COLLATE=C ,


static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

, , .



, , , CTT . localedef.


localedef ( -i), , ( -f). , .


Glibc : "" "".


, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .


/usr/lib/locale/locale-archive, , glibc. — , . ru_RU.KOI8-R, ru_RU.koi8r.


, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .


,


localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

/usr/share/i18n/locales/ru_RU /usr/share/i18n/charmaps/MAC-CYRILLIC.gz /usr/lib/locale/locale-archive ru_RU.maccyrillic
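
The way the codeset part of the name changes (KOI8-R becomes koi8r, MAC-CYRILLIC becomes maccyrillic) can be approximated in a few lines. This is only a sketch of the visible renaming, not glibc's actual code: the function name normalize_locale_name is hypothetical, and it assumes the name carries no @modifier suffix.

```shell
# Hypothetical helper: approximate how the codeset in a locale name is
# normalised for locale-archive. The part after the dot is lower-cased
# and everything except ASCII letters and digits is dropped.
normalize_locale_name() {
    case $1 in
        *.*)
            lang=${1%%.*}
            codeset=${1#*.}
            codeset=$(printf '%s' "$codeset" \
                | LC_ALL=C tr 'A-Z' 'a-z' \
                | LC_ALL=C tr -cd 'a-z0-9')
            printf '%s.%s\n' "$lang" "$codeset"
            ;;
        *)
            # No codeset part: leave the name untouched.
            printf '%s\n' "$1"
            ;;
    esac
}

normalize_locale_name ru_RU.KOI8-R
normalize_locale_name ru_RU.MAC-CYRILLIC
```

The normalised form is the one to look for in the output of locale -a.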


LANG=en_US.UTF-8 glibc :


/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

, .


locale -a.



, , . , , ASCII.


: localedef.


, , ISO 14652 . reorder-after , . reorder-end. , .


iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .


LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

, LC_IDENTIFICATION ru_MY, , locale-archive.


localedef I18NPATH , :


$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .


:


$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
 ;
 ;
 ;
 ;

! !



, , — .


.


sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .


"C" , , . ( , ), . , join , , . :


$> sort -t \; -k 1,1 buhg.txt > buhg.srt
$> sort -t \; -k 1,1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
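
A self-contained version of the same pipeline, as a sketch: the file contents are made-up sample data, the variable names workdir and result are for illustration only, and the key is restricted to the first field with -k 1,1 so that sort compares exactly the field that join will later compare (plain -k 1 would extend the key to the end of the line).

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)

# Made-up sample data: "name;position" and "name;email".
printf '%s\n' 'petrov;manager' 'ivanova;engineer' 'ivanov;accountant' \
    > "$workdir/buhg.txt"
printf '%s\n' 'ivanova;ivanova@example.com' 'petrov;petrov@example.com' \
    'ivanov;ivanov@example.com' > "$workdir/mail.txt"

# Sort both files on the first ';'-separated field only.
sort -t ';' -k 1,1 "$workdir/buhg.txt" > "$workdir/buhg.srt"
sort -t ';' -k 1,1 "$workdir/mail.txt" > "$workdir/mail.srt"

# Join on that same field, with the same separator.
result=$(join -t ';' "$workdir/buhg.srt" "$workdir/mail.srt")
printf '%s\n' "$result"

rm -r "$workdir"
```

Because both tools now look at the same key, join no longer sees an ordering that differs from the one sort produced.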

CP1251 . , Linux ru_RU.CP1251. , sort , .


, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .


$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison
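
The fallback can be reproduced without touching the system configuration. The sketch below assumes GNU sort and glibc. It points LOCPATH at an empty temporary directory so that no locale data can be loaded, then captures what sort --debug reports. The variable names empty_locpath and debug_msg are for illustration, and the exact wording of the message differs between coreutils versions.

```shell
# An empty directory: glibc will find no compiled locales here.
empty_locpath=$(mktemp -d)

# Ask for a UTF-8 locale, but make it impossible to load.
# --debug writes its diagnostics to stderr; stdout is discarded.
debug_msg=$(printf '%s\n' 'b' 'a' \
    | LOCPATH="$empty_locpath" LC_ALL=en_US.UTF-8 sort --debug 2>&1 >/dev/null)
printf '%s\n' "$debug_msg"

rmdir "$empty_locpath"
```

The captured text should mention simple byte comparison, which is the same behaviour as the "C" locale.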


, , , LC_COLLATE=C.


, .


, , ls -a , , , , Midnight Commander, , , , .



№10 Unicode collation algorithm


unicode.org


ICU — Unicode IBM.


ICU


ISO 14651


Description of the weights file format, ISO 14652


Discussion of string comparison in Glibc

