How Linux sorts strings

Introduction


It all started with a short script whose job was to merge employees' email addresses, taken from the mailing list's user list, with employees' job titles, taken from the HR department's database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.


Contents of mail.txt


 ;ia@example.com

Contents of buhg.txt


 ;
 ;
 ;
 ;

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:


$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted:  ;

A visual inspection of the sort output showed that the order is broadly correct, but where the masculine and feminine forms of a surname coincide, the women come before the men:


$> sort buhg.txt
 ;
 ;
 ;
 ;

This looks like either a Unicode sorting problem or a manifestation of feminism in the sorting algorithm. The former is, of course, more plausible.


For now, let's set join aside and concentrate on sort. Let's try to solve the problem by trial and error. First, switch the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but let's not do things by halves:


$> LANG=ru_RU.UTF-8 sort buhg.txt
 ;
 ;
 ;
 ;

.


:


$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

.


One relevant observation: "unix sort treats '-' (dash) characters as invisible". For example, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
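The effect can be checked with the three strings "a-b", "aa" and "ac". This is a minimal sketch, assuming GNU sort. Only the byte-order result of the "C" locale is deterministic. The result under the active locale depends on the installed glibc version and locale data, so no output is claimed for it.

```shell
# Sort the three sample strings by raw byte value ("C" locale).
# '-' is byte 0x2D, lower than any ASCII letter, so "a-b" comes first.
sorted_c=$(printf '%s\n' 'aa' 'a-b' 'ac' | LC_ALL=C sort)
printf '%s\n' "$sorted_c"

# For comparison, sort the same input under whatever locale is active.
# If the locale's collation rules ignore the dash on the first pass,
# the order becomes "aa", "a-b", "ac" instead.
printf '%s\n' 'aa' 'a-b' 'ac' | sort
```

If the two outputs differ, the active locale is applying collation rules rather than comparing bytes.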


: "C" . :


$> LANG=C sort buhg.txt
 ;
 ;
 ;
 ;

- . , - . :


$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

, . .


, — CP1251:


$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8 

, , "C", , , . -.


, , , . sort LC_COLLATE .


:


  • LANG=ru_RU.CP1251 LANG=C
  • sort join
  • ,


№ 10 Unicode collation algorithm unicode.org. , .


Collation — "" — . ("", "", ""), , .


— . , - , , . Ö P, CP850 ÿ Ü.


"" , , . UTF8, UTF16 KOI8-R ( ) , .


, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.


, . : , ? . (¿Te gusta la música?). , , ?


. , , , , . , ( ) . , , , , - .


, :


  • ;
  • , , (A + , Å);
  • , , (Ch ) (Æ );
  • (, /, , ) ();
  • , , ( {… } bash);
  • .

, , :


  • ( );
  • ;
  • , ;
  • ( x < y , xz < yz);
  • , , . , ;
  • , , . — , (. );
  • / .

, . , , (Beatles, The).


, ( ) .


, . , . ( ), -. IGNORED (0x0) , . , . .


, — "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .


DUCET :


  • , , ( ) ;
  • ;
  • ;
  • .

: ; , ; , , .


, . , , .


( ) № 10 — "Unicode Collation Algorithm" (UCA).


. .


UCA , , DUCET. . , , ( 1F000 ). — , — ,2,3… .


DUCET , , , - — "International Components for Unicode" (ICU).


, IBM, , . , , .


 ;
 ;
 ;
 ;

, ICU . Collation FAQ .


, sort Linux - .


glibc


sort GNU Core Utils , LC_COLLATE :


$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

strcoll, glibc.


wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .


wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).


, , , .


ISO 14651/14652


CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .


iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.


:


, \ , # . , :


escape_char /
comment_char %

<Uxxxx> <Uxxxxxxxx> ( x — ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .


LC_COLLATE , , .


. , , . collating-symbol ( ), , , .


900 . , .


LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> OSMANYA
  • collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
  • FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
  • <U0413> UCS-4
  • collating-element <U0413_0301> from "<U0413><U0301>" .

, . -, . "" , "". , . . , , , .


% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

, .


order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .


:


  • ,
  • , , , , .

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end

. :


<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

, ASCII ( ) . , , , . ( ) :






, ( <CAP> <MIN>), .


LC_COLLATE=C ,


static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};
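The table above maps every value from 0x00 to 0xff to itself, so under the "C" locale strings are effectively compared byte by byte. A small sketch of what that means for ASCII input, assuming GNU sort (the sample strings are made up for illustration):

```shell
# In byte order, digits (0x30-0x39) precede uppercase letters (0x41-0x5A),
# which precede '_' (0x5F), which precedes lowercase letters (0x61-0x7A).
c_order=$(printf '%s\n' 'b' 'B' '1' '_' 'a' 'A' | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

Upper and lower case are not interleaved here, unlike the ordering produced by the multi-level weight tables described earlier.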

, , .



, , , CTT . localedef.


localedef ( -i), , ( -f). , .


Glibc : "" "".


, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .


/usr/lib/locale/locale-archive, , glibc. — , . ru_RU.KOI8-R, ru_RU.koi8r.


, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .


,


localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

compiles the locale definition /usr/share/i18n/locales/ru_RU with the character map /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and adds the result to /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic


With LANG=en_US.UTF-8, glibc looks for the locale in the following places:


/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/enUTF-8/
/usr/lib/locale/en/

, .


locale -a.
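Before relying on a particular locale in a script, it can help to check that it is actually installed. The helper below is a hedged sketch: locale_available is a hypothetical name, and it assumes a glibc system where locale -a is present and prints normalized names such as en_US.utf8.

```shell
# Succeed if the named locale appears in the output of `locale -a`.
# glibc lists names with a normalized codeset (UTF-8 becomes utf8),
# so both sides are lowercased and stripped of dashes before comparing.
# Stripping every dash is a simplification made for this sketch.
locale_available() {
    wanted=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | tr -d '-' | grep -qxF "$wanted"
}

if locale_available en_US.UTF-8; then
    echo "en_US.UTF-8 is installed"
else
    echo "en_US.UTF-8 is missing; sort would fall back to byte comparison"
fi
```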



, , . , , ASCII.


: localedef.


, , ISO 14652 . reorder-after , . reorder-end. , .


iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .


LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

, LC_IDENTIFICATION ru_MY, , locale-archive.


localedef I18NPATH , :


$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .


:


$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
 ;
 ;
 ;
 ;

! !



, , — .


.


sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .


"C" , , . ( , ), . , join , , . :


$> sort -t \; -k 1 buhg.txt > buhg.srt
$> sort -t \; -k 1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
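The same pipeline can be tried end to end on throwaway data. This is a sketch under stated assumptions: the names and addresses are hypothetical ASCII placeholders, every tool is pinned to LC_ALL=C so that sort and join agree on the ordering, and the sort key is restricted to the first field with -k 1,1.

```shell
# Hypothetical sample data; ';' is the field separator, as in the article.
workdir=$(mktemp -d)
printf '%s\n' 'smith;accountant' 'jones;engineer' > "$workdir/buhg.txt"
printf '%s\n' 'smith;smith@example.com' 'jones;jones@example.com' > "$workdir/mail.txt"

# Sort both files on the first field only, with the same locale for every tool.
LC_ALL=C sort -t ';' -k 1,1 "$workdir/buhg.txt" > "$workdir/buhg.srt"
LC_ALL=C sort -t ';' -k 1,1 "$workdir/mail.txt" > "$workdir/mail.srt"

# Join on the first field; join must see the ordering that sort produced.
result=$(LC_ALL=C join -t ';' "$workdir/buhg.srt" "$workdir/mail.srt")
printf '%s\n' "$result"

rm -r "$workdir"
```

Note that -k 1 without an end position makes the key run from the first field to the end of the line, whereas join compares only the join field. Using -k 1,1 keeps the two tools looking at the same text.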

CP1251 . , Linux ru_RU.CP1251. , sort , .


, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .


$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison


, , , LC_COLLATE=C.


, .


, , ls -a , , , , Midnight Commander, , , , .



№10 Unicode collation algorithm


unicode.org


ICU — Unicode IBM.


ICU


ISO 14651


Description of the ISO 14652 weight file format


Discussion of glibc string comparison

