How Linux sorts strings

Introduction


It all started with a short script that was supposed to combine employees' email addresses, taken from the mailing list's user list, with employees' positions, taken from the HR department's database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.


Contents of mail.txt


 ;ia@example.com

Contents of buhg.txt


 ;
 ;
 ;
 ;

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:


$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted:  ;

Inspecting the sorted output by eye shows that the order is correct overall, but where a man's and a woman's name match, the woman's comes before the man's:


$> sort buhg.txt
 ;
 ;
 ;
 ;

This looks like either a bug in Unicode sorting or a manifestation of feminism in the sorting algorithm. The first, of course, is more plausible.


For now, let's set join aside and focus on sort. Let's try to solve the problem by the scientific poke method. First, change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but we won't bother with half measures:


$> LANG=ru_RU.UTF-8 sort buhg.txt
 ;
 ;
 ;
 ;

.


:


$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

.


unix sort treats '-' (dash) characters as invisible: "a-b", "aa", "ac" is sorted as "aa", "a-b", "ac".


: "C" . :


$> LANG=C sort buhg.txt
 ;
 ;
 ;
 ;
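
The dash rule quoted earlier is easy to check against plain byte order. A minimal sketch, assuming any POSIX shell with `sort`: under the C locale '-' (0x2D) is an ordinary byte that sorts before 'a' (0x61), so the three strings come out in a different order than the locale-aware one. The variable name `c_sorted` is only illustrative.

```shell
# Sort the three sample strings by raw byte values (C locale).
# '-' is 0x2D and 'a' is 0x61, so "a-b" is placed before "aa" here.
c_sorted=$(printf '%s\n' 'a-b' 'aa' 'ac' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

Running the same pipeline without `LC_ALL=C` shows whatever order the current locale's collation rules produce, which is the behaviour the quote describes.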

- . , - . :


$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result
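
The same workaround can be tried end to end on throwaway data. The sketch below is an illustration, not the original script: the records and the `workdir` and `joined` names are made up, `LC_ALL` is used instead of `LANG` because it overrides every other locale variable, and the sort key is restricted to the first `;`-separated field so that it matches the field `join` compares.

```shell
# Hypothetical sample data in "name;value" form.
workdir=$(mktemp -d)
printf '%s\n' 'smith;engineer' 'adams;accountant' > "$workdir/buhg.txt"
printf '%s\n' 'smith;smith@example.com' 'adams;adams@example.com' > "$workdir/mail.txt"

# Sort both inputs on the first field using byte-order collation.
LC_ALL=C sort -t ';' -k 1,1 "$workdir/buhg.txt" > "$workdir/buhg.srt"
LC_ALL=C sort -t ';' -k 1,1 "$workdir/mail.txt" > "$workdir/mail.srt"

# Join under the same collation so join agrees with sort about the order.
joined=$(LC_ALL=C join -t ';' "$workdir/buhg.srt" "$workdir/mail.srt")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

The point of the sketch is only that sort and join must run under the same collation rules; the C locale is one simple way to guarantee that.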

, . .


, - CP1251:


$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8 

, , "C", , , . -.


, , , . sort LC_COLLATE .


:


  • LANG=ru_RU.CP1251 LANG=C
  • sort join
  • ,


№ 10 Unicode collation algorithm unicode.org. , .


Collation - "" - . ("", "", ""), , .


- . , - , , . Ö P, CP850 ÿ Ü.


"" , , . UTF8, UTF16 KOI8-R ( ) , .


, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.


, . : , ? . (¿Te gusta la música?). , , ?


. , , , , . , ( ) . , , , , - .


, :


  • ;
  • , , (A + , Å);
  • , , (Ch ) (Æ );
  • (, /, , ) ();
  • , , ( {… } bash);
  • .

, , :


  • ( );
  • ;
  • , ;
  • ( x < y , xz < yz);
  • , , . , ;
  • , , . - , (. );
  • / .

, . , , (Beatles, The).


, ( ) .


, . , . ( ), -. IGNORED (0x0) , . , . .


, - "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .


DUCET :


  • , , ( ) ;
  • ;
  • ;
  • .

: ; , ; , , .


, . , , .


( ) № 10 - "Unicode Collation Algorithm" (UCA).


. .


UCA , , DUCET. . , , ( 1F000 ). - , - ,2,3… .


DUCET , , , - - "International Components for Unicode" (ICU).


, IBM, , . , , .


 ;
 ;
 ;
 ;

, ICU . Collation FAQ .


, sort Linux - .


glibc


sort GNU Core Utils , LC_COLLATE :


$> sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

strcoll, glibc.


wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .


wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).


, , , .


ISO 14651/14652


CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .


iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.


:


, \ , # . , :


escape_char /
comment_char %

<Uxxxx> <Uxxxxxxxx> ( x β€” ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .


LC_COLLATE , , .


. , , . collating-symbol ( ), , , .


900 . , .


LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> OSMANYA
  • collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
  • FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
  • <U0413> UCS-4
  • collating-element <U0413_0301> from "<U0413><U0301>" .

, . -, . "" , "". , . . , , , .


% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

, .


order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .


:


  • ,
  • , , , , .

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
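
Each entry in the listing above has the same shape: a character symbol, four weights separated by semicolons, and a `%` comment. As a rough illustration of that layout (not of how glibc parses it), the sketch below splits one entry copied from the listing using only POSIX shell parameter expansion. It assumes a single space between the symbol and the weights, as shown here, and all variable names are hypothetical.

```shell
# One entry copied from the listing above.
entry='<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A'

# Drop the trailing "% comment", then separate the symbol from its weights.
body=${entry%% %*}
symbol=${body%% *}
weights=${body#* }

# Split the weight list on ';' into the four comparison levels.
IFS=';' read -r level1 level2 level3 level4 <<EOF
$weights
EOF

printf 'symbol=%s\n' "$symbol"
printf 'level1=%s level2=%s level3=%s level4=%s\n' \
  "$level1" "$level2" "$level3" "$level4"
```

Comparing this entry with the one for `<U0430>` in the same listing shows that the two letters share the first two weights and differ only at the third level (`<CAP>` versus `<MIN>`).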

. :


<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

, ASCII ( ) . , , , . ( ) :






, ( <CAP> <MIN>), .


LC_COLLATE=C ,


static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

, , .



, , , CTT . localedef.


localedef ( -i), , ( -f). , .


Glibc : "" "".


, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .


/usr/lib/locale/locale-archive, , glibc. - , . ru_RU.KOI8-R, ru_RU.koi8r.


, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .


,


localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

/usr/share/i18n/locales/ru_RU /usr/share/i18n/charmaps/MAC-CYRILLIC.gz /usr/lib/locale/locale-archive ru_RU.maccyrillic


LANG=en_US.UTF-8 glibc :


/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

, .


locale -a.



, , . , , ASCII.


: localedef.


, , ISO 14652 . reorder-after , . reorder-end. , .


iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .


LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

, LC_IDENTIFICATION ru_MY, , locale-archive.


localedef I18NPATH , :


$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .


:


$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
 ;
 ;
 ;
 ;

! !



, , - .


.


sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .


"C" , , . ( , ), . , join , , . :


$> sort -t \; -k 1 buhg.txt > buhg.srt
$> sort -t \; -k 1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result

CP1251 . , Linux ru_RU.CP1251. , sort , .


, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .


$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison


, , , LC_COLLATE=C.


, .


, , ls -a , , , , Midnight Commander, , , , .



№10 Unicode collation algorithm


unicode.org


ICU - Unicode IBM.


ICU


ISO 14651


Description of the weight file format, ISO 14652


Discussion of string comparison in glibc

