How Linux Sorts Strings

Introduction


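It all started with a short script that was supposed to combine employees' e-mail addresses, taken from the mailing-list user list, with the employees' positions, taken from the HR department database. Both lists were exported to Unicode UTF-8 text files and saved with Unix line endings.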


Contents of mail.txt


 ;ia@example.com

Contents of buhg.txt


 ;
 ;
 ;
 ;

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:


$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted:  ;

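Inspecting the sorted output by eye shows that the order is correct in general, but where the male and female forms of a surname coincide, the female form comes before the male one: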


$> sort buhg.txt
 ;
 ;
 ;
 ;

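This looks like either a glitch in Unicode sorting or a manifestation of feminism in the sorting algorithm. The former is, of course, more plausible.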


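Let's set join aside for now and focus on sort, and try to solve the problem by scientific trial and error. First, change the locale from en_US to ru_RU. For sorting, setting the LC_COLLATE environment variable would be enough, but we won't bother with such fine distinctions: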


$> LANG=ru_RU.UTF-8 sort buhg.txt
 ;
 ;
 ;
 ;
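Which variable ends up governing the sort order can be sketched in plain shell. The helper `effective_collate` below is hypothetical (it is not part of any tool); it mirrors the POSIX precedence for the collation category (LC_ALL, then LC_COLLATE, then LANG, then the default "C") and only looks at names. It does not check that the named locale is actually installed.

```shell
# Hypothetical helper: print the locale name that governs collation.
# $1 = LC_ALL, $2 = LC_COLLATE, $3 = LANG; an empty string means "not set".
# POSIX precedence: LC_ALL, then LC_COLLATE, then LANG, then "C".
effective_collate() {
  lc_all=$1
  lc_collate=$2
  lang=$3
  printf '%s\n' "${lc_all:-${lc_collate:-${lang:-C}}}"
}

effective_collate "" "" "ru_RU.UTF-8"               # prints ru_RU.UTF-8
effective_collate "" "ru_RU.UTF-8" "en_US.UTF-8"    # prints ru_RU.UTF-8
effective_collate "C" "ru_RU.UTF-8" "ru_RU.UTF-8"   # prints C

# The same rule applied to the current environment:
effective_collate "${LC_ALL-}" "${LC_COLLATE-}" "${LANG-}"
```

So setting LANG alone only has an effect when neither LC_ALL nor LC_COLLATE is already set in the environment.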

.


:


$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

.


The explanation: unix sort treats '-' (dash) characters as invisible. As a result, "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
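The effect is easy to probe with the three strings from the quote. This is a minimal sketch: only the byte-wise "C" result is predictable, while the second result depends on the locale active in the environment and on the glibc version, so no expected output is given for it.

```shell
# Byte-wise ordering ("C" locale): '-' is 0x2D and 'a' is 0x61,
# so "a-b" sorts before "aa".
c_order=$(printf '%s\n' aa a-b ac | LC_ALL=C sort | paste -sd ' ' -)
echo "C locale:       $c_order"    # prints "C locale:       a-b aa ac"

# Same input under whatever locale the environment selects. Where the
# collation rules ignore punctuation at the first comparison levels,
# "a-b" lands between "aa" and "ac".
env_order=$(printf '%s\n' aa a-b ac | sort | paste -sd ' ' -)
echo "current locale: $env_order"
```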


: "C" . :


$> LANG=C sort buhg.txt
 ;
 ;
 ;
 ;

- . , - . :


$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result
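Whether a file is in the order a given collation expects can be checked up front with `sort -c`, which only tests the order and reports it through its exit status. The sketch below uses hypothetical sample data and `LC_ALL=C` instead of `LANG=C`, since LC_ALL also overrides any LC_COLLATE already set in the environment. Note that `sort -c` without a key compares whole lines, whereas join compares only the join field.

```shell
workdir=$(mktemp -d)
# Hypothetical sample data, deliberately out of order.
printf '%s\n' 'b;2' 'a;1' 'c;3' > "$workdir/in.txt"

# sort -c exits 0 if the input is already sorted under the active
# collation and non-zero (with a diagnostic on stderr) otherwise.
if LC_ALL=C sort -c "$workdir/in.txt" 2>/dev/null; then
  before=sorted
else
  before=unsorted
fi

LC_ALL=C sort "$workdir/in.txt" > "$workdir/in.srt"

if LC_ALL=C sort -c "$workdir/in.srt" 2>/dev/null; then
  after=sorted
else
  after=unsorted
fi

echo "before: $before, after: $after"   # prints "before: unsorted, after: sorted"
rm -r "$workdir"
```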

, . .


, — CP1251:


$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8 

, , "C", , , . -.


, , , . sort LC_COLLATE .


:


  • LANG=ru_RU.CP1251 LANG=C
  • sort join
  • ,


№ 10 Unicode collation algorithm unicode.org. , .


Collation — "" — . ("", "", ""), , .


— . , - , , . Ö P, CP850 ÿ Ü.


"" , , . UTF8, UTF16 KOI8-R ( ) , .


, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.


, . : , ? . (¿Te gusta la música?). , , ?


. , , , , . , ( ) . , , , , - .


, :


  • ;
  • , , (A + , Å);
  • , , (Ch ) (Æ );
  • (, /, , ) ();
  • , , ( {… } bash);
  • .

, , :


  • ( );
  • ;
  • , ;
  • ( x < y , xz < yz);
  • , , . , ;
  • , , . — , (. );
  • / .

, . , , (Beatles, The).


, ( ) .


, . , . ( ), -. IGNORED (0x0) , . , . .


, — "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .


DUCET :


  • , , ( ) ;
  • ;
  • ;
  • .

: ; , ; , , .


, . , , .


( ) № 10 — "Unicode Collation Algorithm" (UCA).


. .


UCA , , DUCET. . , , ( 1F000 ). — , — ,2,3… .


DUCET , , , - — "International Components for Unicode" (ICU).


, IBM, , . , , .


 ;
 ;
 ;
 ;

, ICU . Collation FAQ .


, sort Linux - .


glibc


sort GNU Core Utils , LC_COLLATE :


$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

strcoll, glibc.


wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .


wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).


, , , .


ISO 14651/14652


CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .


iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.


:


, \ , # . , :


escape_char /
comment_char %

<Uxxxx> <Uxxxxxxxx> ( x — ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .


LC_COLLATE , , .


. , , . collating-symbol ( ), , , .


900 . , .


LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

  • collating-symbol <OSMANYA> OSMANYA
  • collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
  • FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
  • <U0413> UCS-4
  • collating-element <U0413_0301> from "<U0413><U0301>" .

, . -, . "" , "". , . . , , , .


% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

, .


order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .


:


  • ,
  • , , , , .

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end

. :


<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

, ASCII ( ) . , , , . ( ) :






, ( <CAP> <MIN>), .


LC_COLLATE=C ,


static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

, , .



, , , CTT . localedef.


localedef ( -i), , ( -f). , .


Glibc : "" "".


, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .


/usr/lib/locale/locale-archive, , glibc. — , . ru_RU.KOI8-R, ru_RU.koi8r.


, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .


For example, the command


localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

compiles /usr/share/i18n/locales/ru_RU with the charmap /usr/share/i18n/charmaps/MAC-CYRILLIC.gz and stores the result in /usr/lib/locale/locale-archive under the name ru_RU.maccyrillic
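The renaming of the codeset part (KOI8-R to koi8r, MAC-CYRILLIC to maccyrillic) can be approximated in shell. The helper `normalize_locale_name` is hypothetical and only a sketch of the visible effect: it keeps letters and digits of the codeset and lower-cases them. It assumes names of the form language_TERRITORY.CODESET and does not handle modifiers such as `@euro`.

```shell
# Hypothetical helper approximating how the codeset part of a locale name
# is spelled in the archive: letters and digits only, in lower case.
# Names without a '.' are returned unchanged.
normalize_locale_name() {
  case $1 in
    *.*)
      lang=${1%%.*}        # part before the first dot, e.g. ru_RU
      codeset=${1#*.}      # part after the first dot, e.g. KOI8-R
      codeset=$(printf '%s' "$codeset" \
        | LC_ALL=C tr -cd '[:alnum:]' \
        | LC_ALL=C tr '[:upper:]' '[:lower:]')
      printf '%s.%s\n' "$lang" "$codeset"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

normalize_locale_name ru_RU.KOI8-R         # prints ru_RU.koi8r
normalize_locale_name ru_RU.MAC-CYRILLIC   # prints ru_RU.maccyrillic
normalize_locale_name en_US.UTF-8          # prints en_US.utf8
```

The normalized name is the spelling to look for in the output of `locale -a`.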


With LANG=en_US.UTF-8, glibc looks for the locale in the following places:


/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/enUTF-8/
/usr/lib/locale/en/

, .


The list of available locales is printed by locale -a.



, , . , , ASCII.


: localedef.


, , ISO 14652 . reorder-after , . reorder-end. , .


iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .


LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

, LC_IDENTIFICATION ru_MY, , locale-archive.


localedef I18NPATH , :


$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .


:


$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
 ;
 ;
 ;
 ;

! !



, , — .


.


sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .


"C" , , . ( , ), . , join , , . :


$> sort -t \; -k 1,1 buhg.txt > buhg.srt
$> sort -t \; -k 1,1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
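The key specification matters here: in `sort`, `-k 1` means "from field 1 to the end of the line", which is the same as comparing whole lines, while `-k 1,1` restricts the comparison to the first field, the one join looks at. The difference shows up even in the byte-wise "C" locale. The sketch below uses hypothetical sample data in which one key is a prefix of the other.

```shell
workdir=$(mktemp -d)
# Hypothetical sample data: the key "ab" is a prefix of the key "ab!".
printf '%s\n' 'ab;x' 'ab!;y' > "$workdir/in.txt"

# Whole-line comparison: after "ab" the bytes ';' (0x3B) and '!' (0x21)
# are compared, so the line with key "ab!" comes first.
whole=$(LC_ALL=C sort "$workdir/in.txt" | paste -sd ' ' -)

# Key limited to field 1: "ab" is a prefix of "ab!" and sorts first,
# which is the order join -t ';' expects for its join field.
keyed=$(LC_ALL=C sort -t ';' -k 1,1 "$workdir/in.txt" | paste -sd ' ' -)

echo "whole line: $whole"   # prints "whole line: ab!;y ab;x"
echo "field 1:    $keyed"   # prints "field 1:    ab;x ab!;y"
rm -r "$workdir"
```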

CP1251 . , Linux ru_RU.CP1251. , sort , .


, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .


$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison


, , , LC_COLLATE=C.


, .


, , ls -a , , , , Midnight Commander, , , , .



№10 Unicode collation algorithm


unicode.org


ICU — Unicode IBM.


ICU


ISO 14651


Description of the ISO 14652 file format with weights


Glibc string comparison discussion

