Introduction
It all started with a short script that combined employees' e-mail addresses, taken from the mailing-list user list, with their job titles from the HR department's database. Both lists were exported as UTF-8 text files with Unix line endings.
Contents of mail.txt
;ia@example.com
Contents of buhg.txt
;
;
;
;
To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:
$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted: ;
A visual inspection of the sorted output showed that the order is correct on the whole, but where the male and female forms of the same surname meet, the women come before the men:
$> sort buhg.txt
;
;
;
;
This looks like either a Unicode sorting bug or a manifestation of feminism in the sorting algorithm. The first is, of course, more plausible.
Let us put join aside for now and concentrate on sort. We will try to solve the problem by scientific poking. First, change the locale from en_US to ru_RU. For sorting it would be enough to set the LC_COLLATE environment variable, but let us not split hairs:
$> LANG=ru_RU.UTF-8 sort buhg.txt
;
;
;
;
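The remark that LC_COLLATE alone is enough for sorting can be checked with a small sketch. The sample strings are made up for illustration, and the sketch assumes LC_ALL is not set, since LC_ALL would override LC_COLLATE:

```shell
# LC_COLLATE overrides LANG for collation, so sort uses byte order here
# even though LANG names a different locale.
order=$(
  unset LC_ALL   # LC_ALL would take precedence over LC_COLLATE
  printf 'b\nB\na\n' | LANG=en_US.UTF-8 LC_COLLATE=C sort | tr '\n' ' '
)
printf '%s\n' "$order"
```

In byte order the uppercase B (0x42) precedes both lowercase letters, so B comes first.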
.
:
$> iconv -f UTF-8 -t KOI8-R buhg.txt \
| LANG=ru_RU.KOI8-R sort \
| iconv -f KOI8-R -t UTF8
.
unix sort treats '-' (dash) characters as invisible. That is, the strings "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
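For comparison, here is a minimal sketch of how the same three strings sort under plain byte comparison. It forces the C locale through LC_ALL; what a UTF-8 locale returns depends on the installed glibc version, so it is not shown:

```shell
# In the C locale the dash is an ordinary byte (0x2D) that sorts before 'a' (0x61).
sorted=$(printf 'aa\nac\na-b\n' | LC_ALL=C sort | tr '\n' ' ')
printf '%s\n' "$sorted"
```

Here "a-b" comes first, because the dash takes part in the comparison instead of being skipped.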
: "C" . :
$> LANG=C sort buhg.txt
;
;
;
;
- . , - . :
$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result
, . .
, — CP1251:
$> iconv -f UTF-8 -t CP1251 buhg.txt \
| LANG=ru_RU.CP1251 sort \
| iconv -f CP1251 -t UTF8
, , "C", , , . -.
, , , . sort LC_COLLATE .
:
- LANG=ru_RU.CP1251 LANG=C
- sort join
- ,
№ 10 Unicode collation algorithm unicode.org. , .
Collation — "" — . ("", "", ""), , .
— . , - , , . Ö P, CP850 ÿ Ü.
"" , , . UTF8, UTF16 KOI8-R ( ) , .
, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.
, . : , ? . (¿Te gusta la música?). , , ?
. , , , , . , ( ) . , , , , - .
, :
- ;
- , , (A + , Å);
- , , (Ch ) (Æ );
- (, /, , ) ();
- , , ( {… } bash);
- .
, , :
- ( );
- ;
- , ;
- ( x < y , xz < yz);
- , , . , ;
- , , . — , (. );
- / .
, . , , (Beatles, The).
, ( ) .
, . , . ( ), -. IGNORED (0x0) , . , . .
, — "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .
DUCET :
: ; , ; , , .
, . , , .
( ) № 10 — "Unicode Collation Algorithm" (UCA).
. .
UCA , , DUCET. . , , ( 1F000 ). — , — ,2,3… .
DUCET , , , - — "International Components for Unicode" (ICU).
, IBM, , . , , .
;
;
;
;
, ICU . Collation FAQ .
, sort Linux - .
glibc
sort GNU Core Utils , LC_COLLATE :
$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules
strcoll, glibc.
wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .
wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).
, , , .
ISO 14651/14652
CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .
iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.
:
, \ , # . , :
escape_char /
comment_char %
<Uxxxx> <Uxxxxxxxx> ( x — ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .
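To see which <Uxxxx> name corresponds to a given character, the UCS-4 value can be obtained with iconv. This is only a sketch: ucs4_name is a hypothetical helper, it handles a single character, and it assumes an iconv that knows the UTF-32BE encoding name (as glibc's does):

```shell
# Print the <Uxxxx> name for one character passed as UTF-8 bytes.
# ucs4_name is a made-up helper for illustration, not part of glibc.
ucs4_name() {
  # Convert to big-endian UTF-32 and dump the four bytes as hex digits.
  hex=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 | tr -d ' \n')
  printf '<U%04X>\n' "$((0x$hex))"
}

# \320\223 is the UTF-8 encoding of CYRILLIC CAPITAL LETTER GHE.
name=$(ucs4_name "$(printf '\320\223')")
printf '%s\n' "$name"
```

For this input the helper prints <U0413>, the same symbol that appears in the collating-element lines of the locale source.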
LC_COLLATE , , .
. , , . collating-symbol ( ), , , .
900 . , .
LC_COLLATE
collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"
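- collating-symbol <OSMANYA> declares a collating symbol named OSMANYA
- collating-symbol <S1D000>..<S1D35F> declares a range of symbols with the prefix S and a hexadecimal suffix from 1D000 to 1D35F.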
- FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
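- <U0413> refers to a character by its UCS-4 code point
- collating-element <U0413_0301> from "<U0413><U0301>" declares a new collating element for that sequence of code points.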
, . -, . "" , "". , . . , , , .
% Symbolic weight assignments
% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE
, .
order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .
:
order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end
. :
<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...
, ASCII ( ) . , , , . ( ) :
, ( <CAP> <MIN>), .
LC_COLLATE=C ,
static const uint32_t collseqwc[] =
{
8, 1, 8, 0x0, 0xff,
/* 1st-level table */
6 * sizeof (uint32_t),
/* 2nd-level table */
7 * sizeof (uint32_t),
/* 3rd-level table */
L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',
...
L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};
, , .
, , , CTT . localedef.
localedef ( -i), , ( -f). , .
Glibc : "" "".
, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .
/usr/lib/locale/locale-archive, , glibc. — , . ru_RU.KOI8-R, ru_RU.koi8r.
, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .
,
localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC
/usr/share/i18n/locales/ru_RU /usr/share/i18n/charmaps/MAC-CYRILLIC.gz /usr/lib/locale/locale-archive ru_RU.maccyrillic
LANG=en_US.UTF-8 glibc :
/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/enUTF-8/
/usr/lib/locale/en/
, .
locale -a.
, , . , , ASCII.
: localedef.
, , ISO 14652 . reorder-after , . reorder-end. , .
iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE
, LC_IDENTIFICATION ru_MY, , locale-archive.
localedef I18NPATH , :
$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8
POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .
:
$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
;
;
;
;
! !
, , — .
.
sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .
"C" , , . ( , ), . , join , , . :
$> sort -t \; -k 1,1 buhg.txt > buhg.srt
$> sort -t \; -k 1,1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
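The same pipeline can be tried end to end on throwaway data. In this sketch the names, addresses and temporary directory are invented for illustration. It pins the locale with LC_ALL=C for both tools and uses -k 1,1, which restricts the sort key to the first field, whereas a bare -k 1 would extend the key to the end of the line:

```shell
# Hypothetical sample data; the surnames and addresses are made up.
workdir=$(mktemp -d)
printf 'smith;accountant\nadams;director\n' > "$workdir/buhg.txt"
printf 'smith;smith@example.com\nadams;adams@example.com\n' > "$workdir/mail.txt"

# Sort both files on the first ';'-separated field only, in one fixed locale.
LC_ALL=C sort -t ';' -k 1,1 "$workdir/buhg.txt" > "$workdir/buhg.srt"
LC_ALL=C sort -t ';' -k 1,1 "$workdir/mail.txt" > "$workdir/mail.srt"

# join needs the same field separator and the same locale as sort.
result=$(LC_ALL=C join -t ';' "$workdir/buhg.srt" "$workdir/mail.srt")
printf '%s\n' "$result"
rm -r "$workdir"
```

Because sort and join agree on both the key and the collation rules, join accepts the input and emits one merged line per surname.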
CP1251 . , Linux ru_RU.CP1251. , sort , .
, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .
$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8
$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison
, , , LC_COLLATE=C.
, .
, , ls -a , , , , Midnight Commander, , , , .
№10 Unicode collation algorithm
unicode.org
ICU — Unicode IBM.
ICU
ISO 14651
Description of the ISO 14652 weight file format
Discussion of string comparison in glibc