Introduction

It all started with a short script that was supposed to combine employees' e-mail addresses, taken from the list of mailing-list users, with their job titles, taken from the personnel department database. Both lists were exported to UTF-8 text files and saved with Unix line endings.

Mail.txt content

 ;ia@example.com

Buhg.txt content

 ;
 ;
 ;
 ;

To merge them, the files were sorted with the Unix sort command and fed to the Unix join program, which unexpectedly failed with an error:

$> sort buhg.txt > buhg.srt
$> sort mail.txt > mail.srt
$> join buhg.srt mail.srt > result
join: buhg.srt:4: is not sorted:  ;
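The complaint from join can be reproduced without join itself: `sort -c` checks whether a file is already ordered under the current locale and exits with a non-zero status if it is not. The sketch below is only an illustration. It assumes GNU sort, uses hypothetical sample data and file names, and pins the C locale so that the result does not depend on which locales are installed.

```shell
# Hypothetical sample data; the file names are for illustration only.
tmpdir=$(mktemp -d)
printf 'b;2\na;1\n' > "$tmpdir/unsorted.txt"

# sort -c exits with a non-zero status at the first out-of-order line.
if LC_ALL=C sort -c "$tmpdir/unsorted.txt" 2>/dev/null; then
    check_before=sorted
else
    check_before=unsorted
fi

# Sort the file and run the same check again.
LC_ALL=C sort "$tmpdir/unsorted.txt" > "$tmpdir/sorted.txt"
if LC_ALL=C sort -c "$tmpdir/sorted.txt"; then
    check_after=sorted
else
    check_after=unsorted
fi

echo "before: $check_before, after: $check_after"
rm -rf "$tmpdir"
```

Running the check under the same locale settings that join will later see is the important part; a file that passes under one locale may fail under another.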

Inspecting the sorted output by eye showed that the order is correct in general, but where the male and female forms of a surname coincide, the women come before the men:

$> sort buhg.txt
 ;
 ;
 ;
 ;

This looks like either a Unicode sorting glitch or a manifestation of feminism in the sorting algorithm. The former is, of course, more plausible.

For now, let's put join aside and focus on sort. We will try to solve the problem by scientific trial and error. First, change the locale from en_US to ru_RU. For sorting, it would be enough to set the LC_COLLATE environment variable, but we will not bother with such fine distinctions:

$> LANG=ru_RU.UTF-8 sort buhg.txt
 ;
 ;
 ;
 ;

$> iconv -f UTF-8 -t KOI8-R buhg.txt \
 | LANG=ru_RU.KOI8-R sort \
 | iconv -f KOI8-R -t UTF8

Unix sort treats '-' (dash) characters as invisible. For example, "a-b", "aa", "ac" are sorted as "aa", "a-b", "ac".
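The three strings from the quote make a convenient test case. The following is a minimal sketch, assuming GNU sort. The en_US.UTF-8 locale may not be installed on a given system, and the ordering it produces depends on the glibc version, so only the C-locale result is predictable here.

```shell
# The three sample strings from the quote above.
words='a-b
aa
ac'

# Byte-wise comparison: '-' (0x2D) sorts before 'a' (0x61).
c_order=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf 'C locale:\n%s\n' "$c_order"

# Locale-aware comparison; the result depends on the installed glibc
# and on whether this locale exists on the system at all.
utf8_order=$(printf '%s\n' "$words" | LC_ALL=en_US.UTF-8 sort 2>/dev/null)
printf 'en_US.UTF-8:\n%s\n' "$utf8_order"
```

If the two listings differ, the difference comes from the collation rules of the locale rather than from sort itself.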

: "C" . :

$> LANG=C sort buhg.txt
 ;
 ;
 ;
 ;

- . , - . :

$> LANG=C sort buhg.txt > buhg.srt
$> LANG=C sort mail.txt > mail.srt
$> LANG=C join buhg.srt mail.srt > result

, . .

, — CP1251:

$> iconv -f UTF-8 -t CP1251 buhg.txt \
 | LANG=ru_RU.CP1251 sort \
 | iconv -f CP1251 -t UTF8

, , "C", , , . -.

, , , . sort LC_COLLATE .

LANG=ru_RU.CP1251 LANG=C
sort join
,

№ 10 Unicode collation algorithm unicode.org. , .

Collation — "" — . ("", "", ""), , .

— . , - , , . Ö P, CP850 ÿ Ü.

"" , , . UTF8, UTF16 KOI8-R ( ) , .

, , . , , . , Æ AE. Æ , Z. , Æ , . Ch, H I.

, . : , ? . (¿Te gusta la música?). , , ?

. , , , , . , ( ) . , , , , - .

, :

;
, , (A + , Å);
, , (Ch ) (Æ );
(, /, , ) ();
, , ( {… } bash);
.

, , :

( );
;
, ;
( x < y , xz < yz);
, , . , ;
, , . — , (. );
/ .

, . , , (Beatles, The).

, ( ) .

, . , . ( ), -. IGNORED (0x0) , . , . .

, — "Default Unicode Collation Element Table" (DUCET). , LC_COLLATE .

DUCET :

, , ( ) ;
;
;
.

: ; , ; , , .

, . , , .

( ) № 10 — "Unicode Collation Algorithm" (UCA).

. .

UCA , , DUCET. . , , ( 1F000 ). — , — ,2,3… .

DUCET , , , - — "International Components for Unicode" (ICU).

, IBM, , . , , .

 ;
 ;
 ;
 ;

, ICU . Collation FAQ .

, sort Linux - .

glibc

sort GNU Core Utils , LC_COLLATE :

$ sort --debug buhg.txt > buhg.srt
sort: using ‘en_US.UTF8’ sorting rules

strcoll, glibc.

wiki glibc . , glibc UCA (The Unicode collation algorithm) / ISO 14651 (International string ordering and comparison). , standards.iso.org ISO 14651 , . , , - PDF. , UCA, , .

wiki glibc. , glibc ISO The Common Template Table (CTT), A ISO 14651. 2000 2015 glibc ( ) . 2015 2018 (CentOS 8), (CentOS 7).

, , , .

ISO 14651/14652

CTT Linux /usr/share/i18n/locales/. iso14651_t1_common. copy iso14651_t1_common iso14651_t1, , , , en_US ru_RU. Linux , , .

iso14651_t1 , , , . ISO 14652, open-std.org. POSIX OpenGroup. collate_read glibc/locale/programs/ld-collate.c.

, \ , # . , :

escape_char /
comment_char %

<Uxxxx> <Uxxxxxxxx> ( x — ). UCS-4 (UTF-32). ( <Uxxxx_xxxx>, <2> ), , .

LC_COLLATE , , .

. , , . collating-symbol ( ), , , .

900 . , .

LC_COLLATE

collating-symbol <RES-1>
collating-symbol <BLK>
collating-symbol <MIN>
collating-symbol <WIDE>
...
collating-symbol <ARABIC>
collating-symbol <ETHPC>
collating-symbol <OSMANYA>
...
collating-symbol <S1D000>..<S1D35F>
collating-symbol <SFFFF> % Guaranteed largest symbol value. Keep at end of this list
...
collating-element <U0413_0301> from "<U0413><U0301>"
collating-element <U0413_0341> from "<U0413><U0341>"

collating-symbol <OSMANYA> OSMANYA
collating-symbol <S1D000>..<S1D35F> , S 1D000 1D35F.
FFFF collating-symbol <SFFFF> , <SFFFF> , <VERYBIGVAL>
<U0413> UCS-4
collating-element <U0413_0301> from "<U0413><U0301>" .

, . -, . "" , "". , . . , , , .

% Symbolic weight assignments

% Third-level weight assignments
<RES-1>
<BLK>
<MIN>
<WIDE>
...
% Second-level weight assignments
<BASE>
<LOWLINE> % COMBINING LOW LINE
<PSILI> % COMBINING COMMA ABOVE
<DASIA> % COMBINING REVERSED COMMA ABOVE
...
% First-level weight assignments
<S0009> % HORIZONTAL TABULATION 
<S000A> % LINE FEED
<S000B> % VERTICAL TABULATION
...
<S0434> % CYRILLIC SMALL LETTER DE
<S0501> % CYRILLIC SMALL LETTER KOMI DE
<S0452> % CYRILLIC SMALL LETTER DJE
<S0503> % CYRILLIC SMALL LETTER KOMI DJE
<S0453> % CYRILLIC SMALL LETTER GJE
<S0499> % CYRILLIC SMALL LETTER ZE WITH DESCENDER
<S0435> % CYRILLIC SMALL LETTER IE
<S04D7> % CYRILLIC SMALL LETTER IE WITH BREVE
<S0454> % CYRILLIC SMALL LETTER UKRAINIAN IE
<S0436> % CYRILLIC SMALL LETTER ZHE

, .

order_start order_end. order_start , . forward. , . , , . , . , ( ). , . IGNORE , .

,
, , , , .

order_start forward;forward;forward;forward,position
<U0000> IGNORE;IGNORE;IGNORE;IGNORE % NULL (in 6429)
<U0001> IGNORE;IGNORE;IGNORE;IGNORE % START OF HEADING (in 6429)
<U0002> IGNORE;IGNORE;IGNORE;IGNORE % START OF TEXT (in 6429)
...
<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<UFF13> <S0033>;<BASE>;<WIDE>;<UFF13> % FULLWIDTH DIGIT THREE
<U2476> <S0033>;<BASE>;<COMPAT>;<U2476> % PARENTHESIZED DIGIT THREE
<U248A> <S0033>;<BASE>;<COMPAT>;<U248A> % DIGIT THREE FULL STOP
<U1D7D1> <S0033>;<BASE>;<FONT>;<U1D7D1> % MATHEMATICAL BOLD DIGIT THREE
...
<U0430> <S0430>;<BASE>;<MIN>;<U0430> % CYRILLIC SMALL LETTER A
<U0410> <S0430>;<BASE>;<CAP>;<U0410> % CYRILLIC CAPITAL LETTER A
<U04D1> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
<U0430_0306> <S04D1>;<BASE>;<MIN>;<U04D1> % CYRILLIC SMALL LETTER A WITH BREVE
...
<U0431> <S0431>;<BASE>;<MIN>;<U0431> % CYRILLIC SMALL LETTER BE
<U0411> <S0431>;<BASE>;<CAP>;<U0411> % CYRILLIC CAPITAL LETTER BE
<U0432> <S0432>;<BASE>;<MIN>;<U0432> % CYRILLIC SMALL LETTER VE
<U0412> <S0432>;<BASE>;<CAP>;<U0412> % CYRILLIC CAPITAL LETTER VE
...
order_end

. :

<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
<U0021> IGNORE;IGNORE;IGNORE;<U0021> % EXCLAMATION MARK
<U0022> IGNORE;IGNORE;IGNORE;<U0022> % QUOTATION MARK
...

, ASCII ( ) . , , , . ( ) :

, ( <CAP> <MIN>), .

LC_COLLATE=C ,

static const uint32_t collseqwc[] =
{
  8, 1, 8, 0x0, 0xff,
  /* 1st-level table */
  6 * sizeof (uint32_t),
  /* 2nd-level table */
  7 * sizeof (uint32_t),
  /* 3rd-level table */
  L'\x00', L'\x01', L'\x02', L'\x03', L'\x04', L'\x05', L'\x06', L'\x07',
  L'\x08', L'\x09', L'\x0a', L'\x0b', L'\x0c', L'\x0d', L'\x0e', L'\x0f',

...
  L'\xf8', L'\xf9', L'\xfa', L'\xfb', L'\xfc', L'\xfd', L'\xfe', L'\xff'
};

, , .

, , , CTT . localedef.

localedef ( -i), , ( -f). , .

Glibc : "" "".

, , /usr/lib/locale/. LC_COLLATE, LC_CTYPE, LC_TIME .. LC_IDENTIFICATION ( ) .

/usr/lib/locale/locale-archive, , glibc. — , . ru_RU.KOI8-R, ru_RU.koi8r.

, /usr/share/i18n/locales/ /usr/share/i18n/charmaps/ CTT .

localedef -i ru_RU -f MAC-CYRILLIC ru_RU.MAC-CYRILLIC

/usr/share/i18n/locales/ru_RU /usr/share/i18n/charmaps/MAC-CYRILLIC.gz /usr/lib/locale/locale-archive ru_RU.maccyrillic
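Judging by the names above (ru_RU.KOI8-R becoming ru_RU.koi8r, MAC-CYRILLIC becoming maccyrillic), the codeset part of a locale name is stored in a normalized form: lowercased, with non-alphanumeric characters removed. The helper below is a hypothetical illustration of that mapping, not glibc's own code; the function name `normalize_locale_name` is made up for this sketch, and it deliberately ignores special cases that glibc may handle, such as codesets consisting only of digits.

```shell
# Hypothetical helper: mimic how the codeset part of a locale name
# appears to be normalized (lowercase, alphanumerics only).
normalize_locale_name() {
    name=$1
    modifier=
    # Split off an optional @modifier so it is left untouched.
    case $name in
        *@*) modifier=@${name#*@}; name=${name%%@*} ;;
    esac
    case $name in
        *.*)
            lang=${name%%.*}
            codeset=${name#*.}
            codeset=$(printf '%s' "$codeset" \
                | LC_ALL=C tr -cd '[:alnum:]' \
                | LC_ALL=C tr '[:upper:]' '[:lower:]')
            printf '%s.%s%s\n' "$lang" "$codeset" "$modifier"
            ;;
        *)
            # No codeset part: nothing to normalize.
            printf '%s%s\n' "$name" "$modifier"
            ;;
    esac
}

normalize_locale_name ru_RU.KOI8-R
normalize_locale_name ru_RU.MAC-CYRILLIC
normalize_locale_name en_US.UTF-8
```

This is handy when searching the output of `locale -a` for a locale whose name was written with the conventional codeset spelling.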

LANG=en_US.UTF-8 glibc :

/usr/lib/locale/locale-archive
/usr/lib/locale/en_US.UTF-8/
/usr/lib/locale/en_US/
/usr/lib/locale/en.UTF-8/
/usr/lib/locale/en/

, .

locale -a.

, , . , , ASCII.

: localedef.

, , ISO 14652 . reorder-after , . reorder-end. , .

iso14651_t1_common ru_RU glibc ~/.local/share/i18n/locales/ LC_COLLATE ru_RU. glibc. , , .

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
reorder-after <U000D>
<U0020> <S0020>;<BASE>;<MIN>;<U0020> % SPACE
<U0021> <S0021>;<BASE>;<MIN>;<U0021> % EXCLAMATION MARK
<U0022> <S0022>;<BASE>;<MIN>;<U0022> % QUOTATION MARK
...
<U007D> <S007D>;<BASE>;<MIN>;<U007D> % RIGHT CURLY BRACKET
<U007E> <S007E>;<BASE>;<MIN>;<U007E> % TILDE
reorder-end
END LC_COLLATE

, LC_IDENTIFICATION ru_MY, , locale-archive.

localedef I18NPATH , :

$> I18NPATH=~/.local/share/i18n localedef -i ru_RU -f UTF-8 ~/.local/lib/locale/ru_MY.UTF-8

POSIX , LANG , , glibc Linux , LOCPATH. LOCPATH=~/.local/lib/locale/ , . LOCPATH .

$> LANG=ru_MY.UTF-8 LOCPATH=~/.local/lib/locale/ sort buhg.txt
 ;
 ;
 ;
 ;

! !

, , — .

sort join glibc. , join , sort en_US.UTF-8? : sort , join , , . .

"C" , , . ( , ), . , join , , . :

$> sort -t \; -k 1,1 buhg.txt > buhg.srt
$> sort -t \; -k 1,1 mail.txt > mail.srt
$> join -t \; buhg.srt mail.srt > result
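A self-contained variant of this pipeline is sketched below with hypothetical ASCII data, so that it can be run anywhere. The key is written as `-k 1,1`, which restricts the comparison to the first field; in GNU sort, a key given only as `-k 1` extends to the end of the line. The C locale is pinned purely to keep the result reproducible, and the sample values are chosen so that sorting whole lines and sorting by the first field give different orders.

```shell
# Hypothetical sample data; names and file names are illustrative only.
tmpdir=$(mktemp -d)
printf 'ab-c;accountant\nab;director\n' > "$tmpdir/staff.txt"
printf 'ab;ab@example.com\nab-c;abc@example.com\n' > "$tmpdir/mail.txt"

# Limit the sort key to field 1 only (-k 1,1); a bare -k 1 would extend
# the key to the end of the line.
LC_ALL=C sort -t ';' -k 1,1 "$tmpdir/staff.txt" > "$tmpdir/staff.srt"
LC_ALL=C sort -t ';' -k 1,1 "$tmpdir/mail.txt"  > "$tmpdir/mail.srt"

# join compares the same field, with the same separator and locale.
joined=$(LC_ALL=C join -t ';' "$tmpdir/staff.srt" "$tmpdir/mail.srt")
printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

The point of the design is that sort and join must agree on three things: the field separator, the extent of the key, and the collation locale.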

CP1251 . , Linux ru_RU.CP1251. , sort , .

, . LOCPATH=/tmp locale -a locale-archive, LOCPATH ( locale) .

$> LOCPATH=/tmp locale -a | grep en_US
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

$> LC_COLLATE=en_US.UTF-8 sort --debug
sort: using ‘en_US.UTF-8’ sorting rules

$> LOCPATH=/tmp LC_COLLATE=en_US.UTF-8 sort --debug
sort: using simple byte comparison

, , , LC_COLLATE=C.

, .

, , ls -a , , , , Midnight Commander, , , , .

Unicode Technical Standard #10: Unicode Collation Algorithm

Description of file format with weights ISO 14652

Glibc string comparison discussion

How Linux sort sorts strings
