Another article about wc

Good day to all.


Recently an article appeared on Habr Winning With twenty lines of Haskell: we write our wc from @ 0xd34df00d . The author, known for his sympathy for functional programming, implemented an analogue of the wc utility on Haskell and subjected it to optimization, resulting in an option that works more than 7 times faster than the standard Unix utility.


Being a conscientious benchmarker, 0xd34df00dAt the end of the article, he indicated that his version works with a greatly simplified TK compared to the original (for example, Unicode was ignored), and therefore its superiority in performance does not mean anything in fact. But it is clear that in spite of this, the format of the article generated a lot of controversy in the comments (and I often don’t read further than the headline myself, which is a sin to hide). Under articleyleoHe cited the results of testing a trivial solution written by him in C (under a simplified condition), showing that even a naive C-implementation bypasses the optimized Haskell-version. After that, the gentlemen began to throw themselves at each otherkacomments, in which they found fault with small differences in the possibilities of their options (this is exclusively my personal vision of the situation).


After some time, an article was published comrade chapuza, which introduced the implementation of simplified wc on Elixir, which also works faster than the standard one. The discussion continued in the comments from the first post, although in my opinion the former fuse was lost.


In this article we will not try to defeat one language / programming paradigm with another / another. I think an unbiased person understands that C is much more flexible in terms of manual optimization than Haskell and most other programming languages, and therefore C code, although it may be slower than Haskell, can also be much faster - it all depends on who writes this code. On the other hand, I personally see a lot of possibilities of the functional paradigm that have not yet been discovered, and I do not exclude the possibility that someday the functional code will be optimized so cool that writing such an analog in C would be too expensive. But even so, this time has not come yet.


, , ( ) , C-. . , , , 0xd34df00d "simd-".



, , .


#include <stdint.h>
#include <stdio.h>
#include <malloc.h>
#include <Windows.h>
#include <intrin.h>

using namespace std;

struct stats
{
    uint32_t char_count;
    uint32_t line_count;
    uint32_t word_count;
};

void word_count( unsigned char* const str, uint32_t size, stats& result );

int main( )
{
    FILE* f;
    fopen_s( &f, "input.txt", "rb" );
    fseek( f, 0, SEEK_END );
    long size = ftell( f );
    rewind( f );

    long alloc_size = size + 63 - ( size + 63 ) % 64;
    unsigned char* const data = (unsigned char*)_aligned_malloc( alloc_size, 64 );
    fread( data, 1, size, f );
    fclose( f );

    LARGE_INTEGER fr, t0, t1;
    QueryPerformanceFrequency( &fr );
    QueryPerformanceCounter( &t0 );

    stats result;
    word_count2( data, size, result );

    QueryPerformanceCounter( &t1 );

    _aligned_free( data );

    double const time = (double)( t1.QuadPart - t0.QuadPart ) / (double)fr.QuadPart;
    printf(
        "Words: %d; lines: %d; chars: %d\nElapsed time: %lfms\n",
        result.word_count, result.line_count, result.char_count, time * 1000.0
    );
    system( "pause" );
}

β€” , Windows Visual Studio. , Linux' , Linux- . , β€” VS.


, . , . 64 64 β€” SIMD-, . , . ( , , , .) , .


, Unicode, 32. , . , , .


Ryzen 5 3600 c 2x8 Gb 3200 MHz. , , .



:


void word_count( unsigned char* const str, uint32_t size, stats& result )
{
    uint32_t lines = 1;
    uint32_t words = 0;

    bool was_space = false;

    for ( uint32_t i = 0; i < size; ++i )
    {
        unsigned char const c = str[i];
        bool const is_space = c <= ' ';
        lines += ( c == '\n' ) ? 1 : 0;
        words += ( is_space && !was_space ) ? 1 : 0;
        was_space = is_space;
    }

    words += !was_space ? 1 : 0;

    result.char_count = size;
    result.line_count = lines;
    result.word_count = words;
}

0xd34df00d. :


Words: 44774640; lines: 15000010; chars: 1871822228
Elapsed time: 1825.704800ms

(SIMD- )


, ? , (, , ). 16 SSE-?


SSE:


void word_count_sse( unsigned char* const str, uint32_t size, stats& result )
{
    uint32_t lines = 1;
    uint32_t words = 0;

    bool was_space = false;

    for ( uint32_t i = 0; i < size; i += 16 )
    {
        __m128i const c = _mm_load_si128( (__m128i*)( str + i ) );
        uint32_t is_newline_mask = _mm_movemask_epi8(
            _mm_cmpeq_epi8( c, _mm_set1_epi8( '\n' ) )
        );

        uint32_t is_space_mask = _mm_movemask_epi8(
            _mm_andnot_si128( c, _mm_cmplt_epi8( c, _mm_set1_epi8( ' ' + 1 ) ) )
        );
        uint32_t was_space_mask = ( was_space ? 1 : 0 ) | ( is_space_mask << 1 );

        lines += __popcnt( is_newline_mask );
        words += __popcnt( is_space_mask & ~was_space_mask );
        was_space = is_space_mask & 0x8000;
    }

    words += !was_space ? 1 : 0;

    result.char_count = size;
    result.line_count = lines;
    result.word_count = words;
}

. was_space, . , 16 , 15- , .


_mm_load_si128 __m128i ( 128- XMM-).


. β€” 16- , 128- . SSE- XMM-, 16- _mm_movemask_epi8. β€” 128- , 16 , 16- , XMM-.


_mm_cmpeq_epi8, 128- , 0, , 255 . 128- , ( _mm_set1_epi8).


. , / SSE . _mm_cmplt_epi8 β€” , _mm_cmpeq_epi8. , , 128 , . _mm_andnot_si128. 128- "" . _mm_cmplt_epi8 . 128 , β€” , movemask'.


. was_space_mask, is_space_mask was_space, , __popcnt , . was_space , .


SIMD- x86 .


:


Words: 44774640; lines: 15000010; chars: 1871822228
Elapsed time: 165.704300ms

11 ! .


SIMD-


, , SSE AVX . 32 AVX/AVX2:


void word_count_avx( unsigned char* const str, uint32_t size, stats& result )
{
    uint32_t lines = 1;
    uint32_t words = 0;

    bool was_space = false;

    for ( uint32_t i = 0; i < size; i += 32 )
    {
        _mm_prefetch( (char*)str + i + 32 * 64, _MM_HINT_NTA );

        __m256i const c = _mm256_load_si256( (__m256i*)( str + i ) );
        uint32_t is_newline_mask = _mm256_movemask_epi8(
            _mm256_cmpeq_epi8( c, _mm256_set1_epi8( '\n' ) )
        );

        uint32_t is_space_mask = _mm256_movemask_epi8(
            _mm256_andnot_si256( c, _mm256_sub_epi8( c, _mm256_set1_epi8( ' ' + 1 ) ) )
        );
        uint32_t was_space_mask = ( was_space ? 1 : 0 ) | ( is_space_mask << 1 );

        lines += __popcnt( is_newline_mask );
        words += __popcnt( is_space_mask & ~was_space_mask );
        was_space = is_space_mask & 0x80000000;
    }

    words += !was_space ? 1 : 0;

    result.char_count = size;
    result.line_count = lines;
    result.word_count = words;
}

, SSE- . AVX/AVX2, , . 16, 32, 32-.


:


Words: 44774640; lines: 15000010; chars: 1871822228
Elapsed time: 109.606000ms

.


Prefetch


, , , , , _mm_prefetch. , , (, , ). AVX-:


void word_count2( unsigned char* const str, uint32_t size, stats& result )
{
    // ..............

    for ( uint32_t i = 0; i < size; i += 32 )
    {
        _mm_prefetch( (char*)str + i + 32 * 64, _MM_HINT_NTA );

        // ..............
    }

    // ..............
}

Words: 44774640; lines: 15000010; chars: 1871822228
Elapsed time: 73.564900ms

, 64 . , ; .


, RAM. β€” , 70ms. ( ).



, , . , , , 1.8 Gb 10 .


In any case, I hope the article was interesting, and someone could learn something new for themselves.


Happy Holivars!


All Articles