Hello, Habr.
The other day Siemarglinvited me to translate a curious article about the victory over Unix wcwith the help of Haskell. Of course, I will not translate it, and for several reasons:
- the author squeezed out of the single-threaded version by no means everything, and the single-threaded version was much slower wc,
- in that article, to win, it was necessary to use multithreading (which in itself is a bit of cheating and victory rather over common sense rather than over wc),
- for this, the author had to delve into trichomonads and monoids - not, this is an excellent illustration of the charms of monoidal thinking, but IMHO a little too much for such a task, especially since because of this
- the code turned out to be too voluminous,
- and indeed, to compete with wc, which has a bunch of options and features, realizing it is a very toy analogue, in general itβs somehow strange and even a little stupid.
Nevertheless, doing strange things is a good thing, so today we will try to fix the first of the points above and improve the result of Chris (the name of the author of the original article).
Again, as we found out last time, I canβt write C code, so I wonβt write it either, and as a competitor to the Haskell implementation, I (like Chris) have a synergistic one wcfrom GNU Coreutils. Those dudes certainly know how to write in C, this code is not one decade old, and they took care of performance, judging by such pieces:
Spoiler: we will overtake the system by wcabout an order of magnitude without any problems, having received a completely idiomatic code and spending less than half an hour on changing and optimizing the original code.
Experiment Conditions
So, first we discuss the experimental setup.
Data
I downloaded this file and stuck it with me, so the total size is about 1.8 gigabytes:
% for i in `seq 1 10`; cat part.txt >> test.txt
% du -sh test.txt
1.8G    test.txt
, test.txt tmpfs-, , IO.
Gentoo Linux Core i7 4770 32 .
- ghc 8.8.2.
wc coreutils 8.31, gcc 9.2 -O2 -march=native:
% wc --version
wc (GNU coreutils) 8.31
Packaged by Gentoo (8.31-r1 (p0))
, -march=native β C, - - , : x86_64-, wc coreutils ( SIMD-, ).
, -O3 -O2 β , -O2.
, time. time , , ,
- 0.3 ,
- , , .
, ( ) , .
? wc!
Unicode ( ), C-, wc LANG=C LC_ALL=C wc /path/to/test.txt. ? 10.4 . .
, (ru_RU.UTF-8) wc (7.20 ), , , , C- β , .
, ( ), 0.2 β , IO-bound-.
Haskell
, ?
, cs bs ( , ):
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char
wc :: BS.ByteString -> (Int, Int, Int)
wc s =
    let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
     in (bs, ws, ls)
  where
    go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
    go (!bs, !ws, !ls, !wasSpace) c =
        let addLine | c == '\n' = 1
                    | otherwise = 0
            addWord | wasSpace = 0
                    | isSpace c = 1
                    | otherwise = 0
         in (bs + 1, ws + addWord, ls + addLine, isSpace c)
, , ( ).
, , 9 wc. : 31.2 , 306% wc. 9 , , .
, ( , ), , , , .
?
-, , β . :
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char
data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s
    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
bang-, - {-# LANGUAGE Strict #-}. , ?
, , β 4 ! 7.56 , 75% wc, ! ?
, : , , . -
data State = State
  { bs :: {-# UNPACK #-} !Int
  , ws :: {-# UNPACK #-} !Int
  , ls :: {-# UNPACK #-} !Int
  , wasSpace :: !Bool
  }
Int- . , nursery area, , , β , .
CSE
, isSpace c , β . , , go :
    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp = isSpace c
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSp = 1
                | otherwise = 0
? β 2.93 , 28% wc.
? ghc , , ? β isSpace ( ) , common subexpression elimination.
β ( ). , wc , . - , :
- LLVM- ( -fllvm) β , .
- ( -O2) β-O,-O2, ?
- ( {-# INLINE wc #-}). , - , , , - . , , , .
, , , .
:
. -, . , -O2 , , LLVM- , ( ).
, . , C (19%!). , .
.
, State, .
, Bool , . , Int, 1 True, 0 β False? , , !
data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s
    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | isSpace c = 1
             | otherwise = 0
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace == 1 = 0
                | isSp == 1 = 1
                | otherwise = 0
(, Bool), . , , addWord guard' ( if) :
        addWord = (1 - wasSpace) * isSp
.
? , , 1.91 , 18% wc. , .
, ?
ASCII, isSpace. , . , , , .
, Data.ByteString.Lazy Data.ByteString.Lazy.Char8, Char.
:
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
module Data.WordCount where
import qualified Data.ByteString.Lazy as BS
data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s
    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}
, 1.45 , 14% wc.
, β , , .
, , . , : wasSpace , , :
wc s = (bs, ws + 1 - wasSpace, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 1) s
    ...
, .
:
0.35 , ? - - ?
, , wc. :
main :: IO ()
main = do
  [path] <- getArgs
  contents <- BS.readFile path
  print $ wc contents
readFile . mmap. bytestring-mmap, ByteString' mmap- , main
main :: IO ()
main = do
  [path] <- getArgs
  contents <- unsafeMMapFile path
  print $ wc contents
, unsafeMMapFile ByteString, wc, . , unsafeMMapFile, , mmap- , .
β , 0.04-0.06 0.35 . , .
mmap , : , , . , : , .
?
, , Unix-, . , , ST .
, , : , wc, , Β« Β» wc , , . , .
Well, so as not to torment, we will discuss plans for the future: in the next article we will deal with something substantially more interesting, where the Haskell will really shine. More specifically, we modularize what we have today, see how to make our toy a haskell wclittle less toy, and also evaluate how this will affect performance.