哈Ha
另一天 暹粒邀请我翻译一篇有关在wc
Haskell的帮助下战胜Unix 的有趣文章。当然,由于以下几个原因,我不会翻译它:
- 作者绝不退出单线程版本,而单线程版本要慢得多
wc
, - 在那篇文章中,有必要使用多线程来赢得(这本身就是一个有点欺骗和胜利,而在常识而不是通过
wc
) - 为此,作者不得不深入研究
chomonads和monoids-不是,这很好地说明了monoidal思想的魅力,但是恕我直言,对于这样的任务而言,它有点太多了,特别是因为 - 结果证明代码太繁琐,
- 实际上,要与
wc
具有很多选项和功能的竞争,要意识到它是一个非常像玩具的模拟游戏,总的来说,这有点奇怪,甚至有些愚蠢。
尽管如此,做一些奇怪的事情是一件好事,所以今天我们将尝试解决上述问题中的第一个问题,并改善Chris(原始文章作者的名字)的结果。
同样,正如我们上次发现的那样,我无法编写C代码,因此也不会编写C代码。作为Haskell实现的竞争对手,我(像Chris一样)wc
从GNU Coreutils 获得了一个协同的代码。那些家伙肯定知道如何用C编写代码,这段代码已经使用了10年了,并且通过以下方面来评估性能:
破坏者:我们将在wc
没有任何问题的情况下将系统超越一个数量级,获得完全惯用的代码,并花费不到半小时的时间来更改和优化原始代码。
实验条件
因此,首先我们讨论实验设置。
数据
我下载了此文件并将其粘贴在文件中,因此总大小约为1.8 GB:
% for i in `seq 1 10`; cat part.txt >> test.txt
% du -sh test.txt
1.8G test.txt
, test.txt
tmpfs
-, , IO.
Gentoo Linux Core i7 4770 32 .
- ghc 8.8.2.
wc
coreutils 8.31, gcc 9.2 -O2 -march=native
:
% wc --version
wc (GNU coreutils) 8.31
Packaged by Gentoo (8.31-r1 (p0))
, -march=native
— C, - - , : x86_64-, wc
coreutils ( SIMD-, ).
, -O3
-O2
— , -O2
.
, time
. time
, , ,
- 0.3 ,
- , , .
, ( ) , .
? wc
!
Unicode ( ), C-, wc
LANG=C LC_ALL=C wc /path/to/test.txt
. ? 10.4 . .
, (ru_RU.UTF-8
) wc
(7.20 ), , , , C- — , .
, ( ), 0.2 — , IO-bound-.
Haskell
, ?
, cs
bs
( , ):
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char
wc :: BS.ByteString -> (Int, Int, Int)
wc s =
let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
in (bs, ws, ls)
where
go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
go (!bs, !ws, !ls, !wasSpace) c =
let addLine | c == '\n' = 1
| otherwise = 0
addWord | wasSpace = 0
| isSpace c = 1
| otherwise = 0
in (bs + 1, ws + addWord, ls + addLine, isSpace c)
, , ( ).
, , 9 wc
. : 31.2 , 306% wc
. 9 , , .
, ( , ), , , , .
?
-, , — . :
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char
data State = State
{ bs :: Int
, ws :: Int
, ls :: Int
, wasSpace :: Bool
}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
where
State { .. } = BS.foldl' go (State 0 0 0 False) s
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
where
addLine | c == '\n' = 1
| otherwise = 0
addWord | wasSpace = 0
| isSpace c = 1
| otherwise = 0
bang-, - {-# LANGUAGE Strict #-}
. , ?
, , — 4 ! 7.56 , 75% wc
, ! ?
, : , , . -
data State = State
{ bs :: {-# UNPACK #-} !Int
, ws :: {-# UNPACK #-} !Int
, ls :: {-# UNPACK #-} !Int
, wasSpace :: !Bool
}
Int
- . , nursery area, , , — , .
CSE
, isSpace c
, — . , , go
:
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
where
isSp = isSpace c
addLine | c == '\n' = 1
| otherwise = 0
addWord | wasSpace = 0
| isSp = 1
| otherwise = 0
? — 2.93 , 28% wc
.
? ghc , , ? — isSpace
( ) , common subexpression elimination.
— ( ). , wc
, . - , :
- LLVM- (
-fllvm
) — , . - (
-O2
) — -O
, -O2
, ? - (
{-# INLINE wc #-}
). , - , , , - . , , , .
, , , .
:
. -, . , -O2
, , LLVM- , ( ).
, . , C (19%!). , .
.
, State
, .
, Bool
, . , Int
, 1
True
, 0
— False
? , , !
data State = State
{ bs :: Int
, ws :: Int
, ls :: Int
, wasSpace :: Int
}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
where
State { .. } = BS.foldl' go (State 0 0 0 0) s
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
where
isSp | isSpace c = 1
| otherwise = 0
addLine | c == '\n' = 1
| otherwise = 0
addWord | wasSpace == 1 = 0
| isSp == 1 = 1
| otherwise = 0
(, Bool
), . , , addWord
guard' ( if
) :
addWord = (1 - wasSpace) * isSp
.
? , , 1.91 , 18% wc
. , .
, ?
ASCII, isSpace
. , . , , , .
, Data.ByteString.Lazy
Data.ByteString.Lazy.Char8
, Char
.
:
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
module Data.WordCount where
import qualified Data.ByteString.Lazy as BS
data State = State
{ bs :: Int
, ws :: Int
, ls :: Int
, wasSpace :: Int
}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
where
State { .. } = BS.foldl' go (State 0 0 0 0) s
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
addLine | c == 10 = 1
| otherwise = 0
addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}
, 1.45 , 14% wc
.
, — , , .
, , . , : wasSpace
, , :
wc s = (bs, ws + 1 - wasSpace, ls)
where
State { .. } = BS.foldl' go (State 0 0 0 1) s
...
, .
:
0.35 , ? - - ?
, , wc
. :
main :: IO ()
main = do
[path] <- getArgs
contents <- BS.readFile path
print $ wc contents
readFile
. mmap
. bytestring-mmap, ByteString
' mmap
- , main
main :: IO ()
main = do
[path] <- getArgs
contents <- unsafeMMapFile path
print $ wc contents
, unsafeMMapFile
ByteString
, wc
, . , unsafeMMapFile
, , mmap
- , .
— , 0.04-0.06 0.35 . , .
mmap
, : , , . , : , .
?
, , Unix-, . , , ST
.
, , : , wc
, , « » wc
, , . , .
好吧,为了避免折磨,我们将讨论未来的计划:在下一篇文章中,我们将处理更有趣的事情,Haskell将会真正发光。更具体地说,我们将当今的产品模块化,了解如何使玩具haskell wc
少一点玩具,并评估这将如何影响性能。