Latin1 was the early default character set for encoding documents delivered via HTTP for MIME types beginning with text/. Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today. A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages. The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This fact is due to UTF8's added complexity, using multi-byte sequences for characters beyond the general ASCII range. This is also why you can't just throw any sequence of bytes at it and ex...
Get out of the habit of using while read as an idiom and instead use xargs to process arguments when you're doing batch compute stuff.
For example, imagine you're piping some data out with cat:
$ time ( cat data.txt | while read line; do echo $line; done )
a
b
c
d
e
f
g
0.00s user 0.00s system 85% cpu 0.005 total
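The loop runs its body once per line, so the shell reads and echoes each line separately. (If the body called an external command instead of a builtin such as echo, it would also start a new process for every line.) Using xargs, all of the data is passed to a single echo invocation at once, which is why the output lands on one line: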
$ time ( cat data.txt | xargs echo )
a b c d e f g
0.00s user 0.01s system 110% cpu 0.008 total
But now consider this same idiom while processing a larger input with thousands of lines. Here, we'll run our benchmark again on the first 10,000 lines of a file:
$ time ( head -n 10000 Notes/all.txt | while read line; do echo $line; done )
..
.. // omitted for brevity
0.10s user 0.18s system 105% cpu 0.267 total
$ time ( head -n 10000 Notes/all.txt | xargs echo; )
..
.. // snipped again
0.02s user 0.08s system 100% cpu 0.100 total
When possible, use xargs. You'll likely save time and CPU cycles.
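To make the batching behavior concrete, here is a minimal sketch that feeds the same input through both idioms. The sample input (the numbers 1 to 10) and the variable names are made up for illustration, and -n 4 is used only to make the batches visible; without -n, xargs packs as many arguments into each invocation as the system's command-line limit allows.

```shell
#!/bin/sh
# Hypothetical sample input: the numbers 1..10, one per line.
input=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# Loop version: the body runs once per input line.
loop_out=$(printf '%s\n' "$input" | while read -r line; do echo "$line"; done)

# xargs version: -n 4 passes at most four arguments to each echo invocation,
# so ten inputs become three invocations (4 + 4 + 2).
batch_out=$(printf '%s\n' "$input" | xargs -n 4 echo)

printf 'loop:\n%s\n' "$loop_out"
printf 'xargs -n 4:\n%s\n' "$batch_out"
```

One thing to keep in mind: xargs splits its input on any whitespace, not just newlines, and treats quotes and backslashes specially. A line containing spaces therefore arrives as several arguments, so the two idioms are only interchangeable when each line is a single plain word.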