
Posts

Latin1 vs UTF8

Latin1 was the early default character set for encoding documents delivered via HTTP for MIME types beginning with text/ . Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today. A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages. The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This is because UTF8 is more complex, using multi-byte sequences for characters beyond the ASCII range. This is also why you can't just throw any sequence of bytes at it and e...
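A minimal sketch of that difference in Python (the byte values are arbitrary, chosen only for illustration): any byte string decodes cleanly as Latin1, while the same bytes may be rejected as UTF8.

# Any byte value from 0 to 255 maps to a character in Latin1 (ISO-8859-1).
raw = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0xE9, 0xFF])

print(raw.decode("latin-1"))   # always succeeds: 'Hello éÿ'

try:
    print(raw.decode("utf-8"))
except UnicodeDecodeError as err:
    # 0xE9 opens a multi-byte UTF-8 sequence, and 0xFF can never appear
    # in valid UTF-8, so decoding fails.
    print("not valid UTF-8:", err)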
Recent posts

Too much efficiency makes everything worse

From "Overfitting and the strong version of Goodhart's law" : Increased efficiency can sometimes, counterintuitively, lead to worse outcomes. This is true almost everywhere. We will name this phenomenon the strong version of Goodhart's law. As one example, more efficient centralized tracking of student progress by standardized testing seems like such a good idea that well-intentioned laws mandate it. However, testing also incentivizes schools to focus more on teaching students to test well, and less on teaching broadly useful skills. As a result, it can cause overall educational outcomes to become worse. Similar examples abound, in politics, economics, health, science, and many other fields. [...] This same counterintuitive relationship between efficiency and outcome occurs in machine learning, where it is called overfitting. [...] If we keep on optimizing the proxy objective, even after our goal stops improving, something more worrying happens. The goal often sta...

Bootstrapping and the Central Limit Theorem

If you've ever seen a data visualization, you've probably seen a bell curve, or normal distribution. This shape shows up so often because of the law of large numbers and the central limit theorem. The central limit theorem tells us that the distribution of a suitably normalized sample mean converges to a standard normal distribution as the sample size grows. For example, let's say that we wish to chart the forty most popular science-fiction books on Goodreads by the number of pages they contain. Our initial sample will look something like this:

pageCounts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

If we want to plot our original sample of books, we could do something like:

import numpy as np
import seaborn as sns
import matplotlib....
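A minimal sketch of the bootstrapping step this builds toward (hypothetical follow-up code reusing the pageCounts array from the excerpt, not the post's own continuation): resample the page counts with replacement many times and plot the distribution of the resample means, which comes out roughly bell-shaped even though the raw page counts are not.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pageCounts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

rng = np.random.default_rng(42)

# Draw 10,000 bootstrap samples (same size as the original, with replacement)
# and record the mean of each one.
bootMeans = np.array([
    rng.choice(pageCounts, size=pageCounts.size, replace=True).mean()
    for _ in range(10_000)
])

# The bootstrap means cluster around the sample mean in an approximately
# normal shape, as the central limit theorem predicts.
sns.histplot(bootMeans, kde=True)
plt.xlabel("Mean page count per bootstrap sample")
plt.show()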

Unlearning, or Proof by Contradiction

Sometimes, we have to unlearn the things we initially learned. And I don't mean this in the sense of having been deliberately deceived. Rather, I mean that to some extent, there are actually many situations in life that involve necessary lies, or believing things that are wrong for perfectly rational reasons. Sometimes it is only after we have consumed and digested such a falsehood that we can see the truth at all. Really, this form of learning is not unlike some parts of math. Consider a mathematical proof in which we begin by assuming that something is one way. But by the end of the proof, we may realize, through contradiction, that it's actually another way. Let us take the number 2 and generously hypothesize that the square root of 2 is actually rational. If this assumption were true, we should be able to write it as a fraction: let the square root of 2 be $\frac{p}{q}$ in lowest terms. Since squares of even numbers are even, and squares of odd numbers ...
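For reference, the standard contradiction the excerpt is building toward runs roughly like this (a compact sketch of the well-known argument, not the post's own wording): if $\sqrt{2} = \frac{p}{q}$ with $\gcd(p, q) = 1$, then $p^2 = 2q^2$, so $p^2$ is even and therefore $p$ is even, say $p = 2k$. Substituting gives $4k^2 = 2q^2$, so $q^2 = 2k^2$, making $q$ even as well. That contradicts $\gcd(p, q) = 1$, so $\sqrt{2}$ cannot be rational.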

Patterns and the Stock Market

On the random walk hypothesis and post-hoc explanations for describing natural processes, from " Patterns and the Stock Market ": While it's certainly entertaining to spin post-hoc explanations of market activity, it's also utterly futile. The market, after all, is a classic example of a "random walk," since the past movement of any particular stock cannot be used to predict its future movement. This inherent randomness was first proposed by the economist Eugene Fama, in the early 1960's. Fama looked at decades of stock market data in order to prove that no amount of rational analysis or knowledge (unless it was illicit insider information) could help you figure out what would happen next. All of the esoteric tools and elaborate theories used by investors to make sense of the market were pure nonsense. Wall Street was like a slot machine. Alas, the human mind can't resist the allure of explanations, even if they make no sense. We're so eager t...
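A quick way to see what "random walk" means here (a hypothetical illustration, not from the quoted article): simulate a price series from independent random returns and check that yesterday's return tells you essentially nothing about today's.

import numpy as np

rng = np.random.default_rng(7)

# A toy "stock": prices built from independent daily returns.
returns = rng.normal(0, 0.01, 2500)
prices = 100 * np.cumprod(1 + returns)

# Correlation between each day's return and the previous day's return.
# For a random walk this is approximately zero.
lag1_corr = np.corrcoef(returns[:-1], returns[1:])[0, 1]
print(f"final price: {prices[-1]:.2f}")
print(f"lag-1 autocorrelation of returns: {lag1_corr:.4f}")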

yt-dlp Archiving, Improved

One annoying thing about YouTube is that, by default, some videos are now served in .webm format or use VP9 encoding. However, I prefer storing media in more widely supported codecs and formats like .mp4, which runs on more devices than .webm files. And sometimes I prefer AVC1 MP4 encoding because it just works out of the box on OSX with QuickTime, as QuickTime doesn't natively support VP9/VP09. AVC1-encoded MP4s are still the most portable video format.

AVC1 ... is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019. [ 1 ]

yt-dlp , the command-line audio/video downloader for YouTube videos, is a great project. But between YouTube serving various codecs and compatibility issues across video players, getting exactly what you want out of yt-dlp can be a bit challenging:

$ yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best...
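For what it's worth, the same preference can also be expressed through yt-dlp's Python API. A minimal sketch (the URL is a placeholder, and this selector is just one reasonable way to ask for AVC1 video plus M4A audio):

import yt_dlp

# Prefer AVC1-encoded MP4 video plus M4A audio; fall back to the best MP4.
ydl_opts = {
    "format": "bestvideo[vcodec^=avc1][ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]",
    "merge_output_format": "mp4",
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL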

Subshells in PowerShell

Previously, I wrote a post about how it's possible to create a "subshell" in Windows analogous to the subshell feature available in Bash on Linux, since Microsoft Windows doesn't have a native subshell capability the way Linux does. The script below improves on that earlier method, which used the .NET System.Diagnostics trick, and this new version correctly redirects the standard output:

# Configure the child process: run cmd.exe with a custom PATH and capture stdout.
$x = New-Object System.Diagnostics.ProcessStartInfo
$x.FileName = "cmd.exe"
$x.Arguments = "/c echo %PATH%"
$x.UseShellExecute = $false
$x.RedirectStandardOutput = $true
$x.EnvironmentVariables.Remove("Path")
$x.EnvironmentVariables.Add("PATH", "C:\custom\path")

# Launch the process, read its output, and wait for it to finish.
$p = New-Object System.Diagnostics.Process
$p.StartInfo = $x
$p.Start() | Out-Null
$output = $p.StandardOutput.ReadToEnd()
$p.WaitForExit()

Write-Output $output

Real-World Example

$customPath2 = "C:\custom\path\2"
$data = @{ ...