Zip files support many different compression methods, however, and although Deflate is the one most commonly used today, it was not added until several years after the introduction of the Zip file format. While the earlier methods are not in themselves relevant anymore, the techniques involved are still both interesting and relevant. For example, the first method used LZW compression, which popularized dictionary compression, gained notoriety due to patent issues, and is still widely used in GIF files. From a historical perspective, the older methods allow us to trace the evolution from the roots of PKZip to the Deflate method that we use today.

This article describes and implements the Shrink, Reduce, and Implode compression methods. The previous article is not required reading, but provides useful background for readers without previous knowledge about Zip files. All the code is available in hwzip-2.0.zip.

Let's do data compression like it's 1989!

This article explains how the Zip file format and its compression scheme work in great detail: LZ77 compression, Huffman coding, Deflate and all. It tells some of the history, and provides a reasonably efficient example implementation written from scratch in C. The source code is available in hwzip-2.0.zip.

Until then I had only programmed in QBasic, but if one could make fire with Pascal, I was determined to learn it.

My uncle supplied me with his university textbook on the language, and I got to work. Unfortunately, the book turned out to be extremely thin on the subject of making fire. Also, it was mainly concerned with programming the PDP-10 as opposed to the IBM PC that I was using. And so, I never learned the skill.

(There's a lesson here about the social aspects of programming: I could have asked more and much better questions of our neighbour.)

I've wanted to revisit this for years. Having acquired better programming, English, and web search skills, it's time to fill this gap in my education. This post contains a walk-through of the classic MS-DOS *firedemo*, a port of it to SDL, and an implementation of the fire effect that runs on bare-metal PCs.

For those who managed to find the answer, a second problem awaited on the secret web site, and those who solved that were then encouraged to send in a job application.

Effectively nerd sniped, I started playing with this problem sometime last year, and it led down a path to some excellent programming exercises.

This post describes a few different ways of solving the problem: from a Perl one-liner, to using hand-rolled fixed-point arithmetic (including an implementation of Improved division by invariant integers), to using binary splitting with GMP to compute a billion decimals of *e*.

The code is available in eprime.c.

- Finding the 10-digit Prime With Perl
- Computing *e* Ourselves
- Computing *e* Ourselves Without GMP
- Computing *e* Using Binary Splitting
- Further Reading

Searching for "first 10-digit prime in e" quickly yields the answer: 7427466391 is the number we're looking for. But let's assume we found this problem early on, and that the solution had yet to be posted online. We can still benefit from the web by searching for "many digits of e". The first hit provides two million digits in which we can search for the solution, for example with a Perl one-liner:

```
$ curl -s https://apod.nasa.gov/htmltest/gifcity/e.2mil | tr -d '[:space:]' | \
perl -MMath::Prime::Util=is_prime -MList::Util=first -nle \
'print first { is_prime($_) } /(?=([1-9]\d{9}))/g'
7427466391
```

(To install the required Perl module on a Debian system: sudo apt-get install libmath-prime-util-perl)

How does this work?

- curl downloads the e.2mil file, which is then piped to the next command. The -s (for silent) flag makes it not print any other output.
- tr deletes (-d) all whitespace characters, making sure the digits of *e* from the file end up on one line with no spaces in between them.
- With Perl's -M flag, we import two Perl module subroutines: Math::Prime::Util::is_prime and List::Util::first.
- -n makes Perl loop over the input, executing our code for each line, with the current line in $_.
- -l causes a newline to be added to each print statement.
- -e specifies the code we want to run.
- The /.../g match operator matches a regular expression against $_, which holds the current line. The /g (global matching) modifier makes it return a list of all the strings matched by capture groups.
- The capture group in our regex, ([1-9]\d{9}), matches a non-zero digit followed by nine digits, i.e. a ten-digit number (leading zeros wouldn't count).
- Perl starts looking for the next regex match where the previous match ends, and since we want to find all ten-digit numbers in the string, including those that overlap, we put the pattern in a lookahead assertion: (?=..). Our regex matches an empty string followed by a ten-digit number; the number isn't considered part of the match, but it does get captured.
- first returns the first element from the list of regex matches for which is_prime returns true.
- Finally, that element is printed.
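If Perl is not at hand, the same overlapping search is straightforward in C. The sketch below is hypothetical and not part of eprime.c: first_prime_window (an illustrative name) moves a window of width digits one position at a time, which plays the role of the (?=...) lookahead, skips windows that start with a zero, and uses plain trial division as a stand-in for is_prime. Trial division is slow for big numbers but fine for 10 digits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Trial division by 2 and the odd numbers up to sqrt(n).
   Hypothetical stand-in for Math::Prime::Util's is_prime. */
bool is_prime_trial(uint64_t n)
{
    if (n < 2) {
        return false;
    }
    if (n % 2 == 0) {
        return n == 2;
    }
    for (uint64_t f = 3; f * f <= n; f += 2) {
        if (n % f == 0) {
            return false;
        }
    }
    return true;
}

/* Return the first width-digit prime formed by consecutive digits of s,
   or 0 if there is none. Windows overlap, as with (?=...) in the regex,
   and windows starting with '0' are skipped. Requires width <= 19. */
uint64_t first_prime_window(const char *s, int width)
{
    size_t len = strlen(s);
    assert(width > 0 && width <= 19);
    for (size_t i = 0; i + (size_t)width <= len; i++) {
        uint64_t v = 0;
        if (s[i] == '0') {
            continue;
        }
        for (int j = 0; j < width; j++) {
            v = v * 10 + (uint64_t)(s[i + j] - '0');
        }
        if (is_prime_trial(v)) {
            return v;
        }
    }
    return 0;
}
```

Run on the first 158 decimals of *e* with a width of 10, it returns 7427466391, the same answer as the one-liner.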

Computing mathematical constants to a large number of digits has been a popular sport (in select circles) for a long time. Pi is especially popular, with the current record at 22.5 trillion digits, but *e* too has been computed to extreme precision (at least to 5 trillion digits).

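One way to approach the computation is from Taylor's theorem, which for *e* gives us

$$e = \sum_{i=0}^{n} \frac{1}{i!} + R_n = 1 + \frac{1}{1!} + \frac{1}{2!} + \cdots + \frac{1}{n!} + R_n$$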

where $R_n$, the remainder term, is bounded by

$$0 < R_n < \frac{e}{(n+1)!}$$

The factorial in the denominator of the remainder term means the series converges quickly.
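To see how quickly the series converges, and that the remainder really does shrink like 1/(n+1)!, one can sum it in ordinary double precision for small *n*. The sketch below is illustrative only: the function names and the 16-digit reference value E_REF are assumptions, not part of eprime.c. It checks that the remainder lies between 1/(n+1)! and e/(n+1)!, which is what the Lagrange form of the remainder, $R_n = e^{\xi}/(n+1)!$ for some $0 < \xi < 1$, predicts. Beyond n ≈ 12 the remainder approaches double's rounding error, so the check only makes sense for small n.

```c
#include <assert.h>
#include <stdbool.h>

#define E_REF 2.718281828459045 /* e to 16 significant digits (assumed reference) */

/* Partial sum 1/0! + 1/1! + ... + 1/n! in double precision. */
double taylor_e(int n)
{
    double sum = 1.0, term = 1.0;
    for (int i = 1; i <= n; i++) {
        term /= i;      /* term is now 1/i! */
        sum += term;
    }
    return sum;
}

/* Check 1/(n+1)! < e - S_n < e/(n+1)!, the bounds on the remainder.
   Only meaningful while the remainder is well above double's
   rounding error (n <= 12 or so). */
bool remainder_within_bounds(int n)
{
    double next = 1.0;  /* becomes 1/(n+1)! */
    for (int i = 1; i <= n + 1; i++) {
        next /= i;
    }
    double r = E_REF - taylor_e(n);
    return next < r && r < E_REF * next;
}
```

With n = 17, taylor_e already agrees with E_REF to within a few units in the last place of a double.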

By choosing a large enough *n* and adding the terms of the series, we can approximate *e* to any precision we want. To get *k* correct decimals, we need an *n* such that

$$\frac{e}{(n+1)!} < 10^{-k}$$

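Taking the logarithm of both sides and using Stirling's approximation for factorials, $\ln m! \approx m \ln m - m$, gives us

$$(n+1)\ln(n+1) - (n+1) > 1 + k \ln 10$$

(Since $m \ln m - m < \ln m!$ for $m \ge 1$, an *n* that satisfies this inequality also satisfies the exact one.)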

Since the left-hand side is monotonic, we can use binary search to quickly find the smallest *n* that fulfills the inequality:

```
#include <math.h>
#include <stdint.h>

uint64_t compute_n(uint64_t k)
{
    /* Find the smallest n such that
         (n + 1) * ln(n + 1) - (n + 1) > 1 + k * ln(10)
       using binary search. */
    uint64_t hi, lo, n;
    hi = UINT64_MAX;
    lo = 0;
    while (hi - lo > 1) {
        n = lo + (hi - lo) / 2;
        if ((n + 1) * log(n + 1) - (n + 1) <= 1 + k * log(10)) {
            lo = n;
        } else {
            hi = n;
        }
    }
    return hi;
}
```

To compute 10,000 decimals, we need 3,249 terms.

For the computation of *e*, we cannot use the regular double data type, since it only has enough precision for about 15 to 17 significant decimal digits. Instead, we will use GMP's arbitrary-precision floating-point functions.

How many bits of precision do we need? Each decimal digit requires $\log_2 10 \approx 3.322$ bits, so to compute 10,000 decimals we define:

```
#ifndef NUM_DECIMALS
#define NUM_DECIMALS 10000ULL
#endif
#define NUM_DIGITS (1 + NUM_DECIMALS)
#define NUM_BITS (NUM_DIGITS * 3322 / 1000 + 999)
```

(Note that we have to be careful not to lose any bits due to the truncating integer division.)

We need two floating-point variables: one for the sum of the terms (which approaches *e*), and one for the current term. The current term starts at 1, and then we divide it by 1, 2, 3, etc., so that after the *i*th loop iteration its value is *1/i!*.

```
void eprime_gmp(void)
{
    uint64_t i, n;
    mpf_t e, term;
    char *s;
    mp_exp_t strexp;
    n = compute_n(NUM_DECIMALS);
    /* Compute e. */
    mpf_set_default_prec(NUM_BITS);
    mpf_init_set_ui(e, 1);
    mpf_init_set_ui(term, 1);
    for (i = 1; i <= n; i++) {
        mpf_div_ui(term, term, i);
        mpf_add(e, e, term);
    }
    mpf_clear(term);
```

We convert the final value to a string of decimal digits using mpf_get_str:

```
    /* Convert to string of decimal digits. */
    s = mpf_get_str(NULL, &strexp, 10, NUM_DIGITS, e);
    mpf_clear(e);
    assert(strexp == 1);
    assert(strlen(s) == NUM_DIGITS);
#ifdef PRINT_E
    printf("2.%s\n", &s[1]);
#endif
```

The string we get back starts with "2718..", i.e. there is an implicit decimal point after the first character.

Finally, we iterate over the string, checking for 10-digit primes with mpz_probab_prime_p:

```
    find_prime_gmp(&s[1]);
    free(s);
}

/* Find the first 10-digit prime in string s with length NUM_DECIMALS. */
void find_prime_gmp(const char *s)
{
    int i;
    mpz_t p;
    mpz_init(p);
    /* Start at i = 0 so that the window beginning at the first decimal
       is also considered. */
    for (i = 0; i + 9 < NUM_DECIMALS; i++) {
        if (s[i] == '0' || (s[i + 9] - '0') % 2 == 0) {
            /* Skip leading zeros and even numbers. */
            continue;
        }
        gmp_sscanf(&s[i], "%10Zd", p);
        if (mpz_probab_prime_p(p, 20)) {
            gmp_printf("%Zd is prime\n", p);
            break;
        }
    }
    mpz_clear(p);
}
```

To install GMP, build and run the program on Debian:

```
$ sudo apt-get install libgmp-dev
$ gcc -O3 -DNDEBUG eprime.c -lm -lgmp
$ ./a.out
7427466391 is prime
```

On Mac, to install GMP from MacPorts, build and run the program:

```
$ sudo port install gmp
$ clang -O3 -DNDEBUG -I/opt/local/include -L/opt/local/lib eprime.c -lm -lgmp
$ ./a.out
7427466391 is prime
```

To print the digits, pass along -DPRINT_E:

```
$ gcc -O3 -DNDEBUG -DPRINT_E eprime.c -lm -lgmp
$ ./a.out
2.718281828459045235360287471352662497757247093699959574966967627724076630353547
59457138217852516642742746639193200305992181741359662904357290033429526059563073
81323286279434907632338298807531952510190115738341879307021540891499348841675092
44761460668082264800168477411853742345442437107539077744992069551702761838606261
33138458300075204493382656029760673711320070932870912744374704723069697720931014
...
91674197015125597825727074064318086014281490241467804723275976842696339357735429
30186739439716388611764209004068663398856841681003872389214483176070116684503887
21236436704331409115573328018297798873659091665961240202177855885487617616198937
07943800566633648843650891448055710397652146960276625835990519870423001794655367
89
7427466391 is prime
```

The result can be compared against, for example, this page.

(Note that mpf_get_str has rounded the last decimal up to 9, and so has that page. If you print more decimals, you will see that the 10,000th decimal is actually 8, followed by 5674..)

As it turns out, the 10-digit prime occurs early in *e*, starting at the 99th decimal (in the second row above), so there is no need to compute all 10,000 decimals. But more decimals is more fun!

It seems amazing that just by repeatedly dividing and adding some numbers together, we end up with this sophisticated mathematical constant.

GMP did all the work in the code above. Perhaps if we implemented all of it ourselves, it would be even more satisfying? This seems like a worthwhile programming exercise.

Remember how numbers are represented in our regular decimal system. For example:

As long as we remember the position of the decimal point, we could simply store the number as an array of digits:

```
int nbr[6] = {1,2,3,4,5,6};
```

This is called fixed-point representation because the implicit decimal point occurs in a fixed position of the number (as opposed to floating-point, where it can change).

Since the instructions in modern computers perform arithmetic on 64-bit values, it's much more efficient to store numbers as arrays of 64-bit words rather than as arrays of decimal digits.

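As we know that the integer part of *e* is 2, we only concern ourselves with computing and storing the fractional part. We will represent it as an array of *n* 64-bit words $w_0, \ldots, w_{n-1}$, with $w_{n-1}$ the most significant:

$$e - 2 \approx w_{n-1} \cdot 2^{-64} + w_{n-2} \cdot 2^{-128} + \cdots + w_0 \cdot 2^{-64n} = \sum_{i=0}^{n-1} w_i \cdot 2^{-64(n-i)}$$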
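To make the layout concrete, here is a small hypothetical helper (not part of eprime.c) that evaluates such an array as an ordinary double, starting from the least significant word in the style of Horner's rule. It keeps only about 16 significant digits, but it can be handy for sanity-checking intermediate values; for example, an array whose most significant word is 2^63 and whose other words are zero represents 0.5.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate the n-word fixed-point fraction w as a double.
   w[n - 1] is the most significant word: one unit in w[n - 1] is worth
   2^-64, one unit in w[n - 2] is worth 2^-128, and so on. */
double frac_to_double(const uint64_t *w, int n)
{
    double x = 0.0;
    /* Horner's rule from the least significant word:
       x = (w[i] + x) * 2^-64 at each step. */
    for (int i = 0; i < n; i++) {
        x = ((double)w[i] + x) * 0x1p-64;
    }
    return x;
}
```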

To compute *e*, we only need to perform two operations: addition and division (what we did with mpf_add and mpf_div_ui in the GMP version).

Addition is straightforward; we add the numbers place by place, starting at the least significant end and carrying the one as necessary (this is Algorithm A in The Art of Computer Programming, Section 4.3.1):

```
/* Add n-place integers u and v into w. */
void addn(int n, const uint64_t *u, const uint64_t *v, uint64_t *w)
{
    bool carry, carry_a, carry_b;
    uint64_t sum_a, sum_b;
    int j;
    carry = false;
    for (j = 0; j < n; j++) {
        sum_a = u[j] + carry;
        carry_a = (sum_a < u[j]);
        sum_b = sum_a + v[j];
        carry_b = (sum_b < sum_a);
        w[j] = sum_b;
        assert(carry_a + carry_b <= 1);
        carry = carry_a + carry_b;
    }
}
```

Performing the division is trickier. Luckily, we only need to divide our n-word term by a single word. That means the algorithm is straightforward, essentially how we do short division in school.
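The school version works digit by digit, and it is worth seeing once in code because each step is a small "two-by-one" division: the running remainder, followed by the next digit, is divided by the divisor. The sketch below is a hypothetical base-10 miniature, with digits stored most significant first (unlike the word arrays in this article). With 64-bit words, the same step requires dividing a 128-bit value by a 64-bit one.

```c
#include <assert.h>

/* Divide the n-digit decimal number u (most significant digit first)
   by d, writing the n quotient digits to q and returning the remainder.
   Each step divides the two-"digit" value r:u[i] by d; since r < d,
   the quotient digit is always below 10. */
unsigned short_divide(const unsigned char *u, int n, unsigned d, unsigned char *q)
{
    unsigned r = 0;
    assert(d != 0);
    for (int i = 0; i < n; i++) {
        unsigned cur = r * 10 + u[i];   /* bring down the next digit */
        q[i] = (unsigned char)(cur / d);
        r = cur % d;
    }
    return r;
}
```

Dividing the digits {1,2,3,4,5,6} by 7 gives the quotient digits {0,1,7,6,3,6} and remainder 4, since 123456 = 7 · 17636 + 4.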

However, the algorithm relies on being able to divide a two-word dividend by a one-word divisor (dividing 128 bits by 64 bits in our case), so-called "two-by-one division". Some CPUs have an instruction for that; for example, Intel x86's DIV does exactly what we need as long as the quotient fits in a single word (which it always will in our case). But it is not possible to express that division in standard C, and many CPUs don't have an instruction for it.

Instead, we will perform two-by-one division using a clever technique that relies on multiplying the dividend by the approximate reciprocal of the divisor. Not only does this solve the division problem for us, it solves it efficiently, because we can reuse the reciprocal when dividing by the same value multiple times. Using x86's DIV instruction would be much slower, since it can take up to 100 cycles.

The algorithm is described in Möller and Granlund "Improved division by invariant integers" (IEEE Trans. Comput. 2011). The paper is an improved version of Granlund and Montgomery "Division by Invariant Integers using Multiplication" (PLDI'94) which is also the subject of Chapter 10 in Hacker's Delight.

The approximate reciprocal of a word *d* is defined as

$$v = \left\lfloor \frac{\beta^2 - 1}{d} \right\rfloor - \beta, \qquad \beta = 2^{64}$$

which, when *d* is "normalized" (meaning that its highest bit is set), fits in a 64-bit word. Basically it's *1/d* shifted up 128 bits, with some adjustments to make it fit in a 64-bit word.

The algorithms in the paper rely on performing 64-bit by 64-bit multiplications, the results of which can be up to 128 bits. Standard C only provides the lower 64 bits of such multiplications, which the paper refers to as umullo:

```
uint64_t umullo(uint64_t a, uint64_t b)
{
    return a * b;
}
```

Most CPUs do provide the full result of the multiplication. With GCC or Clang we can use the non-standard __uint128_t type to implement umulhi (the high 64 bits of the result) and umul (both the high and low 64 bits). With Microsoft Visual C++ we can use intrinsics, and when none of those options are available, we can compute the result by hand using four 32-bit multiplications:

```
#if defined(__GNUC__)
void umul(uint64_t a, uint64_t b, uint64_t *p1, uint64_t *p0)
{
    __uint128_t p = (__uint128_t)a * b;
    *p1 = (uint64_t)(p >> 64);
    *p0 = (uint64_t)p;
}
uint64_t umulhi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}
#elif defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
void umul(uint64_t a, uint64_t b, uint64_t *p1, uint64_t *p0)
{
    *p0 = _umul128(a, b, p1);
}
uint64_t umulhi(uint64_t a, uint64_t b)
{
    return __umulh(a, b);
}
#else
void umul(uint64_t x, uint64_t y, uint64_t *p1, uint64_t *p0)
{
    uint32_t x0 = (uint32_t)x, x1 = (uint32_t)(x >> 32);
    uint32_t y0 = (uint32_t)y, y1 = (uint32_t)(y >> 32);
    uint64_t p;
    uint32_t res0, res1, res2, res3;
    p = (uint64_t)x0 * y0;
    res0 = (uint32_t)p;
    res1 = (uint32_t)(p >> 32);
    p = (uint64_t)x0 * y1;
    res1 = (uint32_t)(p += res1);
    res2 = (uint32_t)(p >> 32);
    p = (uint64_t)x1 * y0;
    res1 = (uint32_t)(p += res1);
    p >>= 32;
    res2 = (uint32_t)(p += res2);
    res3 = (uint32_t)(p >> 32);
    p = (uint64_t)x1 * y1;
    res2 = (uint32_t)(p += res2);
    res3 += (uint32_t)(p >> 32);
    *p0 = ((uint64_t)res1 << 32) | res0;
    *p1 = ((uint64_t)res3 << 32) | res2;
}
uint64_t umulhi(uint64_t a, uint64_t b)
{
    uint64_t p0, p1;
    umul(a, b, &p1, &p0);
    return p1;
}
#endif
```

With these functions in place, we can proceed to implement reciprocal_word, which computes *v* using a carefully implemented Newton iteration (which is hard to follow in the paper).

```
/* Algorithm 2 from Möller and Granlund
   "Improved division by invariant integers". */
uint64_t reciprocal_word(uint64_t d)
{
    uint64_t d0, d9, d40, d63, v0, v1, v2, ehat, v3, v4, hi, lo;
    static const uint64_t table[] = {
        /* Generated with:
           for (int i = (1 << 8); i < (1 << 9); i++)
               printf("0x%03x,\n", ((1 << 19) - 3 * (1 << 8)) / i); */
        0x7fd, 0x7f5, 0x7ed, 0x7e5, 0x7dd, 0x7d5, 0x7ce, 0x7c6, 0x7bf, 0x7b7,
        0x7b0, 0x7a8, 0x7a1, 0x79a, 0x792, 0x78b, 0x784, 0x77d, 0x776, 0x76f,
        0x768, 0x761, 0x75b, 0x754, 0x74d, 0x747, 0x740, 0x739, 0x733, 0x72c,
        0x726, 0x720, 0x719, 0x713, 0x70d, 0x707, 0x700, 0x6fa, 0x6f4, 0x6ee,
        0x6e8, 0x6e2, 0x6dc, 0x6d6, 0x6d1, 0x6cb, 0x6c5, 0x6bf, 0x6ba, 0x6b4,
        0x6ae, 0x6a9, 0x6a3, 0x69e, 0x698, 0x693, 0x68d, 0x688, 0x683, 0x67d,
        0x678, 0x673, 0x66e, 0x669, 0x664, 0x65e, 0x659, 0x654, 0x64f, 0x64a,
        0x645, 0x640, 0x63c, 0x637, 0x632, 0x62d, 0x628, 0x624, 0x61f, 0x61a,
        0x616, 0x611, 0x60c, 0x608, 0x603, 0x5ff, 0x5fa, 0x5f6, 0x5f1, 0x5ed,
        0x5e9, 0x5e4, 0x5e0, 0x5dc, 0x5d7, 0x5d3, 0x5cf, 0x5cb, 0x5c6, 0x5c2,
        0x5be, 0x5ba, 0x5b6, 0x5b2, 0x5ae, 0x5aa, 0x5a6, 0x5a2, 0x59e, 0x59a,
        0x596, 0x592, 0x58e, 0x58a, 0x586, 0x583, 0x57f, 0x57b, 0x577, 0x574,
        0x570, 0x56c, 0x568, 0x565, 0x561, 0x55e, 0x55a, 0x556, 0x553, 0x54f,
        0x54c, 0x548, 0x545, 0x541, 0x53e, 0x53a, 0x537, 0x534, 0x530, 0x52d,
        0x52a, 0x526, 0x523, 0x520, 0x51c, 0x519, 0x516, 0x513, 0x50f, 0x50c,
        0x509, 0x506, 0x503, 0x500, 0x4fc, 0x4f9, 0x4f6, 0x4f3, 0x4f0, 0x4ed,
        0x4ea, 0x4e7, 0x4e4, 0x4e1, 0x4de, 0x4db, 0x4d8, 0x4d5, 0x4d2, 0x4cf,
        0x4cc, 0x4ca, 0x4c7, 0x4c4, 0x4c1, 0x4be, 0x4bb, 0x4b9, 0x4b6, 0x4b3,
        0x4b0, 0x4ad, 0x4ab, 0x4a8, 0x4a5, 0x4a3, 0x4a0, 0x49d, 0x49b, 0x498,
        0x495, 0x493, 0x490, 0x48d, 0x48b, 0x488, 0x486, 0x483, 0x481, 0x47e,
        0x47c, 0x479, 0x477, 0x474, 0x472, 0x46f, 0x46d, 0x46a, 0x468, 0x465,
        0x463, 0x461, 0x45e, 0x45c, 0x459, 0x457, 0x455, 0x452, 0x450, 0x44e,
        0x44b, 0x449, 0x447, 0x444, 0x442, 0x440, 0x43e, 0x43b, 0x439, 0x437,
        0x435, 0x432, 0x430, 0x42e, 0x42c, 0x42a, 0x428, 0x425, 0x423, 0x421,
        0x41f, 0x41d, 0x41b, 0x419, 0x417, 0x414, 0x412, 0x410, 0x40e, 0x40c,
        0x40a, 0x408, 0x406, 0x404, 0x402, 0x400
    };
    assert(d > UINT64_MAX / 2 && "d must be normalized.");
    d0 = d & 1;
    d9 = d >> 55;
    d40 = (d >> 24) + 1;
    d63 = (d >> 1) + d0;
    v0 = table[d9 - (1 << 8)];
    v1 = (v0 << 11) - (umullo(umullo(v0, v0), d40) >> 40) - 1;
    v2 = (v1 << 13) + (umullo(v1, (1ULL << 60) - umullo(v1, d40)) >> 47);
    ehat = (v2 >> 1) * d0 - umullo(v2, d63);
    v3 = (v2 << 31) + (umulhi(v2, ehat) >> 1);
    umul(v3, d, &hi, &lo);
    v4 = v3 - (hi + d + (lo + d < lo));
#if defined(__GNUC__) && defined(__x86_64__) && !defined(NDEBUG)
    uint64_t asmq, asmr;
    __asm__("divq %2" : "=a"(asmq), "=d"(asmr)
            : "r"(d), "d"(UINT64_MAX-d), "a"(UINT64_MAX) : "cc");
    assert(v4 == asmq);
#endif
    return v4;
}
```

If we do have access to the X86 DIV instruction, it can be used to compute the approximate reciprocal directly. We use this in the assert above to check our work.
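The fixed-point steps in reciprocal_word refine an initial table estimate using Newton's method. The underlying idea is easier to see in floating point: to approximate 1/d, repeatedly apply x ← x(2 - dx). Writing ε = 1 - dx, one step turns ε into ε², so the number of correct bits roughly doubles each time. The sketch below is only an illustration under stated assumptions: d is a double in [0.5, 1) (the floating-point analogue of a normalized word), the name recip_newton is hypothetical, and the linear initial estimate 48/17 - 32/17·d, whose error is at most 1/17 on that interval, replaces the paper's lookup table. It is not how reciprocal_word itself is organized.

```c
#include <assert.h>

/* Approximate 1/d for a "normalized" d in [0.5, 1) with Newton's method.
   The error eps = 1 - d*x is squared by every step, so starting from
   |eps| <= 1/17, four steps reach double precision. */
double recip_newton(double d)
{
    assert(d >= 0.5 && d < 1.0);
    double x = 48.0 / 17.0 - 32.0 / 17.0 * d;   /* initial estimate */
    for (int i = 0; i < 4; i++) {
        x = x * (2.0 - d * x);                  /* Newton step */
    }
    return x;
}
```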

Using the reciprocal, 2-by-1 division is implemented as below, using two multiplications and some adjustments.

```
/* Algorithm 4 from Möller and Granlund
   "Improved division by invariant integers".
   Divide u1:u0 by d, returning the quotient and storing the remainder in r.
   v is the approximate reciprocal of d, as computed by reciprocal_word. */
uint64_t div2by1(uint64_t u1, uint64_t u0, uint64_t d, uint64_t *r, uint64_t v)
{
    uint64_t q0, q1;
    assert(u1 < d && "The quotient must fit in one word.");
    assert(d > UINT64_MAX / 2 && "d must be normalized.");
    umul(v, u1, &q1, &q0);
    q0 = q0 + u0;
    q1 = q1 + u1 + (q0 < u0);
    q1++;
    *r = u0 - umullo(q1, d);
    q1 = (*r > q0) ? q1 - 1 : q1;
    *r = (*r > q0) ? *r + d : *r;
    if (*r >= d) {
        q1++;
        *r -= d;
    }
#if defined(__GNUC__) && defined(__x86_64__) && !defined(NDEBUG)
    uint64_t asmq, asmr;
    __asm__("divq %2" : "=a"(asmq), "=d"(asmr)
            : "r"(d), "d"(u1), "a"(u0) : "cc");
    assert(q1 == asmq && *r == asmr);
#endif
    return q1;
}
```
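Nothing in Algorithm 4 depends on the word size being 64 bits, which makes it easy to test at a smaller scale: with 32-bit words, every intermediate value fits in a uint64_t and the result can be checked against ordinary division. The following is a scaled-down transcription under that assumption (β = 2^32 instead of 2^64, and hypothetical function names). For simplicity, recip32 computes the reciprocal v = ⌊(β² - 1)/d⌋ - β with a plain 64-bit division instead of Algorithm 2.

```c
#include <assert.h>
#include <stdint.h>

/* v = floor((2^64 - 1) / d) - 2^32: the approximate reciprocal for
   32-bit words. d must be normalized (top bit set). */
uint32_t recip32(uint32_t d)
{
    assert(d & 0x80000000u);
    return (uint32_t)(UINT64_MAX / d - (1ULL << 32));
}

/* Divide u1:u0 by the normalized d using its reciprocal v, following
   Algorithm 4 with 32-bit words. Requires u1 < d. */
uint32_t div2by1_32(uint32_t u1, uint32_t u0, uint32_t d, uint32_t *r, uint32_t v)
{
    assert(u1 < d && (d & 0x80000000u));
    /* (q1, q0) = v * u1 + (u1, u0); this sum never exceeds 2^64 - 1. */
    uint64_t q = (uint64_t)v * u1 + (((uint64_t)u1 << 32) | u0);
    uint32_t q1 = (uint32_t)(q >> 32) + 1;   /* wraps mod 2^32, as in the paper */
    uint32_t q0 = (uint32_t)q;
    uint32_t rem = u0 - q1 * d;              /* computed mod 2^32 */
    if (rem > q0) {                          /* candidate quotient one too large */
        q1--;
        rem += d;
    }
    if (rem >= d) {                          /* rare: candidate one too small */
        q1++;
        rem -= d;
    }
    *r = rem;
    return q1;
}
```

Comparing div2by1_32 with (u1·2^32 + u0) / d for many pseudo-random inputs is a cheap way to gain confidence in the two adjustment steps.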

And with that in place, we can finally implement our n-by-1 division.

To normalize the divisor, it is convenient to have a function for getting the number of leading zeros in a word. Many CPUs have an instruction for that (on x86, BSR finds the highest set bit, and newer CPUs also have LZCNT), and otherwise we can do it ourselves.

```
/* Count leading zeros. */
#if defined(__GNUC__)
int clz(uint64_t x)
{
    assert(x != 0);
    return __builtin_clzll(x);
}
#elif defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
int clz(uint64_t x)
{
    unsigned long idx;
    assert(x != 0);
    /* _BitScanReverse64 (BSR) gives the index of the highest set bit.
       __lzcnt64 would require the LZCNT instruction, which older CPUs
       lack (they silently execute BSR instead). */
    _BitScanReverse64(&idx, x);
    return 63 - (int)idx;
}
#else
int clz(uint64_t x)
{
    int n = 0;
    assert(x != 0);
    while ((x << n) <= UINT64_MAX / 2) n++;
    return n;
}
#endif
/* Right-shift that also handles the 64 case. */
uint64_t shr(uint64_t x, int n)
{
    return n < 64 ? (x >> n) : 0;
}
```
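The portable fallback above tests one bit position per iteration, so it can take up to 63 steps. A common alternative, sketched here under the hypothetical name clz_bsearch, is a binary search that asks whether the top 32, 16, 8, 4, 2 and 1 bits are all zero, shifting the value up whenever they are; that always takes six steps.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros by binary search: if the top half of the
   remaining bits is all zero, count it and shift it out. */
int clz_bsearch(uint64_t x)
{
    int n = 0;
    assert(x != 0);
    if ((x >> 32) == 0) { n += 32; x <<= 32; }
    if ((x >> 48) == 0) { n += 16; x <<= 16; }
    if ((x >> 56) == 0) { n += 8;  x <<= 8; }
    if ((x >> 60) == 0) { n += 4;  x <<= 4; }
    if ((x >> 62) == 0) { n += 2;  x <<= 2; }
    if ((x >> 63) == 0) { n += 1; }
    return n;
}
```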

Since we shift the divisor to normalize it, we must also shift the dividend by the same amount to get the correct quotient. We can do this on the fly while performing the division (this "short division" algorithm is Exercise 16 in The Art of Computer Programming, Section 4.3.1):

```
/* Divide n-place integer u by d, yielding n-place quotient q. */
void divnby1(int n, const uint64_t *u, uint64_t d, uint64_t *q)
{
    uint64_t v, k, ui;
    int l, i;
    assert(d != 0);
    assert(n > 0);
    /* Normalize d, storing the shift amount in l. */
    l = clz(d);
    d <<= l;
    /* Compute the reciprocal. */
    v = reciprocal_word(d);
    /* Perform the division. */
    k = shr(u[n - 1], 64 - l);
    for (i = n - 1; i >= 1; i--) {
        ui = (u[i] << l) | shr(u[i - 1], 64 - l);
        q[i] = div2by1(k, ui, d, &k, v);
    }
    q[0] = div2by1(k, u[0] << l, d, &k, v);
}
```

Now we can finally proceed to computing *e*. Note that since we're only computing the fraction, we skip the first two terms (1/0! + 1/1! = 2) and start with the third one, 1/2!. To initialize efrac and term to 0.5, we need to figure out what that is in the base we're using:

```
void eprime_manual(void)
{
    int n, i;
    uint64_t *efrac;
    uint64_t *term;
    uint64_t d;
    char *s;
    efrac = calloc(NUM_WORDS, sizeof(*efrac));
    term = calloc(NUM_WORDS, sizeof(*term));
    /* Start efrac and term at 0.5. */
    efrac[NUM_WORDS - 1] = (1ULL << 63);
    term[NUM_WORDS - 1] = (1ULL << 63);
    /* Sum the series. */
    n = compute_n(NUM_DECIMALS);
    for (i = 3; i <= n; i++) {
        divnby1(NUM_WORDS, term, (uint64_t)i, term);   /* term = 1/i! */
        addn(NUM_WORDS, efrac, term, efrac);
    }
```

The code above leaves us with an array of carefully computed 64-bit words that represent the fractional part of *e*. How do we turn that into a decimal string?

We have the fraction

$$e - 2 = 0.718281828459\ldots$$

If we multiply it by 10, the most significant digit, 7, gets moved to the integer position (in our case there is no integer position, so the 7 arrives as an overflow of the multiplication). We can repeat the process to extract one decimal at a time.

Instead of multiplying by 10 to get one decimal, we can multiply by 100 to get two, or by any power of 10 to extract multiple decimals at a time. The largest power of 10 that fits in a 64-bit word is 10^19, so we will use that to extract 19 decimals at a time.

(The time complexity of this is quadratic: *O(n)* multiplications are performed *O(n)* times. GMP implements faster algorithms for binary to decimal conversion.)

```
/* Multiply n-place integer u by x in place, returning the overflow word. */
uint64_t mulnby1(int n, uint64_t *u, uint64_t x)
{
    uint64_t k, p1, p0;
    int i;
    k = 0;
    for (i = 0; i < n; i++) {
        umul(u[i], x, &p1, &p0);
        u[i] = p0 + k;
        k = p1 + (u[i] < p0);
    }
    return k;
}
```

Back in eprime_manual:

```
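    /* Convert to decimal, 19 digits at a time. Each sprintf writes 19
       digits plus a terminating NUL, so round NUM_DECIMALS up to a
       multiple of 19 and add one byte. */
    s = malloc(19 * ((NUM_DECIMALS + 18) / 19) + 1);
    for (i = 0; i < NUM_DECIMALS; i += 19) {
        d = mulnby1(NUM_WORDS, efrac, 10000000000000000000ULL);
        sprintf(&s[i], "%019" PRIu64, d);
    }
#ifdef PRINT_E
    /* The precision argument for %.*s must be an int. */
    printf("2.%.*s\n", (int)NUM_DECIMALS, s);
#endif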
```
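The digit extraction is easy to try out at a smaller scale. In this hypothetical sketch (the names are illustrative, not from eprime.c), the fraction is stored in 32-bit words, least significant first like efrac, and each multiplication by 10^9, the largest power of 10 that fits in 32 bits, pushes the next nine decimals out of the top word as the overflow.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Multiply the n-word fraction u (32-bit words, least significant first)
   by x in place and return the overflow word, i.e. the integer part. */
uint32_t mul32nby1(int n, uint32_t *u, uint32_t x)
{
    uint32_t k = 0;
    for (int i = 0; i < n; i++) {
        uint64_t p = (uint64_t)u[i] * x + k;   /* at most 2^64 - 2^32 */
        u[i] = (uint32_t)p;
        k = (uint32_t)(p >> 32);
    }
    return k;
}

/* Write the first ndigits decimals of the fraction u into s, nine at a
   time. u is destroyed; s needs room for 9 * ceil(ndigits / 9) + 1 chars. */
void frac_to_decimal(int n, uint32_t *u, int ndigits, char *s)
{
    assert(ndigits > 0);
    for (int i = 0; i < ndigits; i += 9) {
        sprintf(&s[i], "%09" PRIu32, mul32nby1(n, u, 1000000000u));
    }
    s[ndigits] = '\0';
}
```

For example, the two-word fraction {0x80000000, 0x80000000} is 0.5 + 2^-33, and its first 18 decimals come out as 500000000116415321.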

To check whether a number is prime, we will implement Algorithm P from The Art of Computer Programming, Section 4.5.4, also known as the Miller-Rabin primality test.

This is a probabilistic test, which means it cannot tell us that a number is prime with absolute certainty, but it's good enough for our purposes, and it's fast.

The algorithm relies on modular exponentiation, which we can implement with repeated squaring (also known as binary exponentiation). What makes it a little difficult is that we're dealing with 64-bit numbers, so we need to be able to compute 128-bit products and divide those by a 64-bit divisor. Luckily, we implemented exactly the necessary tools with umul and div2by1 above.

```
/* Compute x * y mod n, where n << s is normalized and
   v is the approximate reciprocal of n << s. */
uint64_t mulmodn(uint64_t x, uint64_t y, uint64_t n, int s, uint64_t v)
{
    uint64_t hi, lo, r;
    assert(s >= 0 && s < 64);
    assert((n << s) > UINT64_MAX / 2 && "n << s is normalized.");
    assert(x < n && y < n);
    umul(x, y, &hi, &lo);
    div2by1((hi << s) | shr(lo, 64 - s), lo << s, n << s, &r, v);
    return r >> s;
}
/* Compute x^p mod n by means of left-to-right binary exponentiation. */
uint64_t powmodn(uint64_t x, uint64_t p, uint64_t n)
{
    uint64_t res, v;
    int i, l, s;
    /* x may be 0: for composite n, repeated squaring can reach 0. */
    assert(x < n && p > 0);
    s = clz(n);
    v = reciprocal_word(n << s);
    res = x;
    l = 63 - clz(p);
    for (i = l - 1; i >= 0; i--) {
        res = mulmodn(res, res, n, s, v);
        if (p & (1ULL << i)) {
            res = mulmodn(res, x, n, s, v);
        }
    }
    return res;
}
/* Miller-Rabin primality test a.k.a. "Algorithm P" in TAOCP 4.5.4. */
bool is_prob_prime(uint64_t n)
{
    uint64_t q, x, y;
    int i, j, k;
    if (n == 2) {
        return true;
    } else if (n < 2 || n % 2 == 0) {
        return false;
    }
    /* Find q and k such that n = 1 + 2^k * q, where q is odd. */
    k = 0;
    q = n - 1;
    while (q % 2 == 0) {
        k++;
        q /= 2;
    }
    for (i = 0; i < 25; i++) {
        x = ((uint64_t)rand() << 32) | (uint64_t)rand();
        x = x % (n - 2) + 2;
        assert(x > 1 && x < n);
        j = 0;
        y = powmodn(x, q, n);
        for (;;) {
            if (y == n - 1 || (j == 0 && y == 1)) {
                /* Maybe prime; try another x. */
                break;
            } else if (y == 1 && j > 0) {
                return false;
            }
            j++;
            if (j >= k) {
                return false;
            }
            y = powmodn(y, 2, n);
        }
    }
    return true;
}
```
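For moduli below 2^32 the whole test fits in ordinary 64-bit arithmetic: the product of two residues is below 2^64 and can be reduced with the % operator, so no reciprocal is needed. The sketch below (hypothetical names, not part of eprime.c) uses right-to-left binary exponentiation rather than the left-to-right variant above, and swaps the 25 random bases for the fixed bases 2, 7 and 61, which are known to make the test deterministic for all n below 4,759,123,141. It cannot check the 10-digit candidates from *e*, which need up to 34 bits, so treat it as a way to experiment with the algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x^p mod n by right-to-left binary exponentiation, for n >= 2.
   Every product is below 2^64 because both factors are below n < 2^32. */
uint32_t powmod32(uint32_t x, uint32_t p, uint32_t n)
{
    uint64_t res = 1, base = x % n;
    while (p > 0) {
        if (p & 1) {
            res = res * base % n;
        }
        base = base * base % n;
        p >>= 1;
    }
    return (uint32_t)res;
}

/* Miller-Rabin with the fixed bases 2, 7 and 61, known to be
   deterministic for n < 4,759,123,141 and hence for every uint32_t. */
bool is_prime32(uint32_t n)
{
    static const uint32_t bases[] = {2, 7, 61};
    uint32_t q;
    int k = 0;
    if (n < 2) {
        return false;
    }
    if (n % 2 == 0) {
        return n == 2;
    }
    /* Write n - 1 = 2^k * q with q odd. */
    for (q = n - 1; q % 2 == 0; q /= 2) {
        k++;
    }
    for (int i = 0; i < 3; i++) {
        uint32_t a = bases[i] % n;
        uint64_t y;
        int j;
        if (a == 0) {
            continue;   /* n is one of the bases (7 or 61) */
        }
        y = powmod32(a, q, n);
        if (y == 1 || y == n - 1) {
            continue;   /* a is not a witness */
        }
        for (j = 1; j < k; j++) {
            y = y * y % n;
            if (y == n - 1) {
                break;
            }
        }
        if (j == k) {
            return false;   /* never reached n - 1: a proves n composite */
        }
    }
    return true;
}
```

For instance, it rejects 3215031751 = 151 · 751 · 28351, which passes the test for each of the bases 2, 3, 5 and 7.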

Back in eprime_manual:

```
    /* Find the first prime. */
    for (i = 0; i + 9 < NUM_DECIMALS; i++) {
        if (s[i] == '0' || (s[i + 9] - '0') % 2 == 0) {
            /* Skip leading zeros and even numbers. */
            continue;
        }
        sscanf(&s[i], "%10" SCNu64, &d);
        if (is_prob_prime(d)) {
            printf("%" PRIu64 " is prime\n", d);
            break;
        }
    }
    free(term);
    free(efrac);
    free(s);
}
```

We can use our program to compute a million decimals:

```
$ gcc -O3 -march=native -DNDEBUG -DNUM_DECIMALS=1000000ULL eprime.c -lm -lgmp
$ time ./a.out
7427466391 is prime
real 2m0.800s
user 2m0.872s
sys 0m0.000s
```

On my laptop, the "manual" version takes about 2 minutes. If I use x86's DIV instruction instead of multiplying by the reciprocal, it takes 4 minutes. The GMP version takes just over 1 minute; it uses fancier and more carefully implemented algorithms.

While the algorithms above suffice for computing a million decimals, computing a billion decimals calls for a faster approach. The programs used for calculating *e*, *pi*, and other constants to trillions of decimals normally use a technique called *binary splitting*. My implementation is based on the description (PostScript version) by Xavier Gourdon and Pascal Sebah.

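Consider the series below.

$$\sum_{k=a+1}^{b} \frac{a!}{k!} = \frac{1}{a+1} + \frac{1}{(a+1)(a+2)} + \cdots + \frac{1}{(a+1)(a+2)\cdots b}$$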

When *a = 0* and *b = n*, it is equivalent to the Taylor series for *e* except that it's missing the first term, *1/0!*

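To compute the fraction resulting from adding the terms together, we need to rewrite each term to use the common denominator, *Q(a,b)*:

$$\sum_{k=a+1}^{b} \frac{a!}{k!} = \frac{P(a,b)}{Q(a,b)}, \qquad Q(a,b) = (a+1)(a+2)\cdots b = \frac{b!}{a!}$$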

To compute *Q(a,b)*, we can use binary splitting. Instead of iteratively multiplying by *a+1*, *a+2*, and so on, we can use a recursive approach, splitting the computation into two halves at each step and combining the results (binary splitting is really just another name for divide and conquer):

$$Q(a,b) = Q(a,m)\,Q(m,b), \qquad m = \left\lfloor \frac{a+b}{2} \right\rfloor$$

When *Q(a,b)* is small, this isn't any faster than the iterative computation, because binary splitting performs the same number of operations. However, when the numbers are larger, the time required for each operation grows (often super-linearly) with the size of the numbers involved, and then we benefit from the binary splitting method reducing the size of the sub-problems.

(Besides reducing the size of the sub-problems, binary splitting is also a nice way to parallelize the computation by computing the sub-problems on separate threads.)

Computing the numerator of the series, *P(a,b)*, is a little trickier. To rewrite each term on the common denominator, we need to multiply it by the factors missing from its denominator. For example, for the first term, we need to multiply by *a+2*, *a+3*, and so on all the way to *b*. For the second term, we start with *a+3*, and so on:

$$P(a,b) = (a+2)(a+3)\cdots b + (a+3)\cdots b + \cdots + b + 1 = \sum_{k=a+1}^{b} \frac{b!}{k!}$$

If we split *P(a,b)* into *P(a,m)* and *P(m,b)*, how can we combine them? Look at what we get:

$$P(a,m) = \sum_{k=a+1}^{m} \frac{m!}{k!}, \qquad P(m,b) = \sum_{k=m+1}^{b} \frac{b!}{k!}$$

*P(m,b)* matches the last terms of *P(a,b)*, so that seems fine.

*P(a,m)* looks like the first terms of *P(a,b)* except that the numerators are wrong; we need them to be *b!*. To fix this, we can multiply them by *b!/m!*, which we also know as *Q(m,b)*. So, we end up with:

$$P(a,b) = P(a,m)\,Q(m,b) + P(m,b)$$
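Before bringing in big integers, the recurrences can be checked with machine words. In the hypothetical sketch below, P and Q are plain uint64_t values, which only works while they fit (up to n = 20 when a = 0, since 20! is about 2.4 · 10^18). The base case b - a = 1 follows from the definitions: the series then has the single term 1/(a+1), so P(a, a+1) = 1 and Q(a, a+1) = a + 1 = b.

```c
#include <assert.h>
#include <stdint.h>

/* Binary splitting with machine words:
   P(a,b)/Q(a,b) = 1/(a+1) + 1/((a+1)(a+2)) + ... + 1/((a+1)...b).
   Only valid while the results fit in 64 bits (b <= 20 when a = 0). */
void binsplit_u64(uint64_t a, uint64_t b, uint64_t *p, uint64_t *q)
{
    uint64_t m, pam, qam, pmb, qmb;
    assert(b > a);
    if (b - a == 1) {
        *p = 1;   /* P(a, a+1) = 1 */
        *q = b;   /* Q(a, a+1) = a + 1 = b */
        return;
    }
    m = a + (b - a) / 2;
    binsplit_u64(a, m, &pam, &qam);
    binsplit_u64(m, b, &pmb, &qmb);
    *p = pam * qmb + pmb;   /* P(a,b) = P(a,m) Q(m,b) + P(m,b) */
    *q = qam * qmb;         /* Q(a,b) = Q(a,m) Q(m,b) */
}
```

For n = 10 it yields P = 6235301 and Q = 3628800 = 10!, and 1 + P/Q ≈ 2.7182818011, which agrees with *e* to seven decimals.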

In order for the computation to be fast, it's necessary to use a multiplication algorithm with good time complexity (in particular, the naive schoolbook algorithm with *O(mn)* time won't do). GMP implements fancy multiplication algorithms, so we will rely on that for performance:

```
void binsplit(uint64_t a, uint64_t b, mpz_t p, mpz_t q)
{
    uint64_t m;
    mpz_t pmb, qmb;
    assert(b > a);
    if (b - a == 1) {
        mpz_set_ui(p, 1);
        mpz_set_ui(q, b);
        return;
    }
    m = a + (b - a) / 2;
    mpz_init(pmb);
    mpz_init(qmb);
    /* Compute p(a, m) and q(a, m), storing the results in p and q to avoid
     * allocating extra variables. */
    binsplit(a, m, p, q);
    /* Compute p(m, b) and q(m, b). */
    binsplit(m, b, pmb, qmb);
    /* p(a,b) = p(a,m) * q(m,b) + p(m,b) */
    mpz_mul(p, p, qmb);
    mpz_add(p, p, pmb);
    /* q(a,b) = q(a,m) * q(m,b) */
    mpz_mul(q, q, qmb);
    mpz_clear(pmb);
    mpz_clear(qmb);
}
void eprime_binsplit(void)
{
    uint64_t n;
    mpz_t p, q;
    mpf_t e, qf;
    char *s;
    mp_exp_t strexp;
    n = compute_n(NUM_DECIMALS);
    /* Compute the fraction p/q = 1/1! + 1/2! + ... + 1/n!. */
    mpz_init(p);
    mpz_init(q);
    binsplit(0, n, p, q);
    /* Divide p by q in floating point. */
    mpf_set_default_prec(NUM_BITS);
    mpf_init(e);
    mpf_init(qf);
    mpf_set_z(e, p);
    mpf_set_z(qf, q);
    mpf_div(e, e, qf);
    mpz_clear(p);
    mpz_clear(q);
    mpf_clear(qf);
    /* Add the missing 1/0! term. */
    mpf_add_ui(e, e, 1);
    /* Convert to string of decimal digits. */
    s = mpf_get_str(NULL, &strexp, 10, NUM_DIGITS, e);
    mpf_clear(e);
    assert(strexp == 1);
    assert(strlen(s) == NUM_DIGITS);
#ifdef PRINT_E
    printf("2.%s\n", &s[1]);
#endif
    find_prime_gmp(&s[1]);
    free(s);
}
```

Using this version to compute a million decimals is significantly faster than what we saw above:

```
$ gcc -O3 -march=native -DNDEBUG -DNUM_DECIMALS=1000000ULL eprime.c -lm -lgmp
$ time ./a.out
7427466391 is prime
real 0m0.379s
user 0m0.376s
sys 0m0.004s
```

How about a billion decimals? On my laptop, it takes about 20 minutes:

```
$ gcc -O3 -march=native -DNDEBUG -DNUM_DECIMALS=1000000000ULL eprime.c -lm -lgmp
$ time ./a.out
7427466391 is prime
real 21m27.575s
user 20m57.308s
sys 0m13.956s
```

- A 1981 Byte article by Steve Wozniak, The Impossible Dream: Computing *e* to 116,000 Places with a Personal Computer (pp. 392-407)
- Brent and Zimmermann, Modern Computer Arithmetic (Cambridge University Press, 2010), provides a concise overview of algorithms for arbitrary-precision arithmetic. Online drafts are available on the author's website.

Artificial intelligence (AI) is one of those subjects I really wish I'd taken while at university. Philosophical questions about intelligence aside, much of what AI comes down to in practice is techniques for solving problems with computers, which is really what computer science is all about.

A classic AI endeavour is programming computers to play chess. Shannon wrote an article about it as early as 1948, and in 1997 the computer Deep Blue defeated humankind, represented by world champion Garry Kasparov. A similar but much simpler board game is Othello (sometimes called Reversi), and writing an Othello program has become a popular exercise in AI classes.

I wanted to try my hand at writing an Othello game, but just doing a text-based version seemed lame; it really needs a graphical user interface. I never did much GUI programming (we did some in Java at university, but those weren't real programs), so it would be a good exercise for that too. The question was what platform to target. I use Linux, but most desktop computers run Windows or macOS. Also, most people run programs on smartphones these days, and isn't the web the future of computing anyway? It would be interesting to learn about all of these platforms. The project scope had suddenly expanded.

This post describes the implementation of a basic Othello game engine in C with native user interfaces for Linux (X11), Windows, Mac, iOS, Android, and the web (asm.js and WebAssembly).

The source code is available in othello.tar.gz, the Windows executable in othello.exe, macOS application in MacOthello.zip, the iOS game in the App Store, the Android game on Google Play, and the web version right here.

Thanks to Nico who provided valuable feedback on drafts of this post!

**Update 2019-11-05:** Added dark mode support to the iOS port.

Things are simple enough with C. As explained in the first section of K&R, one might write this as a first program:

```
#include <stdio.h>

int main()
{
    printf("Hello, world!\n");
    return 0;
}
```

and turn it into an executable by simply invoking the system compiler as below.

```
$ cc hello.c
$ ./a.out
Hello, world!
```

This diagram illustrates the full process:

For Android, however, the official way to write Hello World is to fire up Android Studio, use its wizard to create a new project, and the application will then be generated and built automagically in a few minutes.

This is of course intended as a convenience for the developer, but for someone who wants to know what's going on, it makes things difficult instead. What actually happened? Which of all these files are really necessary for Hello World?

Others have expressed similar concerns:

(For me, it generated a 50 MB directory containing 1,348 files spread across 630 subdirectories.)

Perhaps it is the control freak in me speaking (a good trait for programmers about their programs), but I simply don't feel comfortable not understanding how to build my own program.

Below are my notes on how to build an Android application by hand from the command line. The instructions are for Linux, but they should be easy to adapt to Mac or Windows. The full source code and a build script are available in command_line_android.tar.gz.

However, one annoying aspect is that the system is not always defined very well. The way I was taught about Roman numerals in school was something like the following. Roman numerals are written left-to-right using letters that each represent different values:

| Letter | M | D | C | L | X | V | I |
|---|---|---|---|---|---|---|---|
| Value | 1000 | 500 | 100 | 50 | 10 | 5 | 1 |

The values are usually just added together (an additive system); however, when a smaller value is followed by a larger value, the smaller one is subtracted from the larger (subtractive notation). For example, 16 is written as XVI, and 14 as XIV.

I soon learned, after being told off for burning IIV onto the face of the clock I made in woodworking class, that there are rules for when one is supposed to use the subtractive notation. But it was never made clear exactly what those rules were.

This article examines how some popular programs implement conversion to Roman numerals.
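As a baseline, here is a minimal sketch of the convention as it is most commonly taught: subtractive notation is used only for the six pairs IV, IX, XL, XC, CD and CM, and otherwise the largest value that fits is taken greedily. The function name to_roman and the 1 to 3999 range are assumptions for illustration; the programs examined in this article may well behave differently.

```c
#include <string.h>

/* Hypothetical helper: convert n (1 to 3999) to a Roman numeral, using
   subtractive notation only for IV, IX, XL, XC, CD and CM.
   buf needs room for 16 chars: the longest result, MMMDCCCLXXXVIII,
   is 15 characters plus the terminating NUL. */
static void to_roman(unsigned n, char *buf)
{
    static const unsigned values[] = {
        1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1
    };
    static const char *const symbols[] = {
        "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"
    };

    buf[0] = '\0';
    /* Greedily append the largest symbol (or subtractive pair) that fits. */
    for (size_t i = 0; i < sizeof(values) / sizeof(values[0]); i++) {
        while (n >= values[i]) {
            strcat(buf, symbols[i]);
            n -= values[i];
        }
    }
}
```

With this table, 16 becomes XVI, 14 becomes XIV, and a form like IIV can never be produced, since only the six listed pairs are ever subtractive.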

Computers have instructions for doing arithmetic on binary numbers of certain sizes, most commonly 32 or 64 bits. The largest 32-bit number is *2^32-1*, or *4,294,967,295*; computations on larger numbers have to be performed in multiple steps operating on smaller pieces.

This is similar to how humans compute: we can handle numbers up to a certain size in our heads, but for larger numbers, we use methods that break the computation into smaller steps. For example, to calculate *5678 + 9012*, we might use pen and paper to add one decimal position at a time from right to left, "carrying the one" as necessary.

Those same methods, or *algorithms*, we learned at school are used by computers as well (and computers are usually better at executing them than we are). This post shows implementations of these algorithms, and how they can be used to build a simple calculator.
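The pen-and-paper method carries over directly when the "digits" are 32-bit pieces (often called limbs) instead of decimal digits. Here is a minimal sketch, assuming numbers are stored as arrays of uint32_t with the least significant limb first. The name bignum_add and this representation are illustrative assumptions, not necessarily what the post's calculator uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add two n-limb numbers a and b (least significant
   limb first). sum must have room for n + 1 limbs; the final carry is
   stored in sum[n]. */
static void bignum_add(const uint32_t *a, const uint32_t *b,
                       uint32_t *sum, size_t n)
{
    uint32_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        /* A 64-bit intermediate holds a[i] + b[i] + carry without overflow. */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;         /* low 32 bits: this position's digit */
        carry = (uint32_t)(t >> 32);  /* 0 or 1: "carry the one" */
    }
    sum[n] = carry;
}
```

Adding 1 to the two-limb number with both limbs equal to 0xffffffff ripples a carry through both positions and produces the three-limb result {0, 0, 1}, just as 999 + 1 does in decimal.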

These events started with John Regehr's Nibble Sort Contest about a year ago. The fastest solution (explained here) uses a sorting network, specifically this one:

(See for example this chapter from Introduction to Algorithms for information on sorting networks.)
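For readers new to sorting networks, here is a small sketch of the idea. A network is a fixed list of comparators, each of which puts the smaller of two wires' values on the lower-numbered wire. By the 0-1 principle, a network on n wires sorts every input if and only if it sorts all 2^n inputs made of zeros and ones. For 16 wires that is only 65,536 cases. The 4-wire network, the struct layout and the function names below are illustrative assumptions, not Green's network.

```c
#include <stdbool.h>
#include <stddef.h>

/* A comparator on wires i < j: afterwards, wire i holds the smaller value. */
struct comparator {
    int i, j;
};

/* Illustrative 4-input sorting network with 5 comparators. */
static const struct comparator net4[] = {
    {0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}
};

static void apply_network(int *v, const struct comparator *net, size_t len)
{
    for (size_t k = 0; k < len; k++) {
        int a = v[net[k].i];
        int b = v[net[k].j];
        v[net[k].i] = a < b ? a : b;
        v[net[k].j] = a < b ? b : a;
    }
}

/* Check a network on n wires (n <= 16) using the 0-1 principle:
   try every input consisting only of zeros and ones. */
static bool network_sorts(const struct comparator *net, size_t len, int n)
{
    for (unsigned bits = 0; bits < (1u << n); bits++) {
        int v[16];
        for (int w = 0; w < n; w++)
            v[w] = (bits >> w) & 1;
        apply_network(v, net, len);
        for (int w = 1; w < n; w++) {
            if (v[w - 1] > v[w])
                return false;
        }
    }
    return true;
}
```

Dropping the last comparator of net4 makes the check fail. For example, the input 0, 1, 0, 1 then comes out unchanged.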

This network can be found in TAOCP Section 5.3.4 (Vol 3, page 227), where Knuth describes its history:

"A 62-comparator sorting network for 16 elements was found by G. Shapiro in 1969, and this was rather surprising since Batcher's method (63 comparisons) would appear to be at its best when n is a power of 2. Soon after hearing of Shapiro's construction, M. W. Green tripled the amount of surprise by finding the 60-comparison sorter in Fig. 49."

Surprising indeed! How did Green come up with this network?

I found articles referencing a paper by Green: *"Some improvements in nonadaptive sorting algorithms" in Proceedings of the Sixth Annual Princeton Conference on Information Sciences and Systems, pages 387–391, 1972.* Would that have the explanation?

It turned out that this paper is impossible to find online. I tried checking local university libraries for a copy of the proceedings, but the WorldCat entry says the nearest copy is in Denver, Colorado, which would be a long trip for me.

I then had the idea of emailing authors of more recent articles that reference Green's paper; surely they must have read it, and maybe they could give me a copy. I had only sent two emails when one author got back to me and said he couldn't provide a copy, but he did point out that the physical proceedings were available from online used bookstores!

Eventually, the volume arrived:

Reading the paper was a bit of a disappointment. Green explains how the "front part" of the network (the comparators before the dotted line in the figure) is constructed, calling it an *orthogonal network*. However, the paper does not show the remaining 28 comparators, nor does it explain how they were found:

"For the case p = 4, where the orthogonal network has 16 inputs, there are 168 output vectors. The structure of these vectors is sufficiently complex that it takes some ingenuity to find a minimal set of cells that will resolve the residual ambiguities. We believe that 28 cells are necessary, leading to a 16-sorter with 32 + 28 = 60 cells. The latter 60-cell network (whose correctness has been verified by a computer program) contains three fewer cells than Batcher's 16-sorter."

Since Knuth includes Green's network in his book, maybe he would have some more information about this network and how Green came up with it. And even though Green's paper doesn't provide a full explanation for the network, perhaps it should still be referenced in the text?

I emailed Knuth on the evening of 15 December. Two days later, I received a message from his secretary, asking to confirm my postal address as Knuth had a reply for me. She said the reply would go out with the post that day.

Then followed a long wait, with unusually frequent checking of the mail.

Mid-January, I wrote to ask about the status of the letter. The secretary confirmed that the letter had been sent on 18 December, and there was no way to track it.

A few hours later, I received another email from a mysterious address:

Dear Hans,

Maybe that letter was lost in the mail --- however, the US Postal Service tends to get behind at that time of year, and Stanford also closes down for two or three weeks! So there still is hope.

I don't recall what I wrote, but I think it basically said that I happen to own Jeff Ullman's former copies of those Princeton symposia, and I had read the issue but didn't flag Mike Green's paper at the time. Now I've add a reference to it in the answer to exercise 5.3.4--32, and I also wrote a "check" for $2.56 as a thank-you for the omission. [See http://www-cs-faculty.stanford.edu/~knuth/boss.html for your account info.]

If another month goes by and you still haven't received the check, I can send a replacement (although neither is actually "negotiable"!). But I doubt if I can reproduce anything else that I might have written in my reply --- I have dozens of these things going out the door all the time. My files don't contain any other hints of how he came up with his network.

Happy 2016 to you! -- Don Knuth

This made me very happy (and considering his email policy, perhaps an email from Knuth is even more rare than a cheque?).

Another month went by, and the replacement cheque arrived:

It is now sitting in a frame, acting kind of as a substitute for the Ph.D. diploma I never pursued.

As for Green's sorting network, the mystery still remains: how was the network constructed? If I understand correctly, no one has been able to reproduce the same network since, and only more recently were other networks of the same size discovered (see Hugues Juillé's web page). With more ingenuity and 21st century computational power, is it possible to find a smaller network?

**Update 2021-05-13:** Jannis Harder has an interesting blog post about proving the optimality of Shapiro and Green's 11- and 12-element sorting networks.

I first started programming in QBasic. My father had brought home a used IBM PS/2 Model 50 from work, and I spent many hours writing programs that printed text in different colours whilst beeping the internal speaker.

Those programs could only run in the QBasic environment, which made them harder to share with friends. I wanted to turn my programs into executable files, like real programs, but couldn't figure out how. Since then, I have learned many things, including using a compiler to build executables, but I was always curious: how do executable files *really* work?

Programming with Ones And Zeros showed how computer programs are translated into ones and zeros, and how to run them in memory. This post explores how to put those ones and zeros into an executable, a runnable file. Like the previous post, it assumes basic knowledge of C, x86 assembly language and hexadecimal numbers. There are examples for Linux, Mac, and Windows.

*Update (July 2015):* Added analyses of the two winning solutions.

Being susceptible to nerd sniping, I couldn't get this problem out of my head, and I ended up spending two Saturday afternoons trying to implement a fast solution.

The first time, I set up my timing function wrong, and GCC optimized the whole thing away. It seemed as if my solution was no faster than John's reference implementation, and I gave up. Lesson 1: check the assembly to see what's going on.

Unable to let go, I returned to the problem the next weekend, fixed the timing function, and ended up with this solution:

```
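#include <stdint.h>

uint64_t nibble_sort_word(uint64_t arg)
{
    const uint64_t ones = 0x1111111111111111ULL;

    // If all 16 nibbles are equal, the word is already sorted. This case
    // is handled separately because a count of 16 would not fit in a nibble.
    if (arg == (arg & 0xf) * ones) {
        return arg;
    }

    // Counting sort: nibble v of count holds the number of occurrences
    // of the value v in arg.
    uint64_t count = 0;
    for (int i = 0; i < 16; i++) {
        count += 1ULL << (4 * (arg & 0xf));
        arg >>= 4;
    }

    // Build the result from the least significant nibble upwards: the value
    // i is written to every nibble from the current position up (later
    // iterations overwrite the higher ones), then mask skips past the n
    // nibbles that hold i.
    uint64_t mask = ~0ULL;
    uint64_t result = 0;
    for (int i = 0; i < 16; i++) {
        int n = count & 0xf;
        count >>= 4;
        result = (result & ~mask) | ((ones & mask) * i);
        mask <<= 4 * n;
    }

    return result;
}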
```
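Besides checking the assembly, it helps to have an obviously correct reference to compare a fast solution against. Below is a hypothetical sketch (nibble_sort_ref is not from the contest). It assumes the same ordering that the function above produces: the smallest nibble ends up in the least significant position.

```c
#include <stdint.h>

/* Hypothetical reference implementation: unpack the 16 nibbles,
   insertion-sort them, and pack them back with the smallest value
   in the least significant nibble. */
static uint64_t nibble_sort_ref(uint64_t arg)
{
    unsigned nib[16];

    for (int i = 0; i < 16; i++)
        nib[i] = (arg >> (4 * i)) & 0xf;

    for (int i = 1; i < 16; i++) {
        unsigned x = nib[i];
        int j = i - 1;
        while (j >= 0 && nib[j] > x) {
            nib[j + 1] = nib[j];
            j--;
        }
        nib[j + 1] = x;
    }

    uint64_t result = 0;
    for (int i = 0; i < 16; i++)
        result |= (uint64_t)nib[i] << (4 * i);
    return result;
}
```

A test loop can then assert that nibble_sort_word(x) == nibble_sort_ref(x) for many random x. Using the result also has the side benefit that the compiler cannot optimize the call away.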

I have always been fascinated by computers. As a child, I heard that they run "ones and zeros", and was curious as to what that actually meant. Eventually, I learned a bit of x86 assembly language. I think it was from an early version of Randall Hyde's The Art of Assembly Language and the now long-out-of-print Vitaly Maljugin et al., Revolutionary Guide to Assembly Language. For example, this x86 instruction moves the value 42 into the eax register:

```
movl $42, %eax
```

However, that is still a textual, human-readable, representation of the instruction. Where are the ones and zeros that the machine *actually* runs?

This post explores how to write machine code with 1's and 0's and run it on a computer. It assumes some familiarity with x86 assembly language, C, binary and hexadecimal numbers, how to use a compiler, etc. The examples have been tested on Linux and Mac OS X.
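As a small taste of the answer: movl $42, %eax uses the x86 "MOV r32, imm32" form. The opcode is 0xB8 plus the register number (0 for eax), followed by the 32-bit immediate in little-endian byte order. That gives the bytes B8 2A 00 00 00, or in binary 10111000 00101010 00000000 00000000 00000000. The sketch below encodes just this one instruction form. The function name is a hypothetical illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode "movl $imm, %eax" into out[0..4] and
   return the number of bytes written (always 5 for this form). */
static size_t encode_mov_eax_imm32(uint32_t imm, uint8_t out[5])
{
    out[0] = 0xb8 + 0;  /* opcode 0xB8 + register number; eax is 0 */
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(imm >> (8 * i));  /* little-endian immediate */
    return 5;
}
```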

Get the source:

```
$ git clone git://gcc.gnu.org/git/gcc.git
$ cd gcc
```

To use a specific version, check out one of the tags. For example:

```
$ git checkout gcc-4_9_0-release
```

One common problem when building GCC from source is not having the right versions of the GMP, MPFR and MPC libraries installed (or not being able to convince the build system that you do). The trick is to use this magic script, which downloads the required library sources into the source tree, where they will be built together with the rest of GCC (also see the Prerequisites for GCC page):

```
$ contrib/download_prerequisites
```

(On my machine I didn't have flex and bison installed, so note to self: # apt-get install flex bison.)

Now the fun part: configure and build. Configuration options are documented here.

```
$ mkdir build
$ cd build
$ ../configure --prefix=/path/to/gccprefix --enable-languages=c,c++ --enable-checking=release
$ make -j32
$ make install
```

After plenty of build output, there should now be a freshly baked gcc available in your selected prefix directory:

```
$ printf '#include <stdio.h>\nint main(void) { puts("hello, world!"); }\n' | /path/to/gccprefix/bin/gcc -xc - && ./a.out
```

Key concepts: there can be multiple *sessions* running at the same time. Each session has a number of separate *windows* (usually numbered 0, 1, 2, ...), which can be split into *panes*.

Commands are prefixed with CTRL-b by default (as opposed to screen which uses CTRL-a).

| Key | Command |
|---|---|
| d | Detach |
| $ | Rename current session |
| s | List sessions |
| L | Go to last session |
| , | Rename the window |
| n | Next window |
| p | Previous window |
| l | Last selected window |
| w | List windows |
| ? | Show commands |
| " | Split pane into top and bottom |
| % | Split pane into left and right |
| o | Move to next pane |
| ; | Move to previously active pane |
| CTRL-o | Rotate panes |
| ! | Break the current pane out into its own window |
| :new | Create new session |

In ~/.tmux.conf, to save more scrollback than the default 2000 lines:

set-option -g history-limit 100000

To attach to a specific session when starting: tmux attach -t session_name.

Pictures from the unboxing of a Happy Hacking Keyboard Professional 2:

- The Count of Corpus Christi (TX) (January 2007)
- The Adventure of the Spicy Blonde (November 2003)
- The Adventure of the Runaway Files (June 2003)
- The Mystery of the Red Worm (May 2003)
- April is the Cruelest Month (April 2003)
- Good Enough For Government Work (March 2003)
- The Adventure of the Arbitrary Archives (February 2003)
- The Case of the Evil Spambots (January 2003)
- The Case of the Duplicate UIDs (December 2002)
- The Adventure of the Misnamed Files (November 2002)

I went with the Kryptonite New York Fahgettaboudit Mini. At 2 kilos, the lock is a serious chunk of metal; it feels like complete overkill, which is just the way I like it.

I was worried that the lock would be too small for my favourite way of locking up the bike: through the seat stays, rear wheel and around the bike stand. Turns out it works fine:

And the message is clear:

This is how I validate my HTML files with curl:

```
$ curl --form "uploaded_file=@file.html;type=text/html" --silent \
--show-error http://validator.w3.org/check | grep -qF '[Valid]'
```

and CSS:

```
$ curl --form "file=@file.css;type=text/css" --silent --show-error \
http://jigsaw.w3.org/css-validator/validator | \
grep -q 'Congratulations! No Error Found.'
```

*Update (August 2016):* Updated the plugin for Clang 3.9.

Steps to download and build LLVM and Clang: (For more info, see Clang - Getting Started.)

```
$ git clone -b release_39 http://llvm.org/git/llvm.git
$ cd llvm
$ git clone -b release_39 http://llvm.org/git/clang.git tools/clang
$ mkdir build
$ cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Release ..
$ ninja
```

This example plugin can be used to check that redeclarations of a function use the same parameter names, avoiding bugs like this:

```
/* In the .h file: */
int divide(int numerator, int denominator);

/* In the .c file: */
int divide(int denominator, int numerator)
{
    return numerator / denominator;
}
```

It is common that a function is declared more than once, for example first in a .h file, and then again when it is defined in a .c file. It is not uncommon to use different names for the parameters in those cases; for example, the declaration in the header file might use more descriptive parameter names, and the definition might use shorter names. However, different names could also suggest that there is a bug, as in the example above. One way to check for this kind of parameter name mismatch is to use a Clang plugin.

To write a plugin, one must implement the PluginASTAction interface. The code below implements our example plugin: (The source is available for download here: clang-plugin-example.tar.gz.)

```
#include "clang/AST/AST.h"
#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendPluginRegistry.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;

namespace {

// RecursiveASTVisitor does a pre-order depth-first traversal of the
// AST. We implement VisitFoo() methods for the types of nodes we are
// interested in.
class FuncDeclVisitor : public RecursiveASTVisitor<FuncDeclVisitor> {
public:
  explicit FuncDeclVisitor(DiagnosticsEngine &d) : m_diag(d) {}

  // This function gets called for each FunctionDecl node in the AST.
  // Returning true indicates that the traversal should continue.
  bool VisitFunctionDecl(const FunctionDecl *funcDecl) {
    if (const FunctionDecl *prevDecl = funcDecl->getPreviousDecl()) {
      // If one of the declarations is without prototype, we can't compare them.
      if (!funcDecl->hasPrototype() || !prevDecl->hasPrototype())
        return true;

      assert(funcDecl->getNumParams() == prevDecl->getNumParams());

      for (unsigned i = 0, e = funcDecl->getNumParams(); i != e; ++i) {
        const ParmVarDecl *paramDecl = funcDecl->getParamDecl(i);
        const ParmVarDecl *previousParamDecl = prevDecl->getParamDecl(i);

        // Skip unnamed parameters, but keep checking the remaining ones.
        if (paramDecl->getName().empty() ||
            previousParamDecl->getName().empty())
          continue;

        if (paramDecl->getIdentifier() != previousParamDecl->getIdentifier()) {
          unsigned warn = m_diag.getCustomDiagID(DiagnosticsEngine::Warning,
                                                 "parameter name mismatch");
          m_diag.Report(paramDecl->getLocation(), warn);

          unsigned note = m_diag.getCustomDiagID(DiagnosticsEngine::Note,
              "parameter in previous function declaration was here");
          m_diag.Report(previousParamDecl->getLocation(), note);
        }
      }
    }

    return true;
  }

private:
  DiagnosticsEngine &m_diag;
};

// An ASTConsumer is a client object that receives callbacks as the AST is
// built, and "consumes" it.
class FuncDeclConsumer : public ASTConsumer {
public:
  explicit FuncDeclConsumer(DiagnosticsEngine &d)
      : m_visitor(FuncDeclVisitor(d)) {}

  // Called by the parser for each top-level declaration group.
  // Returns true to continue parsing, or false to abort parsing.
  virtual bool HandleTopLevelDecl(DeclGroupRef dg) override {
    for (Decl *decl : dg) {
      m_visitor.TraverseDecl(decl);
    }

    return true;
  }

private:
  FuncDeclVisitor m_visitor;
};

class ParameterNameChecker : public PluginASTAction {
protected:
  // Create the ASTConsumer that will be used by this action.
  // The StringRef parameter is the current input filename (which we ignore).
  std::unique_ptr<ASTConsumer> CreateASTConsumer(CompilerInstance &ci,
                                                 llvm::StringRef) override {
    return llvm::make_unique<FuncDeclConsumer>(ci.getDiagnostics());
  }

  // Parse command-line arguments. Return true if parsing succeeded, and
  // the plugin should proceed; return false otherwise.
  bool ParseArgs(const CompilerInstance&,
                 const std::vector<std::string>&) override {
    // We don't care about command-line arguments.
    return true;
  }
};

} // end namespace

// Register the PluginASTAction in the registry.
// This makes it available to be run with the '-plugin' command-line option.
static FrontendPluginRegistry::Add<ParameterNameChecker>
X("check-parameter-names", "check for parameter names mismatch");
```

To compile the plugin, some flags need to be specified so the compiler can find the LLVM and Clang header files, etc. The following Makefile does the job:

```
# The name of the plugin.
PLUGIN = ParameterNameChecker

# LLVM paths. Note: you probably need to update these.
LLVM_DIR = /home/hans/llvm
LLVM_BUILD_DIR = $(LLVM_DIR)/build
CLANG_DIR = $(LLVM_DIR)/tools/clang
CLANG = $(LLVM_BUILD_DIR)/bin/clang

# Compiler flags.
CXXFLAGS = -I$(LLVM_DIR)/include -I$(CLANG_DIR)/include
CXXFLAGS += -I$(LLVM_BUILD_DIR)/include -I$(LLVM_BUILD_DIR)/tools/clang/include
CXXFLAGS += -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -Wno-long-long
CXXFLAGS += -fPIC -fvisibility-inlines-hidden
CXXFLAGS += -fno-exceptions -fno-rtti -std=c++11
CXXFLAGS += -Wall

# Linker flags.
LDFLAGS = -shared -Wl,-undefined,dynamic_lookup

# Note: recipe lines must be indented with a tab character.
$(PLUGIN).so : $(PLUGIN).o
	$(CXX) $(LDFLAGS) -o $(PLUGIN).so $(PLUGIN).o

$(PLUGIN).o : $(PLUGIN).cc
	$(CXX) $(CXXFLAGS) -c $(PLUGIN).cc -o $(PLUGIN).o

check : $(PLUGIN).so
	$(CLANG) -c -Xclang -load -Xclang ./$(PLUGIN).so \
		-Xclang -add-plugin -Xclang check-parameter-names test.c

clean :
	rm -fv $(PLUGIN).o $(PLUGIN).so test.o
```

Mac users may want to modify the file to use -dynamiclib rather than -shared in LDFLAGS, and change the file extension from .so to .dylib.

The plugin can be run like this: (-Xclang is used to prefix Clang's internal frontend options.)

```
$ clang -c -Xclang -load -Xclang ./ParameterNameChecker.so -Xclang -add-plugin \
-Xclang check-parameter-names test.c
```

Bash's history facilities are described in Chapter 9 of the Bash manual (PDF).

Bash keeps track of the previously issued commands. When Bash starts, it reads the command history into memory from a file, typically ~/.bash_history, and makes those commands available via the up arrow, etc. (see below).

When Bash terminates, it writes the last $HISTSIZE commands into the history file. The file gets overwritten by default. This means that if you open two shells, issue some commands in the first, close that, issue some commands in the second and close it, the commands issued in the first shell instance will not be in the history file, as they have been overwritten when the second shell closed.

To keep the commands from every shell, not just the last one to exit, Bash provides the histappend option. When this option is set, commands are *appended* to the history file instead of overwriting it when the shell is closed:

shopt -s histappend

That might be worth adding to ~/.bashrc. (Debian and Ubuntu seem to do this by default.)

We might also want to increase the number of commands saved (it was set to 500 by default on my system):

export HISTSIZE=10000

export HISTFILESIZE=10000

The history command can be used to inspect and modify the contents of the command list:

- history: shows the history list.
- history -c: clears the history list.
- history -d n: deletes history line n.

The most basic way of using the command history is to use the up and down arrow keys (or CTRL-p and CTRL-n) to select a previous command.

Another way is to use the exclamation mark (the history expansion character):

- !foo: runs the most recent command starting with "foo"
- !!: runs the previous command
- !n: runs command line n
- !-n: runs the nth previous command (!! equals !-1)

Those are called *event designators*. We can combine an event designator with a *word designator* to access only part of a previous command, and maybe reuse it in another one. For example, we might do something like this:

- vim /some/file: edit a file.
- git add !$: add that file to the git index.
- git commit -m "Message": commit.

In the example above, !$ gets expanded to the last argument of the previous command, i.e. /some/file.

We put a : after an event designator and use a word designator to select a specific part of a command. Some of the word designators are:

- n: the nth word (0 is typically the command name)
- ^: the first argument (i.e. word 1)
- $: the last word
- x-y: words x to y
- *: all words except the 0th (i.e. all the arguments)

So to get the third word of the previous command, we can use !!:3.

The colon can be omitted if the word designator starts with ^, $ or *. If we do not provide an event designator, the previous command is used as the event. This allows for shortcuts such as:

- !$: the last argument of the previous command
- !^: the first argument of the previous command
- !*: all arguments of the previous command

Another, more interactive, way of using the history is to use incremental search.

Start searching with CTRL-r. Bash will search the history as you type. Press CTRL-r again to go to an earlier match, and CTRL-s for a more recent match.

I use this all the time. For example, rather than doing !make to execute the previous make command, I tend to do CTRL-r make, which lets me inspect the command before executing it. If I then realize that I actually wanted the second-to-last make command, I can just press CTRL-r again to get it.

This is described in Section 3.6 of the Bash manual.

It is relatively easy to remember that > file is used to redirect stdout to file by overwriting it, and that >> file appends to the file.

To redirect stderr instead, we put the number of that file descriptor (2) in front of the operator: 2> file and 2>> file.

The one that is harder to remember is how to redirect one stream to the other: 2>&1 will redirect stderr to stdout.

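To redirect both stderr and stdout to the same file: >file 2>&1. Note that it is easy to get this backwards: redirections are processed from left to right, so 2>&1 >file first points stderr at the current stdout (typically the terminal), and only then redirects stdout to the file.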
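At the system-call level, n>&m makes file descriptor n a copy of descriptor m, which is what POSIX dup2(m, n) does. That explains why the order matters. Here is a minimal sketch of what >path 2>&1 amounts to; the function name redirect_stdout_and_stderr is a hypothetical illustration, not Bash's actual code.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: the equivalent of ">path 2>&1", performed in the
   same left-to-right order the shell uses. Returns 0 on success. */
static int redirect_stdout_and_stderr(const char *path)
{
    /* ">path": open the file and make fd 1 refer to it. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (dup2(fd, STDOUT_FILENO) < 0) {
        close(fd);
        return -1;
    }
    if (fd != STDOUT_FILENO)
        close(fd);

    /* "2>&1": make fd 2 a copy of whatever fd 1 refers to now, i.e. the
       file. Doing this step first (as in "2>&1 >path") would copy the
       original stdout instead, typically the terminal. */
    if (dup2(STDOUT_FILENO, STDERR_FILENO) < 0)
        return -1;
    return 0;
}
```

Because both descriptors then share a single open file description, output written to stdout and stderr is interleaved in the file in the order it was written.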

This is described in Chapter 8. Bash uses a library called *Readline* for this. The same library is used in other applications as well, so these commands are useful beyond Bash.

| Key | Action |
|---|---|
| CTRL-_ | Undo |
| CTRL-e | Move to end of line |
| CTRL-a | Move to start of line |
| ALT-f | Move forward one word |
| ALT-b | Move backward one word |
| CTRL-k | Kill the text until end of line |

My favourite: fc ("fix command") opens the previous command in $EDITOR and then runs it after editing.

I can never remember where to put the semicolons in a Bash for-loop. They are described here, and are supposed to look like this:

for name [ [in [words ...] ] ; ] do commands; done

Process substitution is handy when diffing the output of two commands. For example:

diff -u <(gcc -S -O2 -o - a.c | head) <(gcc -S -O3 -o - a.c | head)

Pictures from the unboxing of a Soekris net5501-70:

The coloured output is achieved using ANSI escape codes. To capture the diagnostics, I ran Clang like this:

```
$ clang -Xclang -fcolor-diagnostics -c my_program.c 2> foo.diag
```

This command tries to compile the file my_program.c, and outputs any warnings or errors to foo.diag. The -Xclang -fcolor-diagnostics flags are used to force the compiler to output coloured diagnostics even though the output is going to a file rather than the terminal.

I was learning about Flex, a lexical analyser generator, at the time, and wrote the following program: ansi2tex.l.

To compile it, first use flex to generate the C program, and then compile that with your favourite compiler:

```
$ flex -o ansi2tex.c ansi2tex.l
$ gcc -o ansi2tex ansi2tex.c
```

To use the program to convert a diagnostic to TeX, run it like this:

```
$ ./ansi2tex < foo.diag > foo.tex
```

That will write the resulting code to foo.tex. To include that in your document, you might do something like:

```
\documentclass{article}
\usepackage{color}
\begin{document}
\input{foo.tex}
\end{document}
```

The results look like this: (PDF available here.)

I am not a TeX expert, so I am sure there are more elegant ways of converting text with ANSI escape codes to LaTeX, but this has served my purposes so far. Feel free to use, modify and distribute this program as you wish if you find it useful.
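The ansi2tex.l source isn't reproduced here. To illustrate the kind of translation involved, here is a hypothetical sketch in plain C; it is not the author's Flex program. It handles only SGR sequences of the form ESC [ n;n;... m. Code 1 opens a {\bfseries group, codes 30 to 37 open a {\color{...} group using the eight colour names predefined by the LaTeX color package, and 0 (reset) closes every open group. The function name, the grouping scheme and the decision not to escape TeX special characters are all assumptions.

```c
#include <stdio.h>

/* The eight basic ANSI colours, indexed by (code - 30). */
static const char *const colour_names[8] = {
    "black", "red", "green", "yellow", "blue", "magenta", "cyan", "white"
};

/* Copy in to out, turning ANSI SGR sequences into TeX groups.
   Unknown codes are ignored; TeX special characters are NOT escaped. */
static void ansi_to_tex(const char *in, FILE *out)
{
    int open_groups = 0;

    while (*in != '\0') {
        if (in[0] != '\x1b' || in[1] != '[') {
            fputc(*in++, out);
            continue;
        }
        in += 2;  /* skip ESC [ */
        for (;;) {
            int n = 0;  /* an empty parameter counts as 0 (reset) */
            while (*in >= '0' && *in <= '9')
                n = n * 10 + (*in++ - '0');
            if (n == 0) {
                for (; open_groups > 0; open_groups--)
                    fputc('}', out);
            } else if (n == 1) {
                fputs("{\\bfseries ", out);
                open_groups++;
            } else if (n >= 30 && n <= 37) {
                fprintf(out, "{\\color{%s}", colour_names[n - 30]);
                open_groups++;
            }
            if (*in != ';')
                break;
            in++;  /* skip ';' and parse the next parameter */
        }
        if (*in == 'm')
            in++;
    }
    for (; open_groups > 0; open_groups--)
        fputc('}', out);
}
```

For example, the input ESC[0;1;31merror:ESC[0m x becomes {\bfseries {\color{red}error:}} x.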

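The data looked like this (results.dat, one row per line with whitespace-separated columns):

| "Benchmark" | "Compile" | "HW" | "Exec" |
|---|---|---|---|
| "$A$" | 12.25 | 165.36 | 80.77 |
| "$A_{opt}$" | 15.18 | 149.68 | 57.16 |
| "$B$" | 12.28 | 85.46 | 38.2 |
| "$B_{opt}$" | 15.15 | 78.13 | 26.38 |
| "$C$" | 11.07 | 0 | 188.06 |
| "$C_{opt}$" | 14.00 | 0 | 110.98 |
| "$D$" | 9.99 | 0 | 335.11 |
| "$D_{opt}$" | 12.65 | 0 | 198.63 |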

The first row contains the column headers. The subsequent rows contain three measurements, "Compile", "HW", and "Exec", for different benchmarks. For each benchmark, we want the three time values stacked on top of each other to form the bar for that benchmark.

Note that TeX notation is used, i.e. we can write $A_{opt}$, and LaTeX will typeset it as expected.

The Gnuplot code looked like this (results.gnuplot):

```
set terminal epslatex newstyle size 12cm,8cm 10
set output "results.gnuplot.tex"
set boxwidth 0.9 absolute
set xtics nomirror
set grid ytics
set style data histogram
set style histogram rowstacked
set style fill pattern 0 border -1
set key left top vertical invert
set ylabel "Time (s)"
set key autotitle columnheader height 1
plot "results.dat" using 2, '' using 3, '' using 4 :xticlabels(1)
```

Running the program (gnuplot results.gnuplot) produces two output files: results.gnuplot.eps and results.gnuplot.tex. To use it in a LaTeX document, one might write something like this:

```
\documentclass{article}
\usepackage{graphicx}
\begin{document}
\begin{figure}[ht]
\begin{center}
\resizebox{\width}{!}{\input{results.gnuplot.tex}}
\end{center}
\caption{Benchmarks before and after optimisation.}
\label{fig:results}
\end{figure}
\end{document}
```

This fails with the following error:

! LaTeX Error: Cannot determine size of graphic in results.gnuplot (no Bounding Box).

The problem is that results.gnuplot.tex contains the code \includegraphics{results.gnuplot}. Since there is a file named results.gnuplot, that will be included instead of results.gnuplot.eps. We can fix this, either by editing the file manually, or with Sed:

```
$ sed -i 's/includegraphics{\([^}]*\)}/includegraphics{\1.eps}/' \
results.gnuplot.tex
```

We can now compile our LaTeX document to PDF like this:

```
$ latex histogram.tex
$ dvips histogram.dvi
$ ps2pdf histogram.ps
```

The generated PDF can be viewed here. The plot looks like this:

The careful reader might notice that the PDF title is wrong: the name of the Gnuplot output, results.gnuplot.tex, has overridden the title of our document.

There is a discussion in comp.graphics.apps.gnuplot about this. One way to fix it is to manually remove the /Title, /Subject, /Author, etc. entries from the EPS file generated by Gnuplot. Or with Sed:

```
$ sed -i 's/\/\(Title\|Subject\|Creator\|Author\|CreationDate\).*//' \
results.gnuplot.eps
```

So to create our plot from results.gnuplot and results.dat, and compile our LaTeX document, we run the following commands:

```
$ gnuplot results.gnuplot
$ sed -i 's/includegraphics{\([^}]*\)}/includegraphics{\1.eps}/' \
results.gnuplot.tex
$ sed -i 's/\/\(Title\|Subject\|Creator\|Author\|CreationDate\).*//' \
results.gnuplot.eps
$ latex histogram.tex
$ dvips histogram.dvi
$ ps2pdf histogram.ps
```

This may seem like a lot of work, but if one uses a Makefile or shell script to compile the document, it shouldn't be any trouble.

Sometimes it can be annoying to have to put the data in a separate file from the Gnuplot script, especially when the data is small.

Using the special - filename for the plot command (see help special-filenames) makes Gnuplot read the data directly from the lines after the command. This is similar to how the Unix << heredoc works. Use e on a line of its own to signal the end of the inline data.

]]>We know that w moves forward a word, but W moves forward a word without stopping at symbols and punctuation.

uu is a classic trick to move to the position of the last edit by undoing it, and then undoing the undo. This doesn't work in Vim, where you have infinite undo. Use `. to jump to the mark of the last change instead.

The + (or just enter) and - characters move the cursor to the first character of the next and previous line, respectively.

0 and $ move to the beginning and end of a line, respectively. ^ moves to the first nonblank character. n| moves to column n, which is useful e.g. when investigating a compiler error at a specific line and column.

Use ; to repeat previous f (find-character-in-line) command.

Marks. Use mx to set mark x. Use `x to jump to mark x (or 'x to jump to the first non-blank character of that line). Vim sets lots of special marks; use :marks to show a list. I find . (last change) and ^ (end of last insert) most useful.

Repeat the last edit with . (dot). I wish there was a command to "repeat the last move".

Just as move commands take numeric prefixes, so do insert commands. To append ten x characters to a line, type 10Ax followed by Escape. The r command takes a count too: to replace five characters with x, do 5rx.

Registers: The most recent deletions are stored in registers 1 to 9. Registers a to z are available for general use. To paste from a certain register x, use "xp. To yank the current line into register x, use "xyy. The thing to remember is to prefix with "x. There are many special registers; perhaps most notably, * is the system clipboard.

Filtering text through a command is super useful. I tend to select a line range with visual mode, and then pipe through sort for example, with !sort. Useful for sorting include directives in C and C++.
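When ! is pressed in visual mode, Vim prefills the command line with :'<,'>!, so the full command is :'<,'>!sort. The selected lines go to sort on standard input and are replaced by its output. The effect can be previewed from the shell. The include lines below are made up, and LC_ALL=C is set here so that sort compares plain bytes; inside Vim, the result follows your locale.

```shell
#!/bin/sh
# A block of include directives, as it might be selected in visual mode.
unsorted=$(printf '%s\n' '#include <string.h>' '#include <assert.h>' '#include <stdio.h>')

# The selection goes to sort's standard input; its output replaces the lines.
# LC_ALL=C makes sort compare bytes, independent of the user's locale.
sorted=$(printf '%s\n' "$unsorted" | LC_ALL=C sort)

printf '%s\n' "$sorted"
# #include <assert.h>
# #include <stdio.h>
# #include <string.h>
```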

Use << and >> to shift left and right by the shiftwidth amount. Always have shiftwidth set to the same value as tabstop.

In insert mode, use CTRL-n for auto completion. Use CTRL-n and CTRL-p to cycle through the matches and select one.

Change case with ~. This does not make sense with Swedish keyboard layout.

Use qx to start recording a macro into register x. Stop recording by pressing q again. Use @x to replay the macro in register x. Use @@ to repeat the last macro replay.

I tend to remember CTRL-u (up) and CTRL-d (down) which scroll by half screens. CTRL-f (forward) and CTRL-b (backward) scroll whole screens.

scroll-cursor: z followed by Enter makes the current line the top line, z. makes the current line the center of the screen, and z- makes the current line the bottom of the screen. Giving a count makes z use a specific line rather than the current one, e.g. 42z. puts the cursor on line 42 and makes that line the center of the screen.

To do the opposite of moving the screen around a certain line, we can move to a certain line on the screen: H (home) moves cursor to the first line on the screen, M (middle) moves to the middle line, and L (last) moves to the last line.

Edit a file in a new tab with :tabe. Use gt to go to the next tab, and gT to go to the previous. Use ngt to go to tab n. Use :tabl to go to the last tab.

Use vim -p file1 file2 to open files in separate tabs.

It can be very useful to set different options for different directories, for example when working on projects with different style guidelines:

```
autocmd BufNewFile,BufRead /path1/* set tabstop=4 shiftwidth=4
autocmd BufNewFile,BufRead /path2/* set tabstop=2 shiftwidth=2
```

The LLVM project has some useful suggestions for .vimrc here. This is the part of it that I like to use:

```
" Highlight trailing whitespace and lines longer than 80 columns.
highlight LongLine ctermbg=DarkYellow guibg=DarkYellow
highlight WhitespaceEOL ctermbg=DarkYellow guibg=DarkYellow
if v:version >= 702
  " Lines longer than 80 columns.
  au BufWinEnter * let w:m0=matchadd('LongLine', '\%>80v.\+', -1)

  " Whitespace at the end of a line. This little dance suppresses
  " whitespace that has just been typed.
  au BufWinEnter * let w:m1=matchadd('WhitespaceEOL', '\s\+$', -1)
  au InsertEnter * call matchdelete(w:m1)
  au InsertEnter * let w:m2=matchadd('WhitespaceEOL', '\s\+\%#\@<!$', -1)
  au InsertLeave * call matchdelete(w:m2)
  au InsertLeave * let w:m1=matchadd('WhitespaceEOL', '\s\+$', -1)
else
  au BufRead,BufNewFile * syntax match LongLine /\%>80v.\+/
  au InsertEnter * syntax match WhitespaceEOL /\s\+\%#\@<!$/
  au InsertLeave * syntax match WhitespaceEOL /\s\+$/
endif
```

Set guioptions to remove some components from the gvim GUI. Hiding the menus, buttons, etc. that you never use gives a cleaner experience. In my .vimrc, I normally do:

```
:set guioptions-=m "remove menu bar
:set guioptions-=T "remove tool bar
```

Ex command-lines start with a : followed by a line range. No range means the current line. The range 1,3 means lines 1 to 3. $ is the last line in the file. % is equal to 1,$, that is, all lines in the file. We can also use g/pattern/ to select all lines that contain pattern. (Strictly speaking, :g is a command of its own that runs another command on every matching line, but it works much like a range.)

The line range is followed by a command. If we do not specify a command, we get the default command, p, which prints the lines. The command I use the most is s for substitution. E.g. %s/foo/bar/ substitutes foo with bar on every line. Add the g flag after the last / to allow for more than one match in each line. There is also d for delete.

Combining with the g line range thingy, we can do :g/patt/s/foo/bar/g, which will replace all occurrences of foo with bar on lines that contain patt.
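sed uses the same kind of line addressing, so these commands can also be tried outside Vim. Here is a small sketch, assuming GNU sed and a throwaway file named sample.txt. sed applies a command to every line by default, so % has no sed counterpart, but numeric ranges and /pattern/ addresses carry over directly.

```shell
#!/bin/sh
# Throwaway input file (hypothetical name).
printf '%s\n' 'foo foo' 'patt foo foo' 'foo patt foo' > sample.txt

# Like :1,2s/foo/bar/ in Vim: lines 1 to 2, first match on each line only.
first=$(sed '1,2s/foo/bar/' sample.txt)

# Like :g/patt/s/foo/bar/g: every match, but only on lines containing patt.
global=$(sed '/patt/s/foo/bar/g' sample.txt)

printf '%s\n\n%s\n' "$first" "$global"
```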

The search patterns are basically ordinary regular expressions. Two very useful metacharacters are \< and \>, which force the match to be at the beginning or end of a word, respectively.
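GNU sed supports the same \< and \> anchors as an extension to POSIX regular expressions, so the difference is easy to see from the shell. In Vim, :s/\<foo\>/bar/g behaves the same way on this line.

```shell
#!/bin/sh
# Without anchors, foo also matches the start of food.
plain=$(echo 'foo food' | sed 's/foo/bar/g')
# With \< and \>, only the whole word foo is replaced.
word=$(echo 'foo food' | sed 's/\<foo\>/bar/g')
echo "$plain"   # bar bard
echo "$word"    # bar food
```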

Other commands. :r reads a file and inserts it at the line after the cursor. :w file writes the current buffer to a file. It is also possible to do :w >>file to append to a file.

Combine r with ! to read the results of a command into the file: :r !hostname.

Use :set to show which options you have set (i.e., options whose values differ from their defaults). Use :set option? to show the value of some option.

Abbreviations. Set with :ab foo bar. When you type foo, it will be expanded to bar. Remove abbreviation with :unab foo, show all abbreviations with :ab.

Control whether searches wrap or not with the wrapscan option, i.e. :set wrapscan or :set nowrapscan.

I often use vim +n filename to open a file at a specific line. Omit n to open at the last line. Use vim +/word filename to open at the first search match for word. Use -R to open read-only.

Some more Vim command-line flags: -b binary edit mode, -c command execute an ex command (you can have more than one of these).

Press CTRL-v to enter blockwise visual mode. This lets you select a rectangular area, which you can then apply an edit operation to. Sometimes called column editing.

Use CTRL-g to show the file name and the total number of lines. Use g CTRL-g to show word count, byte count, etc.
