hanshq.net

Algorithms for Roman Numerals
(12 June 2016)

Converting between regular (Hindu-Arabic) and Roman numerals seems to be a common programming exercise.

However, one annoying aspect is that the system is not always defined very well. The way I was taught about Roman numerals in school was something like the following. Roman numerals are written left-to-right using letters that each represent different values:

Letter M D C L X V I
Value 1000 500 100 50 10 5 1

The values are usually just added together (an additive system); however, when a smaller value is followed by a larger value, the smaller one is subtracted from the larger (subtractive notation). For example, 16 is written as XVI, and 14 as XIV.

I soon learned, after being told off for burning IIV onto the face of the clock I made in woodworking class, that there are rules for when one is supposed to use the subtractive notation. But it was never made clear exactly what those rules were.

This article examines how some popular programs implement conversion to Roman numerals.

Table of Contents

History

How did the Romans do it? According to D. E. Smith History of Mathematics, Vol II, historical use seems to have allowed forms which we wouldn't use today (such as VIIII) while avoiding forms we do use (like MCM):

[..] the Romans did not commonly use the subtractive principle in this case, preferring the form IIII to the form IV. They used the principle more frequently in the case of 9, but even here they wrote VIIII oftener than IX. In the case of 400 they usually wrote CCCC, but occasionally they used CD. (p. 59)

Even when the subtractive principle was used, no fixed standard was recognized. The number 19 was commonly written XIX, but not infrequently IXX. We also find IIX for 8 and IIXX for 18, but these were not so common. It is quite rare to find CD for 400 or CM for 900, and forms like MCM and DCD were never used in ancient or medieval times. In general, therefore, it may be said that the Romans recognized the subtractive principle but did not make much use of it. (p. 60)

Modern Use

While there does seem to be a real convention for modern use of Roman numerals, it's annoyingly hard to find authoritative guidance.

The Chicago Manual of Style (16th edition, section 9.64) provides a table of example Roman numerals, but offers no general rule for how to form them. About subtractive notation, it says:

The use of subtrahends (back counters) was introduced during the Renaissance. Note that IIII, not IV, still appears on some clock faces. The Romans would have expressed the year 1999, for example, as MDCCCCLXXXXVIIII. A more modern form, approved by the US government and accepted (if reluctantly) by classical scholars, is MCMXCIX (not MIM, considered a barbarism).

Note that it offers no explanation as to why MIM is considered a barbarism (or which US government body approves Roman numerals).

While it offers no particular justification for it, the Wikipedia article does provide the following guidance:

[..] subtractive notation is often used as follows: I placed before V or X indicates one less [..] X placed before L or C indicates ten less [..] C placed before D or M indicates a hundred less

By those rules, MIM would indeed not be allowed, since they only mention placing I before V or X, not any of the other letters such as M.

One page that provides some proper hard rules is Patrick Morandi's course material page on Roman numerals:

To avoid ambiguity, the only pairs of numerals that use this subtraction rule are IV, IX, XL, XC, CD [and] CM.

Now, that's a well expressed rule! It also coincides perfectly with the one from Wikipedia.

Implementations

Programs for presentation of documents sometimes deal with Roman numerals. Below are four different implementations of conversion to Roman numerals in such programs. They all follow the rules above, but use different algorithms to apply them.

LibreOffice

LibreOffice is a popular open-source office suite. Its word processor, Writer, supports using Roman numerals in numbered lists.

The code below (from numitem.cxx) implements the conversion. It was part of the initial commit to that repository, from around the time when StarOffice was open-sourced in 2000, so it is likely to have been written much earlier than that.

OUString SvxNumberFormat::CreateRomanString( sal_uLong nNo, bool bUpper )
{
    nNo %= 4000;            // more can not be displayed
//      i, ii, iii, iv, v, vi, vii, vii, viii, ix
//                          (Dummy),1000,500,100,50,10,5,1
    const char *cRomanArr = bUpper
                        ? "MDCLXVI--"   // +2 Dummy entries!
                        : "mdclxvi--";  // +2 Dummy entries!

    OUString sRet;
    sal_uInt16 nMask = 1000;
    while( nMask )
    {
        sal_uInt8 nZahl = sal_uInt8(nNo / nMask);
        sal_uInt8 nDiff = 1;
        nNo %= nMask;

        if( 5 < nZahl )
        {
            if( nZahl < 9 )
                sRet += OUString(*(cRomanArr-1));
            ++nDiff;
            nZahl -= 5;
        }
        switch( nZahl )
        {
        case 3:     { sRet += OUString(*cRomanArr); }
        case 2:     { sRet += OUString(*cRomanArr); }
        case 1:     { sRet += OUString(*cRomanArr); }
                    break;

        case 4:     {
                        sRet += OUString(*cRomanArr);
                        sRet += OUString(*(cRomanArr-nDiff));
                    }
                    break;
        case 5:     { sRet += OUString(*(cRomanArr-nDiff)); }
                    break;
        }

        nMask /= 10;            // for the next decade
        cRomanArr += 2;
    }
    return sRet;
}

The variable names use Hungarian notation: a prefix that indicates the variable's type (n for integer, c for character, s for string, etc.) nZahl reveals the code's German roots: Zahl means number, and StarOffice was originally written by a German company.

Let's walk through the conversion of 379 to lower-case Roman numerals:

  • Inputs larger than 3999 are truncated, cRomanArr set to point at an array of lower-case letters starting with 'm' and nMask is set to 1000. nMask presumably represents the value of the letter currently pointed at by cRomanArr.
  • nZahl is set to how many times the current Roman numeral fits in nNo. In this case, 1000 fits 0 times in 379, so nZahl = 0.
  • Since nZahl is zero, the if- and switch-statements won't have any effect in this first iteration. In the next iteration, nNo is still 379, nMask is 100, and cRomanArr points to 'c'. It seems cRomanArr always points to a power-of-ten letter.
  • This time, the switch statement kicks in and prints three 'c' letters. For the following iteration, nNo is reduced to 79, nMask is 10, and cRomanArr points to 'x'.
  • Now the if-statement comes into play. nZahl = 7 ('x' fits in 79 7 times). Since cRomanArr always points at a power of ten, cRomanArr-1 will always point at a character with value five times the current one. In the if-statement, an 'l' is printed, nZahl is reduced by five, and, in the switch below, two 'x' letters will be printed.
  • For the final 9, cRomanArr points to 'i', the if-statement doesn't print anything, but nDiff gets incremented, so case 4 in the switch ends up printing an 'i' followed by 'x'.

KHTML / WebKit

Web browsers support using Roman numerals for ordered lists, either through a type attribute on the <ol> HTML element, or the list-style-type CSS property.

The code below is from Lars Knoll's initial KHTML commit. It builds on a version previously committed by Martin Jones to khtmlw (KDE HTML Widget, the precursor to KHTML) in 1997.

Today, the same code is used in WebKit and in Blink, the rendering engines used by Apple's Safari and Google Chrome.

QString toRoman( int number, bool upper )
{
    QString roman;
    QChar ldigits[] = { 'i', 'v', 'x', 'l', 'c', 'd', 'm' };
    QChar udigits[] = { 'I', 'V', 'X', 'L', 'C', 'D', 'M' };
    QChar *digits = upper ? udigits : ldigits;
    int i, d = 0;

    do
    {
	int num = number % 10;

	if ( num % 5 < 4 )
	    for ( i = num % 5; i > 0; i-- )
		roman.insert( 0, digits[ d ] );

	if ( num >= 4 && num <= 8)
	    roman.insert( 0, digits[ d+1 ] );

	if ( num == 9 )
	    roman.insert( 0, digits[ d+2 ] );

	if ( num % 5 == 4 )
	    roman.insert( 0, digits[ d ] );

	number /= 10;
	d += 2;
    }
    while ( number );

    return roman;
}

This code builds the Roman numeral from right to left, considering the least significant decimal digit of number in each iteration of the loop and adding letters to the front of the result string.

d is an index into the array of letters. Since it always points to a power-of-ten letter, d + 1 points to a letter worth five times as much, and d + 2 to the next power of ten.

Four if-statements are used to decide what letters to add to the result. The first one decides how many of the current character to insert with additive notation. The next two decide if a letter of the next multiple-of-five or power-of-ten should be inserted, and the last one checks if a subtractive character should be inserted.

Firefox

The code below was used in Firefox's rendering engine, Gecko, until recently. It seems to have been committed on 1 December 1998. This version has since been replaced, when Xidorn Quan changed the code to use CSS3's @counter-style based styles. The original Mozilla code drop had a different implementation, which is probably what was used in Netscape Navigator.

static const char* gLowerRomanCharsA = "ixcm";
static const char* gUpperRomanCharsA = "IXCM";
static const char* gLowerRomanCharsB = "vld?";
static const char* gUpperRomanCharsB = "VLD?";

case NS_STYLE_LIST_STYLE_LOWER_ROMAN:
case NS_STYLE_LIST_STYLE_UPPER_ROMAN:
{
  if (ordinal <= 0) {
    ordinal = 1;
  }
  nsAutoString addOn, decStr;
  decStr.Append(ordinal, 10);
  PRIntn len = decStr.Length();
  const PRUnichar* dp = decStr.GetUnicode();
  const PRUnichar* end = dp + len;
  PRIntn romanPos = len;
  PRIntn n;

  const char* achars;
  const char* bchars;
  if (aListStyle.mListStyleType == NS_STYLE_LIST_STYLE_LOWER_ROMAN) {
    achars = gLowerRomanCharsA;
    bchars = gLowerRomanCharsB;
  }
  else {
    achars = gUpperRomanCharsA;
    bchars = gUpperRomanCharsB;
  }
  for (; dp < end; dp++) {
    romanPos--;
    addOn.SetLength(0);
    switch(*dp) {
    case '3':  addOn.Append(achars[romanPos]);
    case '2':  addOn.Append(achars[romanPos]);
    case '1':  addOn.Append(achars[romanPos]);
      break;
    case '4':
      addOn.Append(achars[romanPos]);
      // FALLTHROUGH
    case '5': case '6':
    case '7': case  '8':
      addOn.Append(bchars[romanPos]);
      for(n=0;n<(*dp-'5');n++) {
        addOn.Append(achars[romanPos]);
      }
      break;
    case '9':
      addOn.Append(achars[romanPos]);
      addOn.Append(achars[romanPos+1]);
      break;
    default:
      break;
    }
    result.Append(addOn);
  }
}

This code starts off by converting the input number, ordinal, into a decimal string, pointed to by dp (for "decimals pointer", I presume). achars is set up to point to an array of power-of-ten Roman letters, and bchars points to the letters that are multiples of five.

romanPos is initialized based on the length of the decimal string. For a four-decimal number, it will point one element past 'm' in achars (it gets decremented to really point to 'm' at the start of the for-loop.) There has presumably been a check before that ordinal isn't too large.

Each character in the decimal string is then processed left-to-right, with a switch statement that decides what to do for each kind of digit. achars[romanPos] is the letter whose value corresponds to the current decimal position, and bchars[romanPos] has the letter worth five times that.

TeX

TeX, the typesetting program written by Don Knuth, has a built-in primitive for converting to Roman numerals: \romannumeral.

The code below is from the latest version of TeX, available on CTAN. It is the same as in the earliest version of TeX82 that I could find. Earlier versions used a different algorithm. Knuth provides the source for some of the old versions here.

@ Roman numerals are produced by the |print_roman_int| routine.  Readers
who like puzzles might enjoy trying to figure out how this tricky code
works; therefore no explanation will be given. Notice that 1990 yields
\.{mcmxc}, not \.{mxm}.

@p procedure print_roman_int(@!n:integer);
label exit;
var j,@!k: pool_pointer; {mysterious indices into |str_pool|}
@!u,@!v: nonnegative_integer; {mysterious numbers}
begin j:=str_start["m2d5c2l5x2v5i"]; v:=1000;
loop@+  begin while n>=v do
    begin print_char(so(str_pool[j])); n:=n-v;
    end;
  if n<=0 then return; {nonpositive input produces no output}
  k:=j+2; u:=v div (so(str_pool[k-1])-"0");
  if str_pool[k-1]=si("2") then
    begin k:=k+2; u:=u div (so(str_pool[k-1])-"0");
    end;
  if n+u>=v then
    begin print_char(so(str_pool[k])); n:=n+u;
    end
  else  begin j:=j+2; v:=v div (so(str_pool[j-1])-"0");
    end;
  end;
exit:end;

TeX is written using a literate programming system called WEB. From the source code can be generated both human-readable documentation and compilable code (Pascal). In this case, the documentation can be generated like so:

$ wget http://mirrors.ctan.org/systems/knuth/dist/tex/tex.web
$ weave tex.web
$ pdftex tex.tex

The relevant excerpt from the resulting 535-page document (which is also available in print) is shown below:

Knuth's print_roman_int function in TeX82.

Spoiler alert: If you want to solve this puzzle by yourself, don't read any further.

To understand the code, I found it helpful to translate it into C:

void print_roman_int(int n)
{
        const char str[] = "m2d5c2l5x2v5i";

        int j, k;  /* Mysterious indices into str. */
        int u, v;  /* Mysterious numbers. */

        j = 0;
        v = 1000;

        while (true) {
                while (n >= v) {
                        putchar(str[j]);
                        n -= v;
                }

                if (n <= 0) {
                        return;  /* Nonpositive input produces no output. */
                }

                k = j + 2;
                u = v / (str[k - 1] - '0');

                if (str[k - 1] == '2') {
                        k += 2;
                        u = u / (str[k - 1] - '0');
                }

                if (n + u >= v) {
                        putchar(str[k]);
                        n += u;
                } else {
                        j += 2;
                        v = v / (str[j - 1] - '0');
                }
        }
}

Let's walk through calling the function with 1990:

  • Initially, j = 0 and v = 1000. Since j is an index into str, that means it's effectively pointing at 'm'. It seems that j is pointing to the current letter being considered, and v holds that letter's value.
  • The main loop starts off printing as many j-letters as possible, in our case one 'm', after which n is set to 990.
  • If n becomes zero (it never becomes negative), the conversion is done.
  • Next comes an interesting part. k is set up to point to the next letter after j. u is set to the value of v divided by the integer value that comes after the current letter. Aha, the digits in the string are encoding how the value changes when going from one letter to the next! k now points to 'd', and u is the corresponding value, 500.
  • The following if-statement checks whether the number we just divided by was 2. If so, k is bumped another step, and u is updated again. k is now pointing to 'c', and u's value is 100. This code effectively means that k will only ever point to 'c', 'x', or 'i', that is, powers of ten.
  • After that comes the n + u >= v check. If the condition is true, a k-character is written. Aha, the character pointed to by k is being considered for the subtractive notation! In our case, it means a 'c' gets printed, n is increased to 1090, and when we loop, another 'm' is written, and n becomes 90.
  • The loop then moves on, j is changed to point to 'd', then 'c'. A subtractive 'x', followed by 'c' is printed, and then we're done.

Pretty clever!

Exercises

  • None of the implementations above score any points for readability. The most straight-forward way to implement conversion to Roman numerals might be to think of it as a purely additive system with the symbols M (1000), CM (900), ..., IV (4), I (1), thus avoiding logic for figuring out when to use subtractive notation. Use this idea to implement conversion to Roman numerals.
  • Write a function that converts from roman numerals to integers. It should also handle numerals which don't follow the rules for modern use, for example IIV (3), IIII (4), and MIM (1999). Hint: it might be easier to read the Roman numeral backwards.
  • Knuth's trick of storing the ratio between numerals in the string is clever, but the ratios could also be determined based on whether the "mysterious indices" i and j are odd or even. Implement a variant of the TeX algorithm that does this.
  • After removing the ratios from the string in the previous exercise, the string is now 7 characters long, which is small enough to fit in a 64-bit integer. Instead of indexing into the string, it can be shifted right to put the next character in the least significant 8 bits. To check whether a character c is a multiple of 5 ('d', 'l', or 'v'), check (c & 0x5) == 0x4. Use these ideas to write a variant of the TeX algorithm without the string or indexes.