Simon Singh - The Code Book - The Arab Cryptanalysts

The Arab Cryptanalysts

At the age of about forty, Muhammad began regularly visiting an isolated cave on Mount Hira just outside Mecca. This was a retreat, a place for prayer, meditation and contemplation. It was during a period of deep reflection, around AD 610, that he was visited by the archangel Gabriel, who proclaimed that Muhammad was to be the messenger of God. This was the first of a series of revelations which continued until Muhammad died some twenty years later. The revelations were recorded by various scribes during the Prophet's life, but only as fragments, and it was left to Abu Bakr, the first caliph of Islam, to gather them together into a single text. The work was continued by Umar, the second caliph, and his daughter Hafsa, and was eventually completed by Uthman, the third caliph. Each revelation became one of the 114 chapters of the Koran.

The ruling caliph was responsible for carrying on the work of the Prophet, upholding his teachings and spreading his word. Between the appointment of Abu Bakr in 632 and the death of the fourth caliph, Ali, in 661, Islam spread until half the known world was under Muslim rule. Then in 750, after a century of consolidation, the start of the Abbasid caliphate (or dynasty) heralded the golden age of Islamic civilisation. The arts and sciences flourished in equal measure. Islamic craftsmen bequeathed us magnificent paintings, ornate carvings, and the most elaborate textiles in history, while the legacy of Islamic scientists is evident from the number of Arabic words that pepper the lexicon of modern science, such as algebra, alkaline and zenith.

The richness of Islamic culture was to a large part the result of a wealthy and peaceful society. The Abbasid caliphs were less interested than their predecessors in conquest, and instead concentrated on establishing an organised and affluent society. Lower taxes encouraged businesses to grow and gave rise to greater commerce and industry, while strict laws reduced corruption and protected citizens. All this relied on an effective system of administration, and in turn the administrators relied on secure communication achieved through the use of encryption. As well as encrypting sensitive affairs of state, it is documented that officials protected tax records, demonstrating a widespread and routine use of cryptography. Further evidence comes from many administrative manuals, such as the tenth-century Adab al-Kuttab (`The Secretaries' Manual'), which include sections devoted to cryptography.

The administrators usually employed a cipher alphabet which was simply a rearrangement of the plain alphabet, as described earlier, but they also used cipher alphabets that contained other types of symbols. For example, a in the plain alphabet might be replaced by # in the cipher alphabet, b might be replaced by +, and so on. The monoalphabetic substitution cipher is the general name given to any substitution cipher in which the cipher alphabet consists of either letters or symbols, or a mix of both. All substitution ciphers that we have met so far come within this general category.
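
The idea of a cipher alphabet that mixes letters and symbols can be sketched in a few lines of Python. This is only an illustration: the particular cipher alphabet below (a becomes #, b becomes +, and the remaining letters are replaced by a rearranged alphabet) is an assumption made for the example, not a historical key, and the names KEY, encrypt and decrypt are hypothetical.

```python
import string

# Hypothetical cipher alphabet: two symbols followed by 24 rearranged letters.
# Following the text's example, a -> #, b -> +; the rest is an arbitrary choice.
CIPHER_ALPHABET = "#+" + "QWERTYUIOPASDFGHJKLZXCVB"

# The key pairs each plain letter with exactly one cipher symbol.
KEY = dict(zip(string.ascii_lowercase, CIPHER_ALPHABET))


def encrypt(plaintext, key=KEY):
    """Replace each plain letter with its cipher symbol; leave other characters alone."""
    return "".join(key.get(ch, ch) for ch in plaintext.lower())


def decrypt(ciphertext, key=KEY):
    """Invert the key and map each cipher symbol back to its plain letter."""
    inverse = {symbol: letter for letter, symbol in key.items()}
    return "".join(inverse.get(ch, ch) for ch in ciphertext)


if __name__ == "__main__":
    secret = encrypt("attack at dawn")
    print(secret)
    print(decrypt(secret))
```

Because every plain letter always maps to the same cipher symbol, the sketch is monoalphabetic whatever symbols are chosen. It assumes the plaintext itself contains no # or + characters, since those would be indistinguishable from cipher symbols on decryption.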

Had the Arabs merely been familiar with the use of the monoalphabetic substitution cipher, they would not warrant a significant mention in any history of cryptographers. However, in addition to employing ciphers, the Arab scholars were also capable of destroying ciphers. They in fact invented cryptanalysis, the science of unscrambling a message without knowledge of the key. While the cryptographer develops new methods of secret writing, it is the cryptanalyst who struggles to find weaknesses in these methods in order to break into secret messages. Arabian cryptanalysts succeeded in finding a method for breaking the monoalphabetic substitution cipher, a cipher that had remained invulnerable for several centuries.

Cryptanalysis could not be invented until a civilisation had reached a sufficiently sophisticated level of scholarship in several disciplines, including mathematics, statistics and linguistics. The Muslim civilisation provided an ideal cradle for cryptanalysis, because Islam demands justice in all spheres of human activity, and achieving this requires knowledge, or ilm. Every Muslim is obliged to pursue knowledge in all its forms, and the economic success of the Abbasid caliphate meant that scholars had the time, money and materials required to fulfil their duty. They endeavoured to acquire the knowledge of previous civilisations by obtaining Egyptian, Babylonian, Indian, Chinese, Farsi, Syriac, Armenian, Hebrew and Roman texts and translating them into Arabic. In 815, the Caliph al-Ma`mun established in Baghdad the Bait al-Hikmah (`House of Wisdom'), a library and centre for translation.

At the same time as acquiring knowledge, the Islamic civilisation was able to disperse it, because it had procured the art of paper-making from the Chinese. The manufacture of paper gave rise to the profession of warraqin, or `those who handle paper', human photocopying machines who copied manuscripts and supplied the burgeoning publishing industry. At its peak, tens of thousands of books were published every year, and in just one suburb of Baghdad there were over a hundred bookshops. As well as such classics as Tales from the Thousand and One Nights, these bookshops also sold textbooks on every imaginable subject, and helped to support the most literate and learned society in the world.

In addition to a greater understanding of secular subjects, the invention of cryptanalysis also depended on the growth of religious scholarship. Major theological schools were established in Basra, Kufa and Baghdad, where theologians scrutinised the revelations of Muhammad as contained in the Koran. The theologians were interested in establishing the chronology of the revelations, which they did by counting the frequencies of words contained in each revelation. The theory was that certain words had evolved relatively recently, and hence if a revelation contained a high number of these newer words, this would indicate that it came later in the chronology. Theologians also studied the Hadith, which consists of the Prophet's daily utterances. They tried to demonstrate that each statement was indeed attributable to Muhammad. This was done by studying the etymology of words and the structure of sentences, to test whether particular texts were consistent with the linguistic patterns of the Prophet.
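
The word-counting approach to chronology can be illustrated with a toy sketch. Everything here is an assumption for the sake of the example: the function names are hypothetical, the list of "newer" words would have to come from scholarship, and the real analysis was of course carried out by hand on Arabic texts.

```python
def newer_word_rate(text, newer_words):
    """Fraction of the words in `text` that belong to the set of newer words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in newer_words for word in words) / len(words)


def order_by_newer_words(texts, newer_words):
    """Sort text names from the lowest to the highest rate of newer words.

    Under the theory described above, a higher rate suggests a later text.
    """
    return sorted(texts, key=lambda name: newer_word_rate(texts[name], newer_words))


if __name__ == "__main__":
    # Placeholder texts and vocabulary, invented purely for illustration.
    toy_texts = {"A": "x x x y", "B": "y y x x", "C": "x x x x"}
    print(order_by_newer_words(toy_texts, {"y"}))
```

The ordering is only as good as the assumption that the chosen words really did enter the language later.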

Significantly, the religious scholars did not stop their scrutiny at the level of words. They also analysed individual letters, and in particular they discovered that some letters are more common than others. The letters a and l are the most common in Arabic, partly because of the definite article al-, whereas the letter j appears only a tenth as frequently. This apparently innocuous observation would lead to the first great breakthrough in cryptanalysis.

Although it is not known who first realised that the variation in the frequencies of letters could be exploited in order to break ciphers, the earliest known description of the technique is by the ninth-century scientist Abu Yusuf Ya`qub ibn Is-haq ibn as-Sabbah ibn `omran ibn Ismail al-Kindi. Known as `the philosopher of the Arabs', al-Kindi was the author of 290 books on medicine, astronomy, mathematics, linguistics and music. His greatest treatise, which was rediscovered only in 1987 in the Sulaimaniyyah Ottoman Archive in Istanbul, is entitled A Manuscript on Deciphering Cryptographic Messages; the first page is shown in Figure 6. Although it contains detailed discussions on statistics, Arabic phonetics and Arabic syntax, al-Kindi's revolutionary system of cryptanalysis is encapsulated in two short paragraphs:

One way to solve an encrypted message, if we know its language, is to find a different plaintext of the same language long enough to fill one sheet or so, and then we count the occurrences of each letter. We call the most frequently occurring letter the `first', the next most occurring letter the `second', the following most occurring letter the `third', and so on, until we account for all the different letters in the plaintext sample.

Then we look at the ciphertext we want to solve and we also classify its symbols. We find the most occurring symbol and change it to the form of the `first' letter of the plaintext sample, the next most common symbol is changed to the form of the `second' letter, and the following most common symbol is changed to the form of the `third' letter, and so on, until we account for all the symbols of the cryptogram we want to solve.

Al-Kindi's technique is easier to explain in terms of the English alphabet. First of all, it is necessary to study a lengthy piece of normal English text, perhaps several, in order to establish the frequency of each letter of the alphabet. In English, e is the most common letter, followed by t, then a, and so on, as given in Table 1. Next examine the ciphertext in question, and work out the frequency of each letter. If the most common letter in the ciphertext is, for example, J, then it would seem likely that this is a substitute for e. And if the second most common letter in the ciphertext is P, then this is probably a substitute for t, and so on. Al-Kindi's technique, known as frequency analysis, shows that it is unnecessary to check each of the billions of potential keys. Instead, it is possible to reveal the contents of a scrambled message simply by analysing the frequency of the characters in the ciphertext.
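
The rank-matching recipe described above can be written out as a short Python sketch. It is a minimal illustration rather than a practical codebreaker: the function names are hypothetical, ciphertext is assumed to be written in capital letters and plaintext in lower case, and ties in frequency are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def rank_by_frequency(text):
    """Return the letters of `text`, most frequent first (ties broken alphabetically)."""
    counts = Counter(ch for ch in text if ch.isalpha())
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


def guess_key(ciphertext, reference_text):
    """Pair the n-th most common cipher letter with the n-th most common plain letter."""
    cipher_ranked = rank_by_frequency(ciphertext.upper())
    plain_ranked = rank_by_frequency(reference_text.lower())
    return dict(zip(cipher_ranked, plain_ranked))


def apply_guess(ciphertext, guess):
    """Substitute every cipher letter that has a guess; leave everything else unchanged."""
    return "".join(guess.get(ch, ch) for ch in ciphertext.upper())


if __name__ == "__main__":
    # Tiny artificial sample: e, t and a occur 4, 3 and 1 times in the reference,
    # while J, P and X occur 4, 3 and 1 times in the ciphertext.
    guess = guess_key("JJJJPPPX", "eeeettta")
    print(guess)
    print(apply_guess("JPX", guess))
```

On a real message the first guess is rarely correct for every letter, so the output is best treated as a starting point to be refined by eye.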

Table 1   This table of relative frequencies is based on passages taken from newspapers and novels, and the total sample was 100,362 alphabetic characters. The table was compiled by H. Becker and F. Piper, and originally published in Cipher Systems: The Protection of Communication.

Letter  Percentage   Letter  Percentage
a        8.2         n        6.7
b        1.5         o        7.5
c        2.8         p        1.9
d        4.3         q        0.1
e       12.7         r        6.0
f        2.2         s        6.3
g        2.0         t        9.1
h        6.1         u        2.8
i        7.0         v        1.0
j        0.2         w        2.4
k        0.8         x        0.2
l        4.0         y        2.0
m        2.4         z        0.1

However, it is not possible to apply al-Kindi's recipe for cryptanalysis unconditionally, because the standard list of frequencies in Table 1 is only an average, and it will not correspond exactly to the frequencies of every text. For example, a brief message discussing the effect of the atmosphere on the movement of striped quadrupeds in Africa would not yield to straightforward frequency analysis: `From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags.' In general, short texts are likely to deviate significantly from the standard frequencies, and if there are fewer than a hundred letters, then decipherment will be difficult. On the other hand, longer texts are more likely to follow the standard frequencies, although this is not always the case. In 1969, the French author Georges Perec wrote La Disparition, a 200-page novel that did not use words that contain the letter e. Doubly remarkable is the fact that the English novelist and critic Gilbert Adair succeeded in translating La Disparition into English, while still following Perec's shunning of the letter e. Entitled A Void, Adair's translation is surprisingly readable (see Appendix A). If the entire book were encrypted via a monoalphabetic substitution cipher, then a naive attempt to decipher it might be stymied by the complete lack of the most frequently occurring letter in the English alphabet.
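
The zebra sentence can be checked against Table 1 with a short Python sketch. The percentages are copied from Table 1; the function names and the idea of ranking letters by the size of the gap between observed and expected percentages are assumptions made for this illustration.

```python
from collections import Counter

# Relative frequencies (per cent) copied from Table 1.
TABLE_1 = {
    "a": 8.2, "b": 1.5, "c": 2.8, "d": 4.3, "e": 12.7, "f": 2.2, "g": 2.0,
    "h": 6.1, "i": 7.0, "j": 0.2, "k": 0.8, "l": 4.0, "m": 2.4, "n": 6.7,
    "o": 7.5, "p": 1.9, "q": 0.1, "r": 6.0, "s": 6.3, "t": 9.1, "u": 2.8,
    "v": 1.0, "w": 2.4, "x": 0.2, "y": 2.0, "z": 0.1,
}

SENTENCE = "From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags."


def observed_percentages(text):
    """Percentage of each letter among the alphabetic characters of `text`."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {letter: 100 * n / len(letters) for letter, n in counts.items()}


def largest_deviations(text, expected=TABLE_1, top=3):
    """Letters whose observed percentage differs most from the expected one."""
    observed = observed_percentages(text)
    gaps = {letter: observed.get(letter, 0.0) - pct for letter, pct in expected.items()}
    return sorted(gaps, key=lambda letter: abs(gaps[letter]), reverse=True)[:top]


if __name__ == "__main__":
    observed = observed_percentages(SENTENCE)
    print(max(observed, key=observed.get))   # most common letter in the sentence
    print(largest_deviations(SENTENCE))
```

In this sentence z is the most common letter, so a codebreaker who blindly matched the most common cipher symbol to e would start with a wrong guess.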

Extract taken from Simon Singh, The Code Book, 1999, pp 14-20.
