Hard Truths about the Hard Business of finding Hard Random Numbers

Editorial note: this was a rant originally posted elsewhere, but has now moved here. It is updated with new thoughts from time to time, here not elsewhere.

As many have noticed, there is now a permathread (Paul's term) on how to do random numbers. It's always been warm. Now the arguments are on solid simmer, raging on half a dozen cryptogroups, all thanks to the NSA and their infamous breach of NIST, American industry, mom's apple pie and the privacy of all things from Sunday school to Angry Birds.

Why is the topic of random numbers so bubbling, effervescent, unsatisfying? In short, because generators of same (RNGs) are *hard*. They are in practical experience trickier than most of the other modules we deal with: ciphers, HMACs, public key, protocols, etc.

Yet, we have come a long way. We now have a working theory. When Ada put together her RNG this last summer, it wasn't that hard. Out of our experience, herein is a collection of things we figured out; with the normal caveat that, even as RNs require stirring, the recipe for 'knowing' is also evolving.

Use what your platform provides. Random numbers are hard, which is the first thing you have to remember, and always come back to. Random numbers are so hard, that you have to care a lot before you get involved. A hell of a lot. Which leads us to the following rules of thumb for RNG production.
1. Use what your platform provides.
2. Unless you really really care a lot, in which case, you have to write your own RNG.
3. There isn't a lot of middle ground.
4. So much so that for almost all purposes, and almost all users, Rule #1 is this: Use what your platform provides. E.g., for *nix, use urandom [Ptacek] [Hühn].
5. When deciding to breach Rule #1, you need a compelling argument that your RNG delivers better results than the platform's [Gutmann1]. Without that compelling argument, your results are likely to be more random than the platform's system in every sense except the quality of the numbers.
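
As a minimal sketch of Rule #1 in Python: the standard library already exposes the platform's generator, so there is nothing to build. The variable names and the helper `platform_random_bytes` are illustrative only.

```python
import os
import secrets

# 32 bytes (256 bits) from the operating system's CSPRNG.
# On *nix this is the kernel generator that also backs /dev/urandom.
key = os.urandom(32)

# The secrets module wraps the same source with convenience helpers.
token = secrets.token_hex(16)     # 16 random bytes rendered as 32 hex chars
pick = secrets.randbelow(6) + 1   # an unbiased integer in the range 1..6


def platform_random_bytes(n: int) -> bytes:
    """Hypothetical helper: the whole 'RNG' is a call to the platform."""
    return os.urandom(n)
```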

Software is our domain.
1. Software is unreliable. It can be made reliable under bench conditions, but out in the field, any software of more than 1 component (always) has opportunities for failure. In practice, we're usually talking dozens or hundreds, so failure of another component is a solid possibility; a real threat.
2. What about hardware RNGs? Eventually they have to go through some software, to be of any use. Although there are some narrow environments where there might be a pure hardware delivery, this is so exotic, and so alien to the reader here, that there is no point in considering it. Hardware serves software. Get used to it.
3. More specifically, the application is our domain, not the platform. This means we can tune the RNG to our local context, whereas a platform RNG is constrained by an idealised model.
4. As a practical reliability approach, we typically model every component as failing, and try and organise our design to carry on.

Security is also our domain, which is to say we have real live attackers.
1. Many of the sciences rest on a statistical model, which they can do in absence of any attackers. According to Bernoulli's law of large numbers, models of data will even out over time and quantity. In essence, we then can use statistics to derive strong predictions. If random numbers followed the law of large numbers, then measuring 1000 of them would tell us with near certainty that the machine was good for another 1000.
2. In security, we live in a byzantine world, which means we have real live attackers who will turn our assumptions upside down, out of spite. When an attacker is trying to aggressively futz with your business, he will also futz with any assumptions and with any tests or protections you have that are based on those assumptions. Once attackers start getting their claws and bits in there, the assumption behind Bernoulli's law falls apart. In essence this rules out lazy reliance on statistics.

No Test. There is no objective test of random numbers, because it is impossible to test for unpredictability [Denker1]. Which in practical terms means that you cannot easily write a test for it, nor can any test you write do the job you want it to do. This is the key unfortunate truth that separates RNs out from ciphers, etc (which latter are amenable to test vectors, and with vectors in hand, they become tractable).
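
To make the point concrete, here is a small sketch (not from the original essay) of a generator that is completely predictable, because its seed is public, yet sails through a simple statistical check. The monobit statistic and the function names are chosen for illustration; any frequency-style test would behave similarly.

```python
import hashlib


def predictable_stream(n_bytes: int) -> bytes:
    """Deterministic bytes from a publicly known seed: zero unpredictability."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        block = hashlib.sha256(b"public seed" + counter.to_bytes(8, "big"))
        out += block.digest()
        counter += 1
    return bytes(out[:n_bytes])


def ones_fraction(data: bytes) -> float:
    """Monobit statistic: fraction of bits set to 1 (an ideal source gives 0.5)."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones / (8 * len(data))


sample = predictable_stream(100_000)
frac = ones_fraction(sample)   # statistically unremarkable, yet anyone can reproduce it
```

The statistic only measures the shape of the output. It says nothing about whether an attacker could have computed the same bytes.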

Entropy. Everyone talks about entropy so we must too, else your future RNG will exhibit the wrong sort of unpredictability. Sadly, entropy is not precisely the answer, enough so that talking about it is likely missing the point. If we could collect it reliably and fully, RNs would be easy. We can't, so it isn't.
1. Entropy is a statistical property of the physical situation [Denker2]. It is the opposite of information. It quantifies how much is /not/ known about the situation. Under suitable conditions, the thermal fluctuations in a linear analog circuit are a good source of entropy [Nyquist&Johnson]. This stands in contrast to ordinary digital logic, which takes pains to prevent thermal fluctuations from coupling to the signal.
2. There are objective statements we can make about entropy. The objective way to approach the collection of entropy is to carefully analyse the properties of the system and apply science to estimate a lower bound to the amount of (e.g.) thermal uncertainty one can derive from it. This is possible and instructive, and for a nice (deep) example of this, see John Denker's Turbid [Denker1].
3. At the level of implementation, objective statements about entropy fail to serve us for 2 reasons. Let's look at those, as understanding these limitations on objectivity is key to understanding why entropy does not serve us so willingly.
  1. Entropy can be objectively analysed as long as we do not have an attacker. An attacker can deliver a faulty device, can change the device, and can change the way the software deals with the device at the device driver level. And much more...
  2. This approach is complete if we have control of our environment. Of course, it is very easy to say Buy the XYZ RNG and plug it in. But many environments do not have that capability, often enough we don't know our environment, and the environment can break or be changed [Gutmann2]. Examples: rack servers lacking sound cards; phones; VMs; routers/firewalls; early startup on embedded hardware.
4. In conclusion, entropy for everyone is too high a bar to leap. We can reach it briefly, or in controlled environments, but not enough to make it provide the only answer. Given our limitations, we have to do more than collect entropy.
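
A small piece of arithmetic shows why a lower bound is the figure to quote. The sketch below compares Shannon entropy (an average) with min-entropy (the worst case, governed by the most likely outcome) for a biased source. It works from sample counts, which assumes independent samples and no attacker; the essay's point is that the real bound should come from analysing the physical system, which this sketch does not attempt.

```python
import math
from collections import Counter


def shannon_entropy_bits(samples) -> float:
    """Average uncertainty per sample, in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def min_entropy_bits(samples) -> float:
    """Worst-case uncertainty per sample: set by the most likely outcome."""
    counts = Counter(samples)
    n = len(samples)
    return -math.log2(max(counts.values()) / n)


# A heavily biased one-bit source: 90% zeros, 10% ones.
biased = [0] * 90 + [1] * 10
average_case = shannon_entropy_bits(biased)   # roughly 0.47 bits per sample
worst_case = min_entropy_bits(biased)         # roughly 0.15 bits per sample
```

An attacker guesses the most likely value first, so the smaller worst-case number is the safer one to credit.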

CSRNs. The practical standard to deliver to users is what we call Cryptographically Secure Random Numbers.
1. Cryptographically secure random numbers (or CSRNs) are numbers that are not predictable /to an attacker/. In contrast to entropy, we might be able to predict our CSRNs, but our enemies cannot. This is a strictly broader and easier definition than entropy, which is needed because collecting entropy and feeding it to our users is too hard, as above.
2. One big assumption: that we can determine who is our attacker and keep him out, and determine who is friendly and let them in. This is a big flaw! But it happens to be a very basic and ever-present one in security, so while it exists, it is one we can readily work with.
3. Second big assumption: we can generate CSRNs easily enough, but we still need a seed! Which is entropic, but we only need a little of it. The battle of entropy versus security leads to the design secret in the very next section: Compromise.

Design. Many experiments and research seem to have settled on the following design pattern, which we call a Trident Design Pattern:
```
   Entropy Collector  ----\
                           \ _____          _________
                            /     \        /         \
   Entropy Collector  ---->( Mixer )----->( Expander  )-----> RNs
                            \_____/        \_________/
                           /
   Entropy Collector  ----/
```
In short, many collectors of entropy feed their small contributions in to a Mixer, which uses the melded result to seed an Expander. The high level caller (application) uses this Expander to request her random numbers.
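
The data flow in the diagram can be sketched in a few lines of Python. Everything here is a stand-in chosen for brevity: SHA-256 as the Mixer, SHAKE-256 as the Expander, and three toy Collectors. None of these choices comes from the original design.

```python
import hashlib
import os
import time
from typing import Callable, Sequence

# A Collector is modelled as a zero-argument function returning raw bytes.
Collector = Callable[[], bytes]


def mixer(collectors: Sequence[Collector]) -> bytes:
    """Meld every Collector's contribution into one 256-bit seed."""
    h = hashlib.sha256()
    for collect in collectors:
        h.update(collect())
    return h.digest()


def expander(seed: bytes, n_bytes: int) -> bytes:
    """Stretch the small seed into as many bytes as the caller asks for."""
    return hashlib.shake_256(seed).digest(n_bytes)


# Three toy Collectors feeding the trident (illustrative, not recommendations).
collectors = [
    lambda: os.urandom(32),
    lambda: time.time_ns().to_bytes(8, "little"),
    lambda: os.getpid().to_bytes(8, "little"),
]

rns = expander(mixer(collectors), 64)   # Collectors -> Mixer -> Expander -> RNs
```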

Collectors. After all the above bad news, what is left in the software toolkit is: redundancy.
1. A redundant approach tells us to draw our RNs from different places. The component that collects RNs from one place is called a Collector. Therefore we want many Collectors.
2. Each of the many places should be uncorrelated with each other. If one of these were to fail, it would be unlikely that others also would fail, as they are uncorrelated. Typical studies of fault-tolerant systems often suggest the number 3 as the target.
3. Some common collector ideas are:
  - the platform's own RNG, as a Collector into your RNG,
  - any CPU RNG such as Intel's RDRAND,
  - measuring the difference between two uncorrelated clocks,
  - timings and other measurands from events (e.g., mouse click times and locations),
  - available sensors (GPS, movement on phones),
  - differences seen in incoming new business packets,
  - a roughly protected external source such as a business feed.
  By the analysis that got us past Rule #1, there are no great Collectors by definition, as otherwise we'd already be using them, and this problem would go away.
4. An attacker is assumed to be able to take a poke at one or two of these sources, but not all. If the attacker can futz with all our sources, this implies that he has more or less unlimited control over our entire machine. In which case, it's his machine, and not ours. We have bigger problems than RNs.
5. We tend to want more numbers than fault-tolerant reliability suggests because we want to make it harder for the attacker. E.g., 6 would be a good target.
6. Remember, we want maximum uncorrelation. Adding correlated collectors doesn't improve the numbers.
7. It helps to mix in general-purpose components. Relying solely on built-for-purpose RNGs (including RNG chips, FIPS-approved HSMs) is like putting a "kick me" sign on your own back, and you have to pay through the nose for the abuse. In contrast, the humble sound card can be put to lots of different uses, and it is relatively hard for the bad guys to mess with it in a way that subverts the crypto without making the device unusable for other purposes.
8. Calibrate each Collector to provide a lower bound on entropy. Do this to a reasonable, not perfect, level, and err on the side of underestimation rather than overestimation. As we have redundancy, on a large scale, we are not that fussed about the quality of each Collector. Better to add another uncorrelated Collector than to improve the quality of one of them by 10%. This is an important benefit of redundancy: we don't have to be paranoid about the quality of this code (as we do with the Mixer).
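
One way to picture a set of calibrated Collectors is sketched below. The `Collector` record, the `clock_jitter` function and, above all, the entropy credits are hypothetical placeholders: real figures would have to come from analysing each source, and whether the two Python clocks used here are truly uncorrelated depends on the platform.

```python
import os
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Collector:
    name: str
    read: Callable[[], bytes]
    min_entropy_bits: int   # deliberately pessimistic credit per read


def clock_jitter() -> bytes:
    """Low byte of the difference pattern between two clocks, sampled 64 times."""
    samples = bytearray()
    for _ in range(64):
        delta = time.perf_counter_ns() ^ time.time_ns()
        samples.append(delta & 0xFF)   # keep only the noisiest byte
    return bytes(samples)


COLLECTORS = [
    # The platform RNG hands over 256 bits; credit only half of that.
    Collector("platform", lambda: os.urandom(32), 128),
    # 64 bytes of timing data; credit a token amount (placeholder figure).
    Collector("clock-jitter", clock_jitter, 8),
]

total_credit = sum(c.min_entropy_bits for c in COLLECTORS)
```

Adding a third uncorrelated entry to `COLLECTORS` is a one-line change, which is the cheap kind of improvement the list above argues for.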

Mixer. Because we want the best and simplest result delivered to the caller, we have to take the output of all those above Collectors, mix them together, and deliver downstream.
1. The Mixer is the trickiest part of it all. Here, you make or break. Here, you need to be paranoid. Careful. Seek more review.
2. The Mixer has to provide some seed numbers of say 128-512 bits to the Expander (see below for rationale). It has to provide this on demand, quickly, without waiting around.
3. There appear to be two favourite designs here: Push or Pull. In Push, the Collectors send their data directly into the Mixer, forcing it to mix the data in as it arrives. In contrast, a Pull design will have the Mixer asking the Collectors to provide what they have right now. This in short suggests that in a Push design the Mixer has to have a cache, while in Pull mode, the Collectors might be well served in having caches within themselves.
4. Push or Mixer-Cache designs are probably more popular. See Yarrow and Fortuna as perhaps the best documented efforts [Yarrow] [Fortuna].
5. We wrote our recent Trident effort (AdazPRING) using Pull. The benefits include: simplified API as it is direct pull all the way through; no cache or thread in mixer; and as the Collectors better understand their own flow, so they better understand the need for caching and threading.
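
A Pull-style Mixer might look like the following sketch. It is not the AdazPRING code, whose internals the essay does not show: the use of SHA-512, the domain label and the length-prefix framing are all assumptions made for illustration. A failing Collector is treated as contributing nothing, in line with modelling every component as able to fail.

```python
import hashlib
import os
import time
from typing import Callable, Sequence


def pull_mix(collectors: Sequence[Callable[[], bytes]], seed_len: int = 32) -> bytes:
    """Ask every Collector for what it has right now and hash it into a seed."""
    if not 16 <= seed_len <= 64:            # 128 to 512 bits
        raise ValueError("seed_len must be between 16 and 64 bytes")
    h = hashlib.sha512()
    h.update(b"trident-mixer-v1")           # hypothetical domain-separation label
    for index, collect in enumerate(collectors):
        try:
            data = collect()
        except Exception:
            data = b""                      # a dead Collector adds nothing; carry on
        # Frame each contribution so that (b"ab", b"c") and (b"a", b"bc") differ.
        h.update(index.to_bytes(4, "big"))
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.digest()[:seed_len]


seed = pull_mix([
    lambda: os.urandom(32),
    lambda: time.time_ns().to_bytes(8, "big"),
])
```

There is no cache and no thread here: the call pulls, hashes and returns.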

Expander. Out of the Mixer comes some nice RNs.
1. But not a lot! That's because good Collectors are typically not firehoses but rather leaky taps, and the Mixer can't improve on that, as, according to the laws of thermodynamics, it is impossible to create entropy by computation alone.
2. Yet, the caller often wants a lot of RNs and she doesn't like to wait around.
3. To solve the mismatch between the Mixer output and the application's needs, we create an expansion function or Expander. This function is pretty simple: (a) it takes a small seed and (b) turns that into a hugely long stream. It could be called the Firehose...
4. Recalling our truth above of (c) CSRNs being the goal, not entropy, we now have a really easy solution to this problem: Use a cryptographic stream cipher. This black box takes a small seed (a-check!) and provides a near-infinite series of bytes (b-check!) that are cryptographically secure (c-check!). We don't care about the plaintext, but by the security claims behind the cipher, the stream is cryptographically unpredictable without access to the seed.
5. Super easy: Any decent, modern, highly secure stream cipher is probably good for this application. Our current favourite is ChaCha20 but any of the eSTREAM portfolio would be fine.
6. In summary, the Expander is simply this: when the application asks for a fount of RNs, we ask the Mixer for a seed, initialise a stream cipher with the seed, and return it to the caller. The caller sucks on the output of the stream cipher until she's had her fill!
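
For illustration, here is a pure-Python sketch of an Expander built on the ChaCha20 block function, laid out as in RFC 8439 (32-bit block counter, 96-bit nonce). It is written for readability, not speed or side-channel resistance, and a vetted library should be used for anything real. The `Expander` class, the all-zero nonce and the assumption that each seed is used for exactly one stream are choices made for this sketch.

```python
import struct

MASK = 0xFFFFFFFF


def _rotl(v: int, n: int) -> int:
    return ((v << n) & MASK) | (v >> (32 - n))


def _quarter_round(s, a, b, c, d):
    s[a] = (s[a] + s[b]) & MASK; s[d] = _rotl(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK; s[b] = _rotl(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK; s[d] = _rotl(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK; s[b] = _rotl(s[b] ^ s[c], 7)


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """One 64-byte keystream block from a 32-byte key and 12-byte nonce."""
    init = (list(struct.unpack("<4I", b"expand 32-byte k"))
            + list(struct.unpack("<8I", key))
            + [counter]
            + list(struct.unpack("<3I", nonce)))
    s = init[:]
    for _ in range(10):                      # 10 double rounds = 20 rounds
        _quarter_round(s, 0, 4, 8, 12); _quarter_round(s, 1, 5, 9, 13)
        _quarter_round(s, 2, 6, 10, 14); _quarter_round(s, 3, 7, 11, 15)
        _quarter_round(s, 0, 5, 10, 15); _quarter_round(s, 1, 6, 11, 12)
        _quarter_round(s, 2, 7, 8, 13); _quarter_round(s, 3, 4, 9, 14)
    return struct.pack("<16I", *((x + y) & MASK for x, y in zip(s, init)))


class Expander:
    """Turns one 256-bit seed from the Mixer into a long keystream."""

    def __init__(self, seed: bytes):
        if len(seed) != 32:
            raise ValueError("seed must be 32 bytes")
        self._key, self._counter, self._buf = seed, 0, b""

    def read(self, n: int) -> bytes:
        while len(self._buf) < n:
            if self._counter > MASK:         # 2**32 blocks: ask for a new seed
                raise RuntimeError("keystream exhausted; reseed")
            self._buf += chacha20_block(self._key, self._counter, bytes(12))
            self._counter += 1
        out, self._buf = self._buf[:n], self._buf[n:]
        return out


stream = Expander(bytes(range(32)))          # fixed demo seed; use a Mixer seed in practice
first = stream.read(100)
```

The keystream itself is the output. There is no plaintext to encrypt, which is why the block function is all that is needed.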

Subtleties.
1. Startup. When a system first starts up there is often a shortage of easy entropy to collect. This can lead to catastrophic results if your app decides that it needs to generate high-value keys as soon as it starts up. This is a real problem -- scans of keys on the net have found significant numbers that are the same, which is generally traced to the restart problem. To solve this, either change the app (hard) ... or store some entropy for next time. How you do this is beyond scope.
2. Freeze & Copy attacks. With many environments, your attacker can put your platform into a halt, read off your RNG's state in some fashion, restart it, and then use the state for nefarious purposes. This is especially a problem with VMs. We therefore set the goal that the current state of the RNG cannot be rolled forward nor backwards to predict prior or future uses. To deal with this, a good RNG will typically:
  - stir fresh entropy into the at-rest state even if not required by the user. E.g. a mixer might poll the Collectors occasionally, or an Expander might rekey occasionally.
  - Use hash whiteners between components. Typically, a SHA digest or similar will be used to protect the state of a component as it passes its output to the next stage.
3. As a technical design argument, the only objective way that you can show that your design is at least as good as or better than the platform-provided RNG is the following:
  1. Very careful review and testing of the software and design, and especially the Mixer; and
  2. including the platform's RNG as a Collector.
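
The two defences against Freeze & Copy can be sketched as a small hash ratchet. The class name, the labels and the choice of SHA-256 are illustrative assumptions. The state moves forward through a one-way hash after every use, so a captured state does not reveal earlier outputs, and fresh Collector data can be stirred in at any time, so a captured state stops predicting later outputs once the attacker misses a stir.

```python
import hashlib
import os


class RatchetState:
    """At-rest RNG state that is whitened on output and advanced one-way."""

    def __init__(self, seed: bytes):
        self._state = hashlib.sha256(b"init" + seed).digest()

    def next_seed(self) -> bytes:
        # Hash whitener: the caller sees a digest of the state, never the state.
        out = hashlib.sha256(b"out" + self._state).digest()
        # One-way step: the old state cannot be recovered from the new one.
        self._state = hashlib.sha256(b"next" + self._state).digest()
        return out

    def stir(self, fresh: bytes) -> None:
        # Mix fresh entropy into the at-rest state, even if nobody asked.
        self._state = hashlib.sha256(b"stir" + self._state + fresh).digest()


ratchet = RatchetState(os.urandom(32))
seed_a = ratchet.next_seed()
ratchet.stir(os.urandom(16))      # e.g. an occasional poll of the Collectors
seed_b = ratchet.next_seed()
```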

Business Justifications. As you can see, doing RNGs is hard! Rule #1 -- use what the platform provides. You shouldn't be doing this. About the only rationales for doing your own RNG are the following.
1. Your application has something to do with money or journalism or anti-government protest or is a CVP. By money, we mean Bitcoin or other forms of hard digital cash, not online banking. The most common CVP or centralised vulnerability party (aka TTP or trusted third party) is the Certification Authority.
2. Your operating platform is likely to be attacked by a persistent and aggressive attacker. This might be true if the platform is one of the following: any big American or government controlled software, Microsoft Windows, Java (code, not applets), any mobile phone OS, COTS routers/firewalls, virtual machines (VMs).
3. You write your own application software, your own libraries *and* your own crypto!
4. You can show objectively that you can do a better job.
Note that it is still a hard test: you want ALL of those to be true before you start mucking around in this chaotic area.

That all said, good luck! Comments to the normal place, please, and Ed's note: this will improve in time.

Endnotes. This essay received considerable review from John Denker. However, due to the need to shorten the content into something more easily digestible by the busy hacker, there are both shortcuts and intentional errors. Those errors remain mine, the unravelling is yours.

Terminology. Terms can be very confusing. For sake of observation, the above Collector is also a mini-random-number-generator; yet the term generator is strictly incorrect because entropy is collected, not generated. Historically, the terms RNG (random number generator) and PRNG (pseudo-RNG) have been applied to the whole construction. More usefully, the PRNG is the Expander, as it 'generates' pseudo random numbers from a seed. We have less consensus on what I call the Mixer above. As you can see from comments, the caching part is optional whereas the mixing part is essential.

References

[Ptacek] Thomas & Erin Ptacek, "How To Safely Generate A Random Number", 25 Feb 2014.

[Hühn] Thomas Hühn, "Myths about /dev/urandom", 16 March 2014.

[Gutmann1] Peter Gutmann, "Random Number Generation," (Chapter 6 of Cryptographic Security Architecture: Design and Verification, Springer 2004).

[Denker1] John Denker, "Turbid: A High-Entropy Random Generator," Section 7.2, 2005.

[Denker2] John Denker, Modern Thermodynamics.

[Nyquist&Johnson] Nyquist and Johnson, combined works, 1928.

[Gutmann2] Peter Gutmann, "Testing Issues with OS-based Entropy Sources," NIST Random Number Generation Workshop.

[Yarrow] J. Kelsey, B. Schneier, and N. Ferguson, "Yarrow-160: Notes on the Design and Analysis of the Yarrow Cryptographic Pseudorandom Number Generator," Sixth Annual Workshop on Selected Areas in Cryptography, Springer Verlag, August 1999.

[Fortuna] Ferguson & Schneier, Practical Cryptography, chapter 10, Wiley (2003) ISBN 0-471-22357-3 pbk. Also, Ferguson, Schneier & Kohno, Cryptography Engineering, chapter 9, (2010) ISBN 978-0-470-47424-2.