RNG Range Projection

This page demonstrates the pitfalls of using a pseudo-random number generator with a small native range, namely the standard rand() function of Microsoft Visual C++, to generate output values within arbitrary target ranges.

Addendum 2013-12-17: Stephan T. Lavavej’s presentation rand() Considered Harmful discusses the range projection trap of Visual C++ rand(), along with many other issues, and demonstrates the new C++ 11 facilities which fix them (available in VC++ 2013). Eric Lippert’s analysis How much bias is introduced by the remainder technique? nicely visualizes the distortion.

Overview

Random number generators typically generate values within a certain integer range, defined as [0 … RAND_MAX] in standard C libraries. To use these values in an application, it is usually necessary to project them to an arbitrary target range [0 … n-1], or possibly a shifted range [mm+n-1] which is equivalent to the unshifted range for the purpose of this discussion. The two most commonly used projection methods are modulo (actually remainder but that’s not relevant here) and division:

r_mod = rand() % n;
r_div = (unsigned) ((double) rand() / (RAND_MAX+1) * n);

Many authors recommend against using the modulo method as the random values generated by some algorithms contain low-order bits (which the modulus operator isolates) that are not particularly random at all. These authors then usually proceed to recommend the division method.

However, both methods share the same and much worse flaws, which become apparent and relevant to practical applications if the random number generator has a small “native range”, i.e. a small RAND_MAX:

  1. If n > RAND_MAX+1, some values in the target range [0 … n-1] cannot be generated by either the modulo or the division projection.
  2. If n is not an even multiple of RAND_MAX+1 or vice versa, some values in the target range will be generated much more frequently than others (twice as often or more).

This is due to the fact that values in the target range are mapped to a varying number of values in the native range. Each target value receives a probability of x/(RAND_MAX+1) where x may be different for different target values and may even be zero in the first case, rather than its expected probability of 1/n for an ideal distribution.

Although this fact must seem obvious to professional mathematicians, I confess that I only learned of it from a brief remark in the excellent Accelerated C++ by Andrew Koenig and Barbara E. Moo (Addison-Wesley 2000, p.135, section 7.4.4). Many programmers seem to be unaware of this issue, likely because most libraries provide generators with a large RAND_MAX (typically 232-1). This does not make the flaws of the modulo and division projection disappear but it does render them sufficiently small as to be irrelevant in practice.

Bucketing Algorithm

The solution – if you are either stuck using a generator such as Visual C++ rand() with a small RAND_MAX or if you need a perfect distribution – is to use a bucketing algorithm for projection, as suggested by Koenig/Moo. The one provided in our test program is somewhat more complex as it handles ranges greater than RAND_MAX via recursive calls. Please see function RandBucket in file RandTest.c for this algorithm.

Bucketing retains the exact distribution quality produced by the underlying random number generator for all target ranges, as opposed to modulo/division projection which can produce a significantly worse distribution. Bucketing’s main drawback is its slower speed because (1) random values may have to be re-generated until one hits a bucket, and (2) target ranges greater than RAND_MAX require recursive calls.

On the rest of this page we’ll examine just how badly modulo and division projection can damage the output distribution of a generator with a small native range, and how much of a performance cost we have to pay for bucketing to avoid this problem.

Test Programs

All results shown below were obtained with two small test programs. The download package RandTest.zip (65.6 KB, ZIP archive) comprises the precompiled executables and their complete source code. Please refer to the enclosed ReadMe.txt file and the compilation batch file for the required development tools and expected file paths.

The principal test program is RandTest.c, written to test the rand() library function of Visual C++, a pseudo-random number generator with a small RAND_MAX of only 215-1. We also perform control tests with other generators. The result table identifies the various test combinations as follows:

  • Modulo Visual C++ rand() with modulo projection
  • Division Visual C++ rand() with division projection
  • Bucketing Visual C++ rand() with bucketing projection
  • Secure Visual C++ rand_s() with modulo projection
  • Twister Visual C++ Mersenne Twister with modulo projection
  • C# Random Visual C# System.Random with modulo projection
  • C# Twister Visual C# Mersenne Twister with modulo projection

The “secure” Windows generator rand_s() is an external library function available on Windows XP and later. The System.Random class is the standard generator of the .NET Framework. Finally, I added C and C# versions of the famous Mersenne Twister algorithm (suggested by Nathan Mates). The native range of rand_s() and Mersenne Twister is [0 … 232-1] while that of System.Random is [0 … 231-1].

Invocation and Output

The C program RandTest and the C# program CSharpTest are each called with two parameters:

  • The upper bound n of the target range, indicating a range of [0 … n-1]
  • The iteration count, indicating the number of random values to be generated

The program counts how many times each value of the target range is “hit” by a random number – ideally, each should receive the same number of hits – and then computes the standard deviation for this series. Additionally, the time spent in the random number generation loop is displayed.

Standard deviation is by no means an exhaustive test of the quality of a random number generator, and I’m no expert in this area anyway. However, as we shall see the distribution produced by a generator with a small native range is so massively flawed for certain target ranges that the standard deviation is sufficient to make our point.

The difference in standard deviation becomes more pronounced as the ratio of iteration count to target range increases. The test is useless if the iteration count is no greater than the target range, as many target values would be missed even if the range projection method would not introduce any distribution errors.

Additionally, the distribution distortion of the modulo and division projections only appear for target ranges approaching or exceeding the generator’s native range. So in order to discover this distortion with a native range of [0 … 231-1] or greater, at least one billion hit counters must be kept in memory and at least ten billion iterations performed. I did not attempt to conduct such an experiment.

Sample Test Results

The following tables show two sets of sample test results on my systems. All tests were conducted with 100 million iterations for the indicated target range n. Each test result shows the standard deviation from a perfect hit distribution, followed by the execution time in seconds. The programs were compiled with Visual C++ and Visual C# from Microsoft Visual Studio. Both programs run in 32-bit mode and use full optimization (/Ox, /o) with unchecked arithmetic and no debug information.

The first set of results was obtained on 20 May 2012 (original publication date). The system was Windows 7 SP1 (64 bit) with Visual Studio 2010 Professional SP1 and .NET Framework 4.0, running on an Intel DX58SO motherboard with an Intel Core i7 920 CPU (2.67 GHz) and 6 GB RAM (DDR3-1333).

n Modulo Division Bucketing Secure Twister C# Random C# Twister
100 1,625 2.55 1,727 2.87 877 2.77 938 187.67 946 1.00 1014 1.76 1018 2.25
10,000 1,368 2.52 1,369 2.83 98 3.20 99 187.84 100 1.07 101 1.83 100 2.07
16,384 74 2.53 77 2.84 76 2.75 78 187.93 78 1.09 79 1.74 78 2.08
20,000 1,468 2.52 1,468 2.85 68 5.61 70 188.49 70 1.09 71 1.80 70 2.26
32,768 54 2.54 54 2.87 54 2.75 55 188.09 55 1.08 55 1.79 56 2.11
50,000 1,451 2.49 1,451 2.74 42 6.83 45 190.25 45 1.08 45 1.86 45 2.11
65,536 1,526 2.47 1,526 2.75 37 5.05 39 190.45 39 1.09 39 1.87 39 2.11

The second set of results was obtained on 17 September 2015. The system was Windows 10 Pro (64 bit) with Visual Studio 2015 Community and .NET Framework 4.6, running on a Dell XPS 15 notebook (model 9530) with an Intel Core i7 4712HQ CPU (2.30 GHz) and 16 GB RAM (DDR3-1600).

n Modulo Division Bucketing Secure Twister C# Random C# Twister
100 1,752 4.57 1,815 5.66 1,114 4.90 991 43.43 998 1.59 959 2.82 962 3.10
10,000 1,369 4.99 1,369 5.96 98 5.80 101 44.69 100 1.66 100 2.84 101 3.15
16,384 75 4.97 76 6.47 76 5.24 78 44.61 78 1.67 77 2.85 78 3.15
20,000 1,468 5.05 1,468 6.03 68 9.22 71 44.72 71 1.68 71 2.86 70 3.18
32,768 54 5.04 54 5.86 54 5.27 55 45.06 55 1.70 55 2.88 55 3.26
50,000 1,451 5.10 1,451 6.41 42 12.84 45 45.22 45 1.82 45 2.96 45 3.41
65,536 1,526 5.10 1,526 6.35 37 9.78 39 45.42 39 1.88 39 3.03 39 3.54

The relative timings were quite similar across systems, so we can summarize both tables together.

Distribution

Modulo and Division both show the same unacceptable degradation in distribution quality for target ranges that are either larger than Microsoft’s RAND_MAX=32,767 or not a factor of RAND_MAX+1. Only the results for n=16,384 and n=32,768 are comparable in quality to Bucketing and to generators with a larger native range.

However, this effect is not as pronounced for target ranges far smaller than RAND_MAX. Test case n=100 shows a deterioration of only ca. 160-200% compared to at least an order of magnitude for larger target ranges. This is also presumably the reason why the distribution of the other tested generators was not significantly degraded by the simple modulo projection.

Performance

As if the small native range wasn’t bad enough, Microsoft’s rand() is also strangely slow. The C version of the superior Mersenne Twister runs over twice as fast, and even both tested C# algorithms are significantly faster! Using Bucketing to correct the target range distribution can further slow down rand() by up to a factor of two whenever such a correction is actually necessary.

The Secure algorithm is not just slow, it’s incredibly slow. Depending on the software environment, 1–2 orders of magnitude separate the execution time of Microsoft’s rand_s() function from the rest of the field. Perhaps the quality of the generated random number sequence is worth the wait in a security-critical context, but you certainly wouldn’t want to call this function for any other purpose.

Test Conclusions

Never use the rand() function of Microsoft Visual C++. This generator is slow to begin with, and its small RAND_MAX necessitates the use of bucketing for many output ranges, making it even slower. Generally, avoid pseudo-random number generators with small native ranges and prefer ones with larger ranges, e.g. Mersenne Twister.

Use bucketing projection with any generator for target ranges near your generator’s native range where distribution quality could become a problem, or for any target range if you need a perfect distribution quality. Bucketing guarantees as good a distribution as your generator can offer, but at the price of a significantly reduced speed for certain ranges.

Use a generator with a large native range if you are confident that your target ranges will always be much smaller than the generator’s native range. Such a generator will not require bucketing to produce a good distribution in practical applications, and depending on the distribution quality of its least significant bits it can even work well with the fast & simple modulo projection.