# Benford’s Law

In our modern world, we are surrounded by numbers. These numbers include speed limits, prices of goods, ages of friends and distances to travel. And if you look at these numbers, you’ll find a very odd thing – nature has a soft spot for the number “one”.

This was first discovered in 1881, when the American astronomer Simon Newcomb published a short note in the American Journal of Mathematics.

He’d noticed something odd about books of logarithms. Logarithms were how scientists and engineers multiplied big numbers before cheap electronic calculators appeared in the 1970s. If you wanted to multiply two numbers together, you’d look up their logarithms in a book, add them together, look up the anti-logarithm of that number, and there was your answer.

Newcomb noticed something odd about these books of logarithms – the early pages were much dirtier than the last pages. This meant that the scientists and engineers spent a lot of time dealing with numbers beginning with 1, less time with numbers beginning with 2, and so forth. In fact, Newcomb came up with the law that the probability that a number will begin with the digit N is equal to log(N + 1) – log(N). But the mathematicians weren’t interested.

The first real burst of interest was generated in 1939 by Frank Benford, a physicist working with the General Electric Company in the USA. He accidentally came across the effect that Newcomb had mentioned. But Benford went a little deeper. He looked at a huge sample, much bigger than Newcomb’s. He analysed over 20,000 numbers drawn from collections ranging from the drainage areas of rivers to stock market figures and various properties of different chemicals. Again he showed that about 30% of numbers began with the digit “1”, 18% with “2”, all the way down to 4.6% of numbers starting with the digit “9”. The Law is now called Benford’s Law in his honour, even though he wasn’t the first to discover it.
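Newcomb’s formula is easy to check. A quick Python sketch (mine, not from the article) reproduces Benford’s percentages:

```python
import math

# Newcomb/Benford: P(leading digit = N) = log(N + 1) - log(N), logs to base 10
def benford_probability(n: int) -> float:
    """Probability under Benford's Law that a number's leading digit is n."""
    return math.log10(n + 1) - math.log10(n)

for n in range(1, 10):
    print(f"digit {n}: {benford_probability(n):.1%}")
```

This prints roughly 30.1% for the digit 1, falling away to 4.6% for the digit 9, and the nine probabilities sum to 1.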

But Benford’s Law doesn’t apply everywhere.

• First, you need to have a big enough sample size so that patterns can show themselves. For example, you almost certainly won’t find Benford’s Law in the heights of your average family of 4.5 people.
• Second, you don’t want numbers that are truly random. In a truly random number, each digit has an equal chance of appearing in any given position, so no leading digit is favoured over any other.
• And third, you don’t want numbers that are the complete opposite of random, and are tightly controlled. So if you deal with numbers that have artificial limits, such as the prices of petrol in a capital city, you won’t find Benford’s Law. Here market forces lock the prices of petrol to stay within a narrow range.

But Benford’s Law does apply to numbers somewhere between totally random and totally constrained – such as the monthly electricity bills in the Solomon Islands.

The numbers that appear in accountancy tables and the balance sheets of companies also follow Benford’s Law. This was discovered by Mark Nigrini, in his Ph.D. thesis written in 1992. He showed that sales figures, buying and selling prices, insurance claim costs and expenses claims should all follow Benford’s Law.

Dr. Nigrini gave his accountancy students an assignment. They were supposed to go through some real sales figures from a real company, and see how many of them started with the digit “one”. One of his students used the figures from his brother-in-law’s hardware shop. The sales figures were astonishing – 93% began with the digit “1”, none of them began with the digits between “2” and “7”, four began with the digit “8”, and 21 with the digit “9”. According to Benford’s Law, the brother-in-law was a crook who was cooking the books – and Benford’s Law was right.

Nigrini calls the use of Benford’s Law to find fraud “digital analysis”. According to Nigrini, “it’s used by listed companies, large private companies, professional firms and government agencies in the USA and Europe – and by one of the world’s biggest audit firms”. But people who are not accountants are also interested in using Benford’s Law to look for other types of trickery. Mark Buyse and his colleagues from the International Institute for Drug Development in Brussels think that Benford’s Law could show up data that had been faked in drug trials.

Peter Schatte, a mathematician from the Bergakademie Technical University in Freiberg, is using Benford’s Law to organise space efficiently on computer hard drives.

Benford’s law predicts a decreasing frequency of first digits, from 1 through 9. Data sets compiled by Benford from numbers appearing on the front pages of newspapers, by Mark Nigrini from the 3,141 county populations in the 1990 U.S. Census, and by Eduardo Ley from the Dow Jones Industrial Average over 1990–93 all follow Benford’s law to within 2 percent.

Benford’s law can be used to test for fraudulent or random-guess data in income tax returns and other financial reports. The first significant digits of true tax data, taken by Mark Nigrini from the line items of 169,662 IRS model files, follow Benford’s law closely. Fraudulent data taken from a 1995 Kings County, New York, District Attorney’s Office study of cash disbursements and payroll in business do not follow Benford’s law.

Likewise, data taken from the author’s study of 743 freshmen’s responses to a request to write down a six-digit number at random do not follow the law. Although these are very specific examples, in general, fraudulent or concocted data appear to have far fewer numbers starting with 1 and many more starting with 6 than do true data.

## Cards in the Hat Explanation

Perhaps the easiest explanation of why probabilities decline from one digit to the next is provided by Weaver. Assume we write numbers on cards, starting with 1 and ending with 999,999, each number getting a separate card. Let P = the probability of getting 1, 2, 3, or 4 as the leading digit. Intuition suggests P = 4/9.

As we number the individual cards, we begin to place them one at a time into a hat. After each card, we ask the question “What is the probability, P, that a card picked at random from the hat at this point in time will have a leading digit of 1, 2, 3, or 4?” The answer would be P = 100 percent for the first four cards, of course. After the fifth card, P would drop to 80 percent. After the sixth card, it would drop to 66.7 percent. After the ninth card, it would drop to 4/9 or 44.4 percent, our overall intuitive level based on the entire batch.

After the tenth card, however, P would rise to 50 percent, since five of the ten cards have a leading digit of 1 through 4. Through card 19, P rises to 73.7 percent, since 11 of the leading digits would be 1, and the initial 2, 3, and 4 are still in the hat, making 14 of the 19 cards winners. As cards 20 through 49 are added, P increases steadily to a maximum of 44/49, or 89.80 percent.

As more cards are added beginning with 50, P then declines steadily, reaching a minimum at the 99th card of 44/99, or 44.4 percent again. With the addition of the 100th card, P begins to rise once more, and will continue to do so until 499 is reached. At that point, 444/499 cards, or 88.98 percent, meet the criteria. At 500, P will begin to fall, ultimately back to 444/999, or about 44.4 percent, again.
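The turning-point values above can be verified by brute force. A small Python sketch (helper names are mine):

```python
from fractions import Fraction

def leading_digit(n: int) -> int:
    """First digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

def p_low_lead(n_cards: int) -> Fraction:
    """P(leading digit is 1-4) once cards 1..n_cards are in the hat."""
    hits = sum(1 for card in range(1, n_cards + 1) if leading_digit(card) <= 4)
    return Fraction(hits, n_cards)

for n in (9, 19, 49, 99, 499, 999):
    print(n, p_low_lead(n), f"{float(p_low_lead(n)):.2%}")
```

This reproduces the exact fractions quoted above: 4/9 at card 9, 14/19 at 19, 44/49 at 49, 44/99 at 99, 444/499 at 499, and 444/999 at 999.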

A key observation here is that each time we pass a turning point in our calculations, the length or span of cards we must cover to reach the next turning point gets longer and longer in real terms. To reach 10, it took only five additional cards after 4. To reach 50, it took 40; to reach 500, it took 400, etc. To reach 500,000 it took 400,000 additional cards. If the maximum check written is a six-digit check, it will have to be over \$500,000 to begin to offset the lead that the lower digits have built up to that point. Because of the way we count, 1, 2, 3, and 4 will always be in the lead.

That is, their probabilities will always be higher unless the number of checks over \$500,000 is greater than the number below \$500,000, which would be rare. Therefore, the lower digits 1, 2, 3, and 4 will nearly always have a higher probability of occurring as leading digits than 5 to 9.

## Working Backwards Explanation

Another intuitive explanation is to consider the largest number in the data set we are dealing with and work backwards. Assume the numbers in the data set represent checks written against an account, and consider the largest check that a company writes. Few companies write many checks in the seven-digit range, that is, over \$1 million, but if a check is in the seven-digit range, it seems reasonable that the chances are it will be closer to \$1 million than to \$9,999,999. The same would be true for six-digit checks (probably more are closer to \$100,000 than to \$999,999). As we work our way backwards, the same holds true (though probably not to the same extent) for five-digit checks, and even four-digit checks. For any digit range in which it is plausible that smaller checks are more likely than larger checks, we would expect to see more 1’s, 2’s, 3’s, 4’s and 5’s as leading digits than 6’s, 7’s, 8’s, or 9’s.

It is important to understand the elements of this explanation:

1. Leading digits inherently involve sequencing, because 1 is smaller than 2, 2 is smaller than 3, and so on.
2. Because of the way we count, 1 always gets a head start in terms of being the leading digit, whether we are dealing with two-digit numbers, three-digit numbers, four-digit numbers, etc.; 2 gets to start next, then 3, and so forth.
3. The expected value of the largest number we are dealing with in any finite table will lie in the center of the ending digit range.
4. In the real world, we are always using numbers that have a minuscule finite digit length compared to the entire range of possible numbers from zero to infinity.

Taken together, these facts lead to the conclusion that the leading significant digits we use are more likely to be 1 than 2, 2 more likely than 3, etc.

## Digit Analysis on Spreadsheets

The foregoing explains why Benford’s Law works for the first digit. Similar explanations can be made for the second, third, and later digits, but the logarithmic effect is diluted each time. Armed with a knowledge of Benford’s Law, anyone can check a column of numbers in Excel by simply making use of the MID worksheet function.

This function slices one or more characters off a cell entry, starting from any specified position. Assume the number 975 appears in cell B8, for example. The formula =MID(B8,2,1) would return the digit 7, because B8 identifies cell B8, the 2 says start with the second digit, and the 1 says return one character. Other examples: =MID(B8,1,1) would return the digit 9; =MID(B8,3,1) would return the digit 5; and =MID(B8,1,2) would return the two digits 97.
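The same slicing is easy to mimic outside Excel. A minimal Python sketch (the `mid` helper is mine, mirroring Excel’s 1-based positions):

```python
def mid(value, start: int, length: int) -> str:
    """Python equivalent of Excel's MID: 1-based start position, returns text."""
    return str(value)[start - 1 : start - 1 + length]

print(mid(975, 2, 1))  # prints 7
print(mid(975, 1, 1))  # prints 9
print(mid(975, 1, 2))  # prints 97
```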

Once the numbers have been parsed into their respective digits, the rest of the analysis follows the normal pattern for statistical significance testing. The chi-square, t, and Kolmogorov–Smirnov tests are the most commonly applied.
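One way to sketch the chi-square comparison in plain Python (the function names and the choice of the 5-percent critical value, 15.51 for 8 degrees of freedom, are mine):

```python
import math
from collections import Counter

def first_digit(x) -> int:
    """First significant digit of a non-zero number, via scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def chi_square_benford(data) -> float:
    """Chi-square statistic comparing observed first digits to Benford's Law.
    Compare against 15.51 (8 degrees of freedom, 5 percent level)."""
    n = len(data)
    counts = Counter(first_digit(x) for x in data)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Powers of 2 are known to conform closely to Benford's Law:
print(chi_square_benford([2**k for k in range(200)]))
```

A statistic well below the critical value means the digits are consistent with Benford’s Law; a large statistic flags the data for closer inspection.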

## Rounded Ending Numbers

Rounding of the final digits is checked by simply comparing the actual frequencies of multiples of 10, 25, 100, and 1000 to their theoretical expectancies (0.10, 0.04, 0.01, and 0.001, respectively). Rounding often indicates estimation, which may be inappropriate for taxable items such as sales or inventory counts.

Analysis of the last two digits proceeds in a similar manner. In numbers containing four or more digits, the last two should follow a uniform distribution, with each of the 100 possible two-digit endings appearing one percent of the time. Final digits in Excel are captured by sorting the original data into a descending array, then using the MID worksheet function, as explained earlier, for the various digit ranges separately.
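The rounding comparison is a one-liner in Python. A sketch, assuming whole-number amounts (the function name is mine):

```python
def rounding_rates(amounts):
    """Share of amounts divisible by 10, 25, 100, and 1000.
    Theoretical expectancies with no rounding bias: 0.10, 0.04, 0.01, 0.001."""
    n = len(amounts)
    return {m: sum(1 for a in amounts if a % m == 0) / n for m in (10, 25, 100, 1000)}

# Unbiased amounts hit the theoretical rates exactly:
print(rounding_rates(range(1, 10001)))
```

Rates well above the expectancies suggest that the amounts were estimated rather than recorded.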

## Duplicates

Duplicate numbers in Excel are found by simply sorting the column of numbers first, then subtracting each entry from the one that follows it. Zero differences mean duplicate entries.

The simplest method to estimate the expected number of duplicates is based on the well-known birthday problem. If you have 100 people in a room, how many matching birthdays do you expect to get? The most common solution is to assume a uniform distribution, so that the probability of any single pair of people having identical birthdays is 1/365. How many different ways can you pair up the 100 people? That is, how many different combinations of 2 can you make from n choices? The answer is n!/(2!(n − 2)!) = 4,950 for n = 100. Multiplying the 4,950 possible pairs by 1/365 yields 13.56 as the expected number of matching birthdays.
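That arithmetic can be checked directly (a sketch; the function name is mine):

```python
from math import comb

def expected_matches(n_people: int, n_days: int = 365) -> float:
    """Expected number of matching pairs, assuming birthdays uniform over n_days."""
    return comb(n_people, 2) / n_days  # C(n, 2) pairs, each matching with prob 1/n_days

print(expected_matches(100))  # 4950 pairs / 365 days, about 13.56
```

For duplicate detection the same formula applies with `n_days` replaced by the number of equally likely values an entry could take.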

## Why Benford’s Law Works: A Calculus Explanation

If we were to pick any number in the finite range from 1 to x, the chances of selecting any specific number would be 1/x. For the numbers from 1 to 999, the chances of picking any particular number would be 1/999. Any list of “real world” numbers that have not been generated artificially by some process represents a random sample of numbers between 0 and the largest number on the list, x. The checks written by a company would represent an example of such a list.

For real-world data of this kind, the appropriate density function is proportional to 1/x. The area underneath this density function between any two points a and b yields the probability of getting a value lying between a and b. This is precisely the same, of course, as calculating probabilities of normally distributed random variables; only in this case, the density function is simply 1/x rather than the complicated mathematical equation discovered by Gauss that describes the normal curve. From integral calculus, we know that the probability of getting a value between a and b is the area under the distribution ‘curve’ between a and b and is derived as

$$\int_a^b \frac{1}{x}\,dx = \ln(b) - \ln(a) = \ln(b/a) \tag{2}$$

Proving this is beyond the scope of this post; most mathematicians treat it as a definition, but Feller attempts a proof. For single-digit numbers, if a = n and b = the next digit n + 1, then the probability that the digit equals n is ln[(n+1)/n] = ln(1 + 1/n). Raimi shows that because natural logs are simply a scalar multiple of base-10 logs (i.e. ln(a) = ln(10) × log(a) = 2.3026 log(a)), the percentages will be the same whether ln or log is used. Benford’s use of logs to the base 10, rather than natural logs to the base e, is based on the principle that we use a numbering system based on 10. Raimi provides an excellent review and discussion of the mathematical treatment of the problem.
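Formula (2) can be checked numerically; a sketch using a simple midpoint rule (the function name is mine). Normalising the area for digit 1 by the area over the whole span 1 to 10 recovers the Benford probability log(2) ≈ 30.1%:

```python
import math

def integrate_recip(a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of 1/x from a to b."""
    h = (b - a) / steps
    return sum(h / (a + (i + 0.5) * h) for i in range(steps))

# Formula (2): the area under 1/x from 1 to 2 is ln(2/1)
print(integrate_recip(1, 2))                           # close to ln 2
# Normalised by the span 1..10, this is Benford's probability for digit 1:
print(integrate_recip(1, 2) / integrate_recip(1, 10))  # close to log10(2)
```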

### One Comment on “Benford’s Law”

1. Kelly Bonds says:

Hi Bruce,

Sounds like the answer might be tied to the nature of growth over time. The nature of most natural progressions of growth is a sequence where lower measurements precede larger measurements. There never seems to be a matching decay model that mirrors growth functions and would provide a matching offset. There is either stagnation, or death.
Another exercise that I do hints at this phenomenon. Grab handfuls of snacks, and then count the occurrences of ending up with 1 or 0 items after pairing off and counting what is in your hand.
Fascinating post. Thanks
Kelly
