# Benford Synthetic Data

Introduction and Bloomberg Billionaires

The other day I updated a page I had written on the subject of Benford’s Law: it’s a mathematical law that fascinates people like me and is a huge friend of forensic accountants and auditors, to name but two.

Here is another page on Benford but this time to try to answer the question: is it possible to fool Benford by simulating a Benford compliant data set? Simple question and the answer is: YES! An email friend asked me that question yesterday and this morning I put together the file you can download from the link below. Here are the approaches I tried as I explored the Bloomberg billionaire 500 list that they published the other day:

• the actual data
• Normalised data
• Standardised data
• NORM.INV function
• Synthetic distribution

Workings

My basic workings are all in columns B:F of the data_workings tab. My Digit 1 analyses can be found in columns I:O. In columns I:O you will see the frequencies in rows 2:11 and the percentages in rows 13:21. The formulas/functions for these methods are:

• the actual data … taken from Bloomberg
• Normalised data =(B2-H\$5)/(H\$4-H\$5)*1000
• Standardised data =STANDARDIZE(B2,H\$2,H\$3)*1000
• NORM.INV function =NORM.INV(RAND(),H\$2,H\$3)
• Synthetic distribution =10*(10^(2/1000))^(A2-1)

Outputs

Look at what I have done as I have copied these formulas down 500 rows to match the raw, actual, data. The formula for extracting digit 1 from each of these distributions is =VALUE(LEFT(ABS(D2),1)) … you need the ABS for the columns that return negative values, otherwise just use =VALUE(LEFT(B2,1)).

I drew histograms from these distributions: You can see that they are all heavily positively skewed, just like the raw data, apart from the distribution that is based on the NORM.INV function, where the distribution is virtually normal.

That leads on to the Frequency tables I have mentioned in columns I:O:

I use the ARRAY entered FREQUENCY function, =FREQUENCY(Q2:Q501,I2:I10) for the tables in rows 2:11 and for the percentages, I just used =J2/SUM(J\$2:J\$10)

Here is the frequency work I have done: Finally, I created the following graphs that compare the Actual distribution of Digit 1s with the Benford standard for that: The frequency tables and these final graphs provide the answer!

• the actual data … not Benford compliant
• Normalised data … may be Benford acceptable
• Standardised data … not Benford compliant
• NORM.INV function … may be Benford acceptable
• Synthetic distribution … looks Benford perfect, don’t you think?

Before you do anything else, open my file and use it as a template to work on your own data and se if you can replicate these results.

Conclusions

The synthetic distribution is the most reliable way to create a Benford compliant distribution BUT do notice that none of the methods replicates the raw data, they provide distributions from the raw data that are Benford compliant. In essence, what it means is that we CAN fool Benford with synthetic, artificial, data if we want to.