Nicklas Wright, Class of 2022

Throughout the COVID-19 pandemic, one of the most frequently asked questions has been: “How many people are infected?” Data from the CDC, WHO, and other global entities has been useful in providing this information, but skepticism regarding this data is still prevalent. Some countries were accused of keeping their numbers artificially low by underreporting infections while others were accused of inflating their numbers to spread panic and justify lockdowns. Some recent studies assessed the accuracy of COVID-19 data using statistical methods in hopes of coming to an unbiased answer.

These studies relied on an observation known as Benford’s Law. Benford’s Law states that in a sample of random numbers that spans several orders of magnitude, the chance of any digit “x” being the first digit is equal to log10(x+1) - log10(x), where log10 is the base-10 logarithm. This means the first digit of any number in the sample is most likely to be 1 (about 30% of the time) and least likely to be 9 (under 5% of the time). This rule applies to a surprisingly wide range of data sets. Tax returns, bank account balances, and even the collection of physical constants such as the gas constant R follow the Benford distribution. When people make up data, they tend to use the digits 1 through 9 about equally often as the first digit. Because high digits are rarely seen as the first digit in a truly random sample, Benford’s Law can be used to detect data manipulation. The law also makes predictions about the second and third digits of random numbers, and once again low digits are more likely to occur. However, the difference in frequency is not as pronounced as it is for the first digit.
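The first-digit formula above can be evaluated directly, taking the logarithm as base 10. The short Python sketch below is only an illustration; the function name `benford_first_digit_probs` is a hypothetical one chosen for this example and does not come from either study.

```python
import math

def benford_first_digit_probs():
    """Return {digit: probability} for first digits 1 through 9 under Benford's Law."""
    # P(d) = log10(d + 1) - log10(d)
    return {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.3f}")
```

The nine probabilities add up to 1, and they fall steadily from roughly 0.301 for the digit 1 to roughly 0.046 for the digit 9.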

Benford’s Law is routinely used to identify accounting and tax fraud and has even been admitted as evidence in court. It is also often applied to election data, including data from the 2020 US elections. Researchers sought to use the law to analyze COVID-19 data from countries around the world, after confirming that the incoming data had a random distribution. A study by Koch and Okamura focused on China, Italy, and the United States, while a separate study by Wei and Vellwock covered a wide range of countries. Both studies examined infections and deaths, as daily counts and as running totals. They also used data from states, provinces, and counties to obtain a sample large enough to span several orders of magnitude. Taken as a whole, the data from around the world fits Benford’s Law very closely. This suggests that COVID-19 data in general is a good candidate for testing with Benford’s Law.

For the vast majority of countries examined, the data matched the expected Benford distribution. Researchers used a chi-squared test as well as a Kolmogorov-Smirnov test to determine how well the data fit the distribution. Interestingly, China’s data fit Benford’s Law well. This is significant because the outbreak began there and China was frequently accused of manipulating its numbers. Only two countries had data that did not match well with Benford’s Law. Russia’s data for daily deaths and infections, as well as total deaths and infections, was far off. Instead of having a high proportion of 1s, 2s, and 3s, Russia’s numbers showed a nearly even distribution of all digits in the first position. This is a classic sign of data manipulation. The numbers from Iran also deviated from Benford’s Law in certain categories, but matched it in others. The study did not determine whether these deviations were statistically significant, but suggested a slight possibility of data manipulation. Besides China, Russia and Iran are two of the countries most often accused of reporting false numbers. These studies suggest that there is potentially some merit to those accusations.
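A chi-squared comparison of the kind mentioned above can be sketched in a few lines of Python. This is not the code used in either study. The two sample data sets are invented for illustration (powers of 2, which are known to follow Benford’s Law closely, and an evenly spread set of three-digit numbers), and the 5% significance level behind the critical value is an assumption made for this example.

```python
import math
from collections import Counter

def first_digit(n):
    """Leading digit of a positive integer."""
    return int(str(n)[0])

def benford_chi_squared(counts):
    """Chi-squared statistic comparing first-digit frequencies with Benford's Law."""
    digits = [first_digit(c) for c in counts if c > 0]
    observed = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count of digit d
        stat += (observed[d] - expected) ** 2 / expected
    return stat

# Critical value of the chi-squared distribution with 8 degrees of freedom
# (9 digits minus 1) at an assumed 5% significance level.
CRITICAL_5PCT_DF8 = 15.507

# Hypothetical data sets, not real case counts.
benford_like = [2 ** k for k in range(1, 201)]   # spans many orders of magnitude
evenly_spread = list(range(100, 1000))           # each first digit equally common

for name, data in [("powers of 2", benford_like), ("even spread", evenly_spread)]:
    stat = benford_chi_squared(data)
    verdict = "deviates from" if stat > CRITICAL_5PCT_DF8 else "is consistent with"
    print(f"{name}: chi-squared = {stat:.1f}, {verdict} Benford's Law")
```

A statistic above the critical value means the first digits differ from the Benford distribution by more than chance alone would readily explain. The evenly spread set fails the test for the same reason the text gives for Russia’s numbers: its first digits are all about equally common.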

It is important to note that Benford’s Law does not prove fraud. The results point to possible manipulation in Russia and Iran, but further investigation is needed to reach a definite conclusion, particularly for Iran, where the deviations were only slight. Furthermore, a match with the Benford distribution does not guarantee that a country’s data is accurate. If one is aware of Benford’s Law, it is possible to produce numbers that match up with it perfectly. This might be difficult to coordinate on a countrywide basis, but it is still a possibility.
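To see why a good fit is not proof of honest reporting, consider how easily made-up numbers can be given Benford-like first digits. The Python sketch below is a hypothetical illustration, not a method described in either study: it draws an exponent uniformly at random and raises 10 to that power, which is one standard way to get first digits that follow the Benford distribution. The function names, the seed, and the range of four orders of magnitude are all arbitrary choices for this example.

```python
import random
from collections import Counter

def benford_like_sample(n, decades=4, seed=0):
    """Generate n made-up values between 1 and 10**decades with Benford-like first digits."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [10 ** rng.uniform(0, decades) for _ in range(n)]

def first_digit_shares(values):
    """Share of values starting with each digit 1 through 9 (values must be >= 1)."""
    digits = [int(str(int(v))[0]) for v in values]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

shares = first_digit_shares(benford_like_sample(100_000))
for digit, share in shares.items():
    print(f"{digit}: {share:.3f}")
```

In a large sample generated this way, about 30% of the values start with 1 and fewer than 5% start with 9, even though every number is invented.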

Overall, the fact that most of the data shows no sign of manipulation is encouraging. It is especially important that China’s data matches the expected distribution, since it is critical to have reliable information about the initial spread of the virus. Data from China has been used to predict how the virus will spread in new countries and to inform how those countries respond to the pandemic. Mistrust in the data has been widespread throughout the pandemic, and these studies will hopefully restore confidence in the world’s data collection efforts.

References

1. Wei A, Vellwock AE. 2020. Is COVID-19 data reliable? A statistical analysis with Benford’s Law. doi:10.13140/RG.2.2.31321.75365/1.

2. Koch C, Okamura K. 2020. Benford’s Law and COVID-19 reporting. Economics Letters. 196:109573. doi:10.1016/j.econlet.2020.109573.

3. Kvam PH, Vidakovic B. Nonparametric Statistics with Applications to Science and Engineering. p. 158.
