The past and future collided last year for the International Year of Statistics, when six professional organizations celebrated the multifaceted role of statistics in contemporary society to raise public awareness of statistics, and to promote thinking about the future of the discipline. The past came in the form of the 300th anniversary of Jacob Bernoulli’s Ars conjectandi (Art of Conjecturing) and the 250th anniversary of Thomas Bayes’ “An Essay Towards Solving a Problem in the Doctrine of Chances.” The future came in the form of an identity crisis for the discipline brought on by the rise of Big Data.
To cap the year, in November 100 prominent statisticians attended an invitation-only event in London to grapple with the challenges and possible pathways that future presents. Earlier this month, Statistics and Science: A Report of the London Workshop on the Future of the Statistical Sciences, the product of that high-level meeting, was released by the six societies: the American Statistical Association, the Royal Statistical Society, the Bernoulli Society, the Institute of Mathematical Statistics, the International Biometric Society, and the International Statistical Institute.
Method Space is excerpting portions of that report highlighting case studies on the current use of statistics and the challenges the discipline faces, such as the reproducibility crisis.
Without a doubt, the most-discussed current trend in statistics at the Future of Statistics Workshop was Big Data. The ubiquity of this phrase perhaps conceals the fact that different people think of different things when they hear it. For the average citizen, Big Data brings up questions of privacy and confidentiality: What information of mine is out there, and how do I keep people from accessing it? For computer scientists, Big Data poses problems of data storage and management, communication, and computation. And for statisticians, Big Data introduces a whole different set of issues: How can we get usable information out of databases that are so huge and complex that many of our traditional methods can’t handle them? …
Two talks at the London workshop, given by Stephen Fienberg and Cynthia Dwork, focused on privacy and confidentiality issues. Fienberg surveyed the history of confidentiality and pointed out a simple, but not obvious, fact: As far as government records are concerned, the past was much worse than the present. U.S. Census Bureau records had no guarantee of confidentiality at all until 1910. Legal guarantees were gradually introduced over the next two decades, first to protect businesses and then individuals. However, the Second War Powers Act of 1942 rescinded those guarantees. Block-by-block data were used to identify areas in which Japanese-Americans were living, and individual census records were provided to legal authorities such as the Secret Service and Federal Bureau of Investigation on more than one occasion. The act was repealed in 1947, but the damage to public trust could not be repaired so easily.
There are many ways to anonymize records after they are collected without jeopardizing the population-level information that the census is designed for. These methods include adding random noise (Person A reports earning $50,000 per year and the computer adds a random number to it, say –$10,000, drawn from a distribution of random values); swapping data (Person A’s number of dependents is swapped with Person B’s); or matrix masking (an entire array of data, p variables about n people, is transformed by a known mathematical operation—in essence, “smearing” everybody’s data around at once). Statisticians, including many at the U.S. Census Bureau, have been instrumental in working out the mechanics and properties of these methods, which make individual-level information very difficult to retrieve.
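The three methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Census Bureau's procedure: the function names, the Gaussian noise distribution, and the use of a single mixing matrix `A` applied to the rows are all choices made for this example.

```python
import random

def add_noise(values, scale, rng):
    """Add zero-mean random noise to each value (Gaussian is an assumption)."""
    return [v + rng.gauss(0.0, scale) for v in values]

def swap_attribute(records, key, rng):
    """Shuffle one attribute across records; other fields stay in place."""
    shuffled = [r[key] for r in records]
    rng.shuffle(shuffled)
    return [{**r, key: s} for r, s in zip(records, shuffled)]

def matrix_mask(X, A):
    """Return the product A X, where X holds p variables about n people.

    Each output row is a known linear mix of the original rows, so
    individual values are "smeared" across several people at once.
    """
    n, p = len(X), len(X[0])
    return [[sum(A[i][k] * X[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: (income, number of dependents) for three people.
    X = [[50000.0, 2.0], [30000.0, 0.0], [70000.0, 4.0]]
    # Every column of A sums to 1, so column totals of X are preserved.
    A = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
    print(add_noise([row[0] for row in X], 10000.0, rng))
    print(matrix_mask(X, A))
```

In this sketch the masking matrix is chosen so that population totals survive the transformation while no output row matches any one person's record, which is the tradeoff the paragraph describes.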
Cryptography is another discipline that applies mathematical transformations to data, transformations that are either irreversible, reversible only with a password, or reversible only at such great cost that an adversary could not afford to pay it. Cryptography has been through its own sea change since the 1970s. Once it was a science of concealment, which could be afforded by only a few: governments, spies, armies. Now it has more to do with protection, and it is available to everyone. Anybody who uses a bank card at an ATM is using modern cryptography.
One of the most exciting trends in Big Data is the growth of collaboration between the statistics and cryptography communities over the last decade. Dwork, a cryptographer, spoke at the workshop about differential privacy, a new approach that offers strong probabilistic privacy assurances while at the same time acknowledging that perfect security is impossible. Differential privacy provides a way to measure security so that it becomes a commodity: A user can purchase just as much security for her data as she needs.
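One common way to realize this idea is the Laplace mechanism, shown here as a minimal sketch; the report excerpt does not name a specific mechanism, so the choice is an assumption. The function names and the counting query are hypothetical. The parameter `epsilon` plays the role of the amount of privacy "purchased": a smaller `epsilon` means a stronger guarantee and more noise in the released answer.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical records; only the income field is used here.
    records = [{"income": 1000 * i} for i in range(100)]
    high_earner = lambda r: r["income"] >= 50000
    for eps in (0.1, 1.0):
        print(eps, private_count(records, high_earner, eps, rng))
```

Running the query many times shows the cost of the guarantee: answers released with `epsilon = 0.1` scatter roughly ten times more widely around the true count than those released with `epsilon = 1.0`.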
Still, there are many privacy challenges ahead, and the problems have by no means been solved. Most methods of anonymizing do not scale well as p or n get large. Either they add so much noise that new analyses become nearly impossible or they weaken the privacy guarantee. Network-like data pose a special challenge for privacy because so much of the information has to do with relationships between individuals. In summary, there appears to be “no free lunch” in the tradeoff between privacy and information.