The crucial thing about "big data" is the data. "Big" is relative, and while size often matters, real disruption can come from data of any size.
This is not a new idea; it is several hundred years old. The key advance of the scientific revolution (and the associated industrial revolution) was the realisation that in order to understand something you had to measure it, that is, to gather the data.
The modern hoopla about "big data" is simply the scientific method applied to a wider range of problems. Doing this cheaply enough is the challenge.
While the idea of collecting the data is the most fundamental, it is not sufficient. Analysts need to make sense of the data. The field of statistics was developed well over a century ago to help do so, originally for kings so they could know how much tax they could raise (the word "statistics" shares etymology with "state").
Statistical thinking involves computing functions of data. Until recently the ability to do these computations was a major bottleneck. The reduction, by many orders of magnitude, in the cost per byte stored or computation done that we have seen in the last couple of decades (driven by technological advances in chip fabrication and disk drive manufacture) has removed the old bottlenecks. It is this reduction in cost that has enabled the "big data revolution".
Many businesses and organisations are now gathering and keeping data on a much finer grained scale than before. Rather than tabulating aggregate sales figures, a large retailer can now store every single purchase made by every single customer. With this they can understand patterns of consumer behaviour well enough to tailor their offerings in a very personalised way.
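To make the contrast concrete, here is a minimal sketch in Python of the difference between an aggregate sales figure and per-customer purchase patterns. The records, field layout and function names are invented purely for illustration; a real retailer's data would be far larger and messier.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, product, amount_in_dollars).
# These values are made up for illustration only.
purchases = [
    ("c1", "bread", 3.50),
    ("c1", "milk", 2.00),
    ("c2", "bread", 3.50),
    ("c2", "coffee", 12.00),
    ("c1", "bread", 3.50),
]


def aggregate_sales(records):
    """Total revenue: the coarse figure a retailer traditionally tabulated."""
    return sum(amount for _, _, amount in records)


def purchases_by_customer(records):
    """Count how often each customer bought each product."""
    counts = defaultdict(lambda: defaultdict(int))
    for customer, product, _ in records:
        counts[customer][product] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {customer: dict(products) for customer, products in counts.items()}


total = aggregate_sales(purchases)
per_customer = purchases_by_customer(purchases)

print(total)
print(per_customer)
```

The single total says nothing about who buys what, while the per-customer counts already show one shopper returning for the same item. Both are simply functions computed from the same stored data.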
To continue the analogy with the methods of science, this allows business people of all types to approach their business as a scientist would an experiment.
The use of data-centric techniques for marketing and the analysis of customer behaviour is certainly the most visible use of big data in industry, but it is actually just the tip of the iceberg. It is perhaps popular now since businesses typically record a lot of this data for other purposes (such as payments).
The real disruptions are likely to occur when business leaders realise they can measure (and then potentially make sense of) many other aspects of their business.
- A bus company can measure every single journey on each bus by capturing the data from electronic payment systems. It can use this to optimise its routes and timetable in a much more fine grained manner than before.
- A city can potentially control all of its traffic lights on the basis of real-time information about traffic across the whole city, rather than simply controlling them locally at each intersection.
- Energy companies can measure the output of rooftop solar panels and predict the energy produced 10 minutes hence.
- Hospitals can mine nurses' daily records to detect deadly fungal diseases before they are noticed by other means.
Any problem where there is something you can measure is amenable to doing better with the techniques of data science.
A few hurdles
What are the challenges and barriers to the radical use of data and disruption of diverse businesses? There are three important factors.
First, the tools for making sense of data are still very primitive in the sense of their usability and composability with other business processes. They have been developed by and for data scientists. It will take considerable engineering work to make them more widely usable.
Second, there is a growing concern about who owns data and what is done with it. It is not the NSA that is the worry here, but rather search/advertising companies, who not only can learn a lot about you, but then make their money precisely by subsequently manipulating you with this information.
While it seems impossible to reliably contain the spread and leaking of personal data, it is in principle possible to regulate its use – this is the premise behind the notion of "data accountability". After all, it is not who knows certain things about you that matters, but what they do with it. Whether we can evolve to a system that provides adequate protection and transparency of data remains to be seen.
Third, there is likely to remain a substantial skill shortage for some time. Computation and storage, even at a large scale, have been commoditised with cloud computing. But the process of extracting sense and meaning out of data is far from being a commodity.
Many businesses from banks to transport companies are now hiring data scientists or machine learning experts to help them ask the right questions of their data and process it appropriately. Businesses also need to adapt the way they think about what they do and especially how they can deal with uncertainty. Merely measuring everything does not give certainty – all the data in the world will still not help you predict the future, but it will give you some clues.
The businesses and countries that stay ahead with big data analytics are going to be the winners of the 21st century.
Robert Williamson is Professor, College of Engineering and Computer Science at Australian National University. He receives funding (via the ANU) from the Australian Research Council, and via NICTA, from the Commonwealth Government. He is affiliated with the Australian National University and NICTA.