Trying to choose the right tool for a big data project? This chart (and three simple rules) can help guide you through the options.
This chart is based on one shown by Microsoft Research senior research program manager Wenming Ye during a presentation at Build 2014 last week.
“While choosing appropriate tools is important, skills remain the biggest challenge,” Ye noted. “There’s a lot of talk about challenges in the tools and challenges in the data,” he said. “But what’s really important is actually the people. There’s a lack of understanding — we really have a lack of people who are able to understand and use these distributed tools. And it’s no-one’s fault — a lot of these tools are very difficult.”
Ye suggested three key rules for dealing with big data:
- Make sure that you’re using data to drive decisions, and not merely tracking it for its own sake.
- Continuously update and refine your metrics.
- Use automation to conduct more experiments and ask more questions.
The chart divides big data tasks into three areas: batch processing, interactive analysis and real-time stream processing.
| | Batch processing | Interactive analysis | Stream processing |
|---|---|---|---|
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Users | Developers | Developers and analysts | Developers |
| Open source tools | Hadoop, Spark | Drill, Shark, Impala, HBase | Storm, Apache S4, Kafka |
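To make the batch-versus-stream distinction concrete, here is a minimal, tool-agnostic sketch in Python (not drawn from Ye's talk or any of the listed tools): the same word-count task run as a batch job, which produces one result after reading all the data, versus a streaming job, which emits an updated result after every incoming event.

```python
from collections import Counter

def batch_word_count(lines):
    """Batch: all data is available up front; process it in one pass."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts  # a single result at the end

def stream_word_count(line_stream):
    """Streaming: data arrives continuously; yield updated counts per event."""
    counts = Counter()
    for line in line_stream:  # in a real system, this loop never ends
        counts.update(line.split())
        yield dict(counts)    # each event produces a fresh snapshot

data = ["big data tools", "big challenges"]
print(batch_word_count(data))
for snapshot in stream_word_count(iter(data)):
    print(snapshot)
```

Interactive analysis sits between the two: queries run against data already loaded, but are expected to return in seconds rather than hours.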
Disclosure: Angus Kidman travelled to San Francisco as a guest of Microsoft.