How To Choose The Best Tool For Your Big Data Project

Trying to choose the right tool for a big data project? This chart (and three simple rules) can help guide you through the options.

This chart is based on one shown by Microsoft Research senior research program manager Wenming Ye during a presentation at Build 2014 last week.

While choosing appropriate tools is important, skills remain the biggest challenge, Ye noted. “There’s a lot of talk about challenges in the tools and challenges in the data,” he said. “But what’s really important is actually the people. There’s a lack of understanding; we really have a lack of people who are able to understand and use these distributed tools. And it’s no-one’s fault: a lot of these tools are very difficult.”

Ye suggests three key rules when dealing with big data:

  • Make sure that you’re using data to drive decisions, and not merely tracking it for its own sake.
  • Continuously update and refine your metrics.
  • Use automation to conduct more experiments and ask more questions.

The chart divides big data tasks into three areas: batch processing, interactive analysis and real-time stream processing.

                   | Batch processing | Interactive analysis        | Stream processing
Query runtime      | Minutes to hours | Milliseconds to minutes     | Never-ending
Data volume        | TBs to PBs       | GBs to PBs                  | Continuous stream
Programming model  | MapReduce        | Queries                     | DAG
Users              | Developers       | Developers and analysts     | Developers
Open source tools  | Hadoop, Spark    | Drill, Shark, Impala, HBase | Storm, Apache S4, Kafka
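
As a rough illustration of how the chart might be used, the Python sketch below encodes its three columns and picks one from two yes/no questions. The helper name `suggest_category` and the two questions are assumptions made for this example and were not part of Ye's presentation. Listing HBase under interactive analysis is also an assumption.

```python
# Columns of the chart, keyed by task area. Values are copied from the chart;
# placing HBase under interactive analysis is an assumption.
CHART = {
    "batch processing": {
        "query_runtime": "minutes to hours",
        "data_volume": "TBs to PBs",
        "programming_model": "MapReduce",
        "tools": ["Hadoop", "Spark"],
    },
    "interactive analysis": {
        "query_runtime": "milliseconds to minutes",
        "data_volume": "GBs to PBs",
        "programming_model": "queries",
        "tools": ["Drill", "Shark", "Impala", "HBase"],
    },
    "stream processing": {
        "query_runtime": "never-ending",
        "data_volume": "continuous stream",
        "programming_model": "DAG",
        "tools": ["Storm", "Apache S4", "Kafka"],
    },
}


def suggest_category(continuous_input: bool, needs_fast_answers: bool) -> str:
    """Hypothetical helper: map two yes/no questions onto a chart column."""
    if continuous_input:
        # Data that never stops arriving matches the "never-ending" column.
        return "stream processing"
    if needs_fast_answers:
        # Ad hoc queries answered in milliseconds to minutes.
        return "interactive analysis"
    # Everything else can wait minutes to hours for a batch job.
    return "batch processing"


area = suggest_category(continuous_input=False, needs_fast_answers=True)
print(area, "->", ", ".join(CHART[area]["tools"]))
```

The two questions are deliberately coarse. A real project would also weigh the skills of the team, which Ye described as the biggest challenge.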

Disclosure: Angus Kidman travelled to San Francisco as a guest of Microsoft.


  • Angus, I suggest that at the top of the decision tree one should consider HPCC Systems from LexisNexis. We are definitely seeing an increase in activity from companies responding to the impact big data has had on their business. For companies of any size, getting meaningful insights from data analytics is an important priority. LexisNexis has open sourced its HPCC Systems big data platform, which represents more than a decade of internal research and development in the big data analytics field. Designed by data scientists, the platform has built-in libraries for machine learning and BI integration that provide a complete, integrated solution from data ingestion and data processing to data delivery. More at
