Hadoop is the dominant platform for analysing big data, and as an open source project, it can run on a variety of operating systems. Whichever OS you choose, one core principle remains the same: your new big data systems are a complement to existing databases and analytics systems, not a replacement for them.
Picture: Hammonton Photography
Rohit Bakhshi, product manager for Hortonworks Data Platform (HDP), gave a detailed presentation on Hadoop and how to use it to maximum effect on Windows at the recent TechEd US 2014 event in Houston. HDP combines Hadoop with other core components that have been tested for interoperability, meaning you can deploy analytics servers more quickly. "Every single pillar is a separate open source project — HDP aims to integrate everything with the latest and most stable version of each individual project."
"People generally start with Hadoop on a very specific application," Bakhshi said. "You'll find one new data source you want to incorporate into your daily analysis. It could be traffic from a customer-facing weblog or supply chain machines generating thousands of data points every hour. That's often when you start with a Hadoop cluster. As you get success with that application, other groups in your enterprise will want to get on that software platform."
"Organisations over time end up with a data lake, which is one platform, complementary to your relational database, to store all your enterprise data in its raw form." Recognising that distinction is important.
"Hadoop at its core is a scalable raw file system. When you write data you can write it in its native form — there's nothing compelling you to write a schema," Bakhshi said. "That makes writes extremely fast. Only when you do processing and you do reads do you seek out metadata to understand that information. It really fits loosely structured types of data because you're just loading it as raw files. You can write a query and it will act on the specific nodes where your data lives. That's why you can get so much performance and scale."
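To make the "write raw, interpret on read" idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop or any Hadoop API. The weblog lines, the field names and the helper functions (`write_raw`, `read_with_schema`) are all hypothetical, invented purely for illustration, and an in-memory list stands in for the file system.

```python
import csv

# Hypothetical raw weblog lines, kept exactly as they arrived.
RAW_LINES = [
    "2014-05-12T10:00:01,/home,200,512",
    "2014-05-12T10:00:02,/cart,500,87",
    "2014-05-12T10:00:03,/home,200,498",
]


def write_raw(store, lines):
    """Append lines untouched: no parsing or validation at write time."""
    store.extend(lines)


# The schema exists only on the read side: (field name, converter) pairs.
# These field names are an assumption made for this example.
WEBLOG_SCHEMA = [
    ("timestamp", str),
    ("path", str),
    ("status", int),
    ("bytes", int),
]


def read_with_schema(store, schema):
    """Parse each raw line and apply the schema only when the data is read."""
    for row in csv.reader(store):
        yield {
            name: convert(value)
            for (name, convert), value in zip(schema, row)
        }


store = []  # stand-in for a raw file store
write_raw(store, RAW_LINES)

records = list(read_with_schema(store, WEBLOG_SCHEMA))
errors = [r["path"] for r in records if r["status"] >= 500]
total_bytes = sum(r["bytes"] for r in records)

print(errors)
print(total_bytes)
```

The write path does no work beyond appending, which is the property Bakhshi credits for fast writes. A different team could later read the same raw lines with a different schema without rewriting the stored data.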
You also need to realise that Hadoop itself is just a small part of the picture. "The data platform is more than just storage and processing," Bakhshi said. "It covers management, integration, security and access patterns."
The tools you choose for those tasks will often vary depending on the expertise you have available. "As people work with different databases, there's a skill set. Some people are very comfortable with scripting data workflow." Those users are likely to choose Pig.
"What we've seen in the last year is a shift and an opening of new use cases. Multiple workloads were enabled because of the shift in architecture. Hadoop was created with HDFS as the storage layer and one processing pattern, MapReduce. What we did in the last year is abstracted away from the capabilities. Now you can run all these different processing workloads, not just MapReduce."