Your Hadoop Big Data Lake Still Needs A Database River

Hadoop is the dominant platform for analysing big data, and as an open source project, it can run on a variety of operating systems. Whichever OS you choose, one core principle remains the same: your new big data systems are a complement to existing databases and analytics systems, not a replacement for them.

Rohit Bakhshi, product manager for Hortonworks Data Platform (HDP), gave a detailed presentation on Hadoop and how to use it to maximum effect on Windows at the recent TechEd US 2014 event in Houston. HDP combines Hadoop with other core components that have been tested for interoperability, meaning you can deploy analytics servers more quickly. "Every single pillar is a separate open source project," Bakhshi said. "HDP aims to integrate everything with the latest and most stable version of each individual project."

"People generally start with Hadoop on a very specific application," Bakhshi said. "You'll find one new data source you want to incorporate into your daily analysis. It could be traffic from a customer-facing weblog or supply chain machines generating thousands of data points every hour. That's often when you start with a Hadoop cluster. As you get success with that application, other groups in your enterprise will want to get on that software platform."

"Organisations over time end up with a data lake, which is one platform, complementary to your relational database, to store all your enterprise data in its raw form," he said. Recognising that complementary role is important.

“Hadoop at its core is a scalable raw file system. When you write data you can write it in its native form — there’s nothing compelling you to write a schema,” Bakhshi said. “That makes writes extremely fast. Only when you do processing and you do reads do you seek out metadata to understand that information. It really fits loosely structured types of data because you’re just loading it as raw files. You can write a query and it will act on the specific nodes where your data lives. That’s why you can get so much performance and scale.”
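That schema-on-read idea is easiest to see in code. Here is a minimal sketch in plain Python (the log format and field names are hypothetical): the raw lines are stored exactly as they arrived, and the "schema" exists only in the code that reads them, so it can change without rewriting the stored data.

```python
# A minimal sketch of schema-on-read. The file contents and field layout
# are hypothetical; the point is that structure is imposed only when the
# data is read, never when it is written.
import csv
import io

# Raw lines land in storage exactly as they arrived, no schema required.
raw_log = io.StringIO(
    "2014-05-12T09:01:22,checkout,user-481,12.50\n"
    "2014-05-12T09:01:25,pageview,user-112,\n"
)

# The "schema" lives in the reading code, not in the storage layer.
fields = ["timestamp", "event", "user_id", "amount"]

for row in csv.reader(raw_log):
    record = dict(zip(fields, row))
    print(record["event"], record["user_id"])
```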

You also need to realise that Hadoop itself is just a small part of the picture. “The data platform is more than just storage and processing,” Bakhshi said. “It covers management, integration, security and access patterns.”

The tools you choose for those tasks will often vary depending on the expertise you have available. “As people work with different databases, there’s a skill set. Some people are very comfortable with scripting data workflow.” Those users are likely to choose Pig.
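Pig Latin is what those users would actually write. As a rough, hypothetical Python analogue of the same dataflow shape a Pig script expresses (load, filter, group, count):

```python
# A rough analogue of a Pig-style dataflow: load records, filter by a
# predicate, group by a key, and count. The data and field names are
# invented for illustration.
from collections import Counter

weblog = [
    ("user-481", "checkout"),
    ("user-112", "pageview"),
    ("user-481", "pageview"),
]

# Equivalent in spirit to: FILTER by event, GROUP by user, COUNT(*)
pageviews = Counter(user for user, event in weblog if event == "pageview")
for user, count in pageviews.items():
    print(user, count)
```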

"What we've seen in the last year is a shift and an opening up of new use cases," Bakhshi said. "Multiple workloads were enabled because of the shift in architecture. Hadoop was created with HDFS as the storage layer and one processing pattern, MapReduce. What we did in the last year is abstract the processing away from the storage layer. Now you can run all these different processing workloads, not just MapReduce."
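MapReduce, the original pattern Bakhshi refers to, is simple to sketch. Below is a hypothetical word-count script in the Hadoop Streaming style, where the mapper and reducer are ordinary programs reading stdin and writing stdout; on a cluster, Hadoop handles the shuffle and sort between the two phases, and a local shell pipeline can stand in for it.

```python
# A word-count sketch in the MapReduce style used by Hadoop Streaming.
# Try it locally with:
#   cat input.txt | python wc.py map | sort | python wc.py reduce
import sys

def mapper():
    # Emit a (word, 1) pair for every word seen.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for each word are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```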

