by Thor Olavsrud

Ease Big Data Hiring Pain With Cascading

Feature
Jun 6, 2012 · 5 mins
Business Intelligence, Engineer, Java

Finding developers with the skills to create MapReduce jobs in Apache Hadoop is challenging, but you can ease that hiring pain with Cascading, an open source Java application framework for building enterprise Big Data applications on Hadoop.

Around the world and across industries, companies are investing in Big Data technologies, especially Apache Hadoop. But building Big Data applications on Apache Hadoop using Hadoop MapReduce is challenging. Finding developers who can create MapReduce jobs for your Big Data projects is more challenging still. Finding an API that allows Java developers to use their existing skills to query Big Data datasets? Priceless.

Almost every company (91 percent) is already using tools to manage and analyze data, according to an April 2012 survey by managed services provider Avanade. Avanade surveyed 569 C-level executives, business unit leaders and IT decision-makers in 18 countries in an effort to quantify attitudes and adoption trends surrounding Big Data.

Big Data at a Tipping Point

“Big data is reaching a tipping point where it is becoming much more mainstream,” says Steve Palmer, Avanade’s Business Intelligence Leader for North America. “In our survey, 95 percent of companies surveyed do not consider data analysts to be part of the IT staff anymore. Instead, they’re distributing them into the line of business.”

While data management and analysis have historically been considered IT jobs, the study found that 58 percent of respondents say data management is now embedded throughout their business operations, and 59 percent of global companies say more employees than ever before are involved in making decisions as a result of more widely available company data.

Palmer says the survey also found that companies are translating this changing perspective into Big Data technology investment. The technologies already most widely in use are data storage, reporting, data integration and enterprise search. Seventy-five percent of survey respondents say their company will make additional investments to improve their capability to analyze data within the next 12 months, especially in Big Data technologies such as predictive analytics, mobile data access and management tools.

While the majority of executives (58 percent) believe finding the right technology is the biggest challenge their companies face in analyzing data, the majority (56 percent) of IT decision-makers charged with implementing Big Data programs believe finding the right staff is a bigger challenge than finding the right technology. And it should come as no surprise that 63 percent of stakeholders believe their company needs to develop new skills to turn data into business insights, especially math and statistics (17 percent), business operations and analysis (37 percent) and visual design and reporting (22 percent).

Developers with Apache MapReduce Skills Are in High Demand

One of the major challenges companies face when they set out to transform the data they store into actionable insight is finding developers able to create MapReduce jobs to query Hadoop-stored datasets. MapReduce is a complicated and difficult framework to use.

“Folks that know MapReduce are a tough find and they’re in high demand,” says Brandon Mason, CTO of Upstream Software, a specialist in integrated marketing performance management. Upstream analyzes all the marketing data a retailer has (including Coremetrics or Omniture logs, keywords shoppers use, direct mail logs, email logs and so forth) to help retailers properly weight their marketing mix. “To do the secret sauce stuff, we really needed a platform to handle lots of different data sets. Sometimes it’s very dirty.”

Open Source Cascading Is Alternative to MapReduce

Enter Cascading, a stand-alone open source Java application framework designed as an alternative API to MapReduce. Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset.

“I created Cascading in anger after having used MapReduce once in my life and vowing never to use it again,” explains Chris Wensel, creator of Cascading.

Wensel authored Cascading as an open source project in 2007 and is now CEO of Concurrent, an enterprise Big Data application platform company that continues to drive development of Cascading as its primary commercial sponsor. Concurrent numbers companies like Twitter and Etsy, as well as Upstream, among its clients. Twitter has three internal teams that use Cascading to perform sophisticated statistical functions to analyze huge volumes of data from tweet contents, ad campaigns and user activity. Etsy executes more than 65 Cascading applications daily to extract data from its web logs and databases in order to monitor and understand user behavior, support A/B site testing and power new features on its ecommerce site.

On Tuesday, Concurrent released Cascading 2.0 under the Apache 2.0 License Agreement. Cascading 2.0 adds a number of new features, including an in-memory mode that lets users run applications on a local computer to rapidly test Big Data applications in development. Upstream’s Mason says his company made the switch to Cascading 2.0 about two months ago. But even as the CTO of a company that lives and dies on its ability to leverage Big Data, Mason is not as excited by the new features of Cascading as he is by the ability to use it to more easily build a team to meet Upstream’s needs.

Leverage Java Developers for Big Data

“It’s been easier to hire and build up a team around it,” he says. “Cascading makes it look and operate like Java. Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”

He notes that Java developers do need to spend a few weeks learning about Hadoop basics in order to apply their Java knowledge to Cascading, but adds it’s nothing compared with learning raw MapReduce.

“There’s definitely a learning curve,” he says. “But having to learn raw MapReduce is a pretty involved process and takes a lot of time. Cascading just takes that off the table and you don’t have to worry about it. You have to understand the concepts of MapReduce, but you don’t have to put your feet into the weeds of raw MapReduce.”
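
For readers wondering what "the concepts of MapReduce" amount to, the following is a minimal sketch in plain Java of the three ideas a developer would need to grasp: a map step that emits key/value pairs, a shuffle step that groups values by key, and a reduce step that combines each group. It is an illustration only. It uses no Hadoop or Cascading classes, and the class name `WordCountSketch` and its method names are hypothetical, chosen for this example rather than taken from either project.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: plain Java, no Hadoop or Cascading classes involved.
public class WordCountSketch {

    // "Map" step: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" step: group every emitted value under its key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: combine all values for one key into a single total.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in sequence on a single machine.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, data=3, jobs=2}
        System.out.println(countWords(Arrays.asList("big data big jobs", "data jobs data")));
    }
}
```

On a real cluster the map and reduce steps run in parallel across many machines and the framework handles the shuffle. That distributed plumbing is the part Mason describes as "the weeds," and this single-machine sketch deliberately leaves it out.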

Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com. Follow Thor on Twitter @ThorOlavsrud. Follow everything from CIO.com on Twitter @CIOonline and on Facebook.