This document discusses Hadoop and its relationship to Microsoft technologies. It provides an overview of what Big Data is, how Hadoop fits into the Windows and Azure environments, and how to program against Hadoop in Microsoft environments. It describes Hadoop capabilities like Extract-Load-Transform and distributed computing. It also discusses how HDFS works on Azure storage and support for Hadoop in .NET, JavaScript, HiveQL, and Polybase. The document aims to show Microsoft's vision of making Hadoop better on Windows and Azure by integrating with technologies like Active Directory, System Center, and SQL Server. It provides links to get started with Hadoop on-premises and on Windows Azure.
2. Session Objectives
• What is Big Data?
• How does it fit into the Windows and Windows Azure environments?
• How do I program against it in the Microsoft environment?
3. What is Big Data?
• Traditionally:
• Physics Experiments, Sensor data, Satellite data, …
• Now:
• Operational Logs
• Customer behavior
• Social interactions online
• …
• From terabytes in the 1990s, through petabytes today, to zettabytes in the future
5. Big Data.
• Volume (Size)
• Variety (Structure)
• Velocity (Speed)
6. New Questions.
• What’s the social sentiment of my product?
• How do I better predict future outcomes?
• How do I optimize my services based on patterns of weather, traffic, etc.?
8. What is Hadoop (v1)?
• Platform for Big Data processing
• Using the “Map-Reduce” Processing Paradigm
• Characteristics:
• Highly-scalable (scaled out)
• Commodity HW-based
• Open Source
=> Very low acquisition and storage costs
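The Map-Reduce paradigm named on this slide can be sketched in a few lines. This is a minimal single-process illustration in JavaScript, not Hadoop code: the `mapReduce` helper, the sample log lines, and their "<status> <path>" layout are all hypothetical, chosen only to show the map, shuffle (group by key), and reduce phases that Hadoop distributes across machines.

```javascript
// Minimal single-process sketch of the map, shuffle, and reduce phases.
// Everything here (function names, sample data) is illustrative, not a Hadoop API.
function mapReduce(records, mapper, reducer) {
  // Map phase: each record yields zero or more [key, value] pairs.
  const pairs = records.flatMap((record) => mapper(record));

  // Shuffle phase: group all emitted values by key.
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  // Reduce phase: collapse each key's list of values into one result.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reducer(key, values);
  }
  return result;
}

// Hypothetical operational log lines in the form "<status> <path>".
const logLines = ["200 /home", "404 /missing", "200 /about", "500 /api", "200 /home"];

const hitsPerStatus = mapReduce(
  logLines,
  (line) => [[line.split(" ")[0], 1]],                 // emit (status, 1)
  (_status, ones) => ones.reduce((a, b) => a + b, 0)   // sum the ones per status
);

console.log(hitsPerStatus);
```

On a real cluster the map and reduce calls run on many nodes and the shuffle moves data between them; the shape of the two user-supplied functions is the part that carries over.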
14. HDFS on Azure: Tale of two File Systems
[Diagram: two storage back ends behind the HDFS API]
• DFS (1 Data Node per Worker Role) and Compute Cluster: NameNode, Data Node, Data Node, …
• Azure Storage Vault (ASV): Containers on Azure Blob Storage, with Front end (×3), Partition Layer, and Stream Layer
15. .Net Map/Reduce Support
• Install NuGet
• Use NuGet to add the Microsoft .Net MapReduce API for Hadoop
• Provide an implementation of a HadoopJob
• Execute the job via either
• MRLib\MRRunner.exe -dll ConsoleAppHadoopJob.exe
or
• HadoopJobExecutor.ExecuteJob<HadoopJobClass>();
• Collect your result on HDFS
16. JavaScript Map/Reduce Support
• Provide map and reduce function variables in a JS file
• Use the JavaScript console with
• runJS('/user/myself/MRjob.js', '/path/to/inputfile', '/path/to/output/dir')
• Collect your result on HDFS
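A job file such as MRjob.js might look like the word-count sketch below. The `map(key, value, context)` and `reduce(key, values, context)` signatures, `context.write`, and the `hasNext()`/`next()` iterator are assumptions based on the preview-era samples, not something this deck confirms. The `runLocally` harness is purely hypothetical and is not part of Hadoop; it only exists so the two functions can be tried in plain Node.js without a cluster.

```javascript
// --- Contents of a job file such as MRjob.js (word count) ---
// Assumed signatures: map(key, value, context) and reduce(key, values, context).
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1); // emit (word, 1)
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum); // emit (word, total)
};

// --- Hypothetical local harness, NOT part of Hadoop ---
// Feeds lines to map, groups the emitted pairs by key, then calls reduce per key.
function runLocally(lines) {
  var grouped = Object.create(null);
  lines.forEach(function (line, offset) {
    map(offset, line, {
      write: function (k, v) {
        (grouped[k] = grouped[k] || []).push(v);
      }
    });
  });

  var output = Object.create(null);
  Object.keys(grouped).forEach(function (k) {
    var i = 0;
    var vals = grouped[k];
    var iterator = {
      hasNext: function () { return i < vals.length; },
      next: function () { return vals[i++]; }
    };
    reduce(k, iterator, {
      write: function (outKey, v) { output[outKey] = v; }
    });
  });
  return output;
}

var counts = runLocally(["Big Data is big", "data in motion"]);
console.log(counts);
```

Only the `map` and `reduce` variables would go into the file passed to runJS; the harness stays on the developer's machine.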
17. Invoking HiveQL Queries
• Run queries in Hadoop Command Shell after invoking hive
• Through the web console
• Programmatically through ODBC
• Coming soon: LINQ to Hive!
18. Polybase – Enhancing PDW query engine
[Diagram: Polybase data flow]
• Users: Data Scientists, BI Users, DB Admins
• Regular T-SQL in, Results out
• Sources feeding Hadoop (unstructured data): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
• Traditional schema-based DW applications feeding PDW V2 (structured data)
• Enhanced PDW query engine
19. Microsoft Hadoop Vision
Better on Windows and Azure
• Active Directory
• System Center
• .Net Programmability
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
Microsoft Business Intelligence (BI)
• Hive ODBC Connectivity
• BI Tools for Big Data
Collaborate with and Contribute to OSS
• Collaborate with Hortonworks
• Provide improvements and Windows support back to OSS
20. Getting started
• On prem: http://www.microsoft.com/bigdata/
• Single node cluster (onebox) install
• C:\hadoop
• Starts local services
• Can start/stop them with start-onebox.cmd/stop-onebox.cmd
• Comes with:
• Hadoop command line (shell)
• Hadoop Status for name node and map-reduce cluster
• HDInsight Dashboard
• On Windows Azure: http://HadoopOnAzure.com/
• 3 node cluster running as a service in Azure
• Can be used for 5 days
• Provides samples and HDInsight Dashboard
• TAP Program
Big Data
This is a picture down the center aisle of a shipping container from one of Microsoft’s datacenters. We put ~1800 computers inside one of these containers. Some of us had the privilege of working on the data storage and computational platform that powers Bing. We used 22 of these containers, spanning 40,000 machines, where we stored over 100PB of data. This was three years ago, and now these servers are almost obsolete.
Big Data is in constant motion and growing at an incredible rate: 90% of the world’s data was generated in just the past two years. That's remarkable growth. Technology history has taught us that the one with the most data wins. The empires of data, like Twitter, Facebook, and Yahoo, are all able to capitalize on the notion that data equates to power. More and more companies are using Hadoop to power Big Data analytics and drive revenue and profit.
It’s all about your data.
I’d like to introduce the 3 V’s of Big Data.
Is it big as in Volume? Where your data exceeds the physical limits of today's systems.
Is it Velocity? The data is moving at a fast rate, and its value can decay over time.
Is it Variety? Of structure, from unstructured through semi-structured to highly structured data.
Doug Laney http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
The answer is: it’s all of the above.
Finally, some refer to a fourth V of Big Data, Value: the value of the insight that can be gained from your Big Data sources.
Given all of this data and the variety of sources, there are new questions that we can answer today that we couldn't answer just a few years ago. By asking and answering these questions you can reap the benefits of Big Data.
Data is everywhere to be mined, but we have what one can call "the pomegranate problem". Imagine all of your data being inside a pomegranate. When you eat a pomegranate, it’s a bit difficult to get all of the little pieces inside out; it's a bit of work. That’s the process you need to go through to extract insights from your data.
It’s useful to think of it this way: your data is the platform, not the tooling that surrounds it. It’s all about the data. It’s all about the questions that you ask.
The second thing I want to talk about is Hadoop and how Hadoop is set up to deliver breakthrough insights from your data.
How many of you are familiar with Hadoop? How many of you are using Hadoop for projects today? How many are planning on using Hadoop in the next 12 months? How about in the cloud?
When people talk about Hadoop they are often talking about specific computational patterns, including map reduce, which emerged as a method to process lots of unstructured data on top of a distributed storage system in a highly fault-tolerant and embarrassingly scalable way. Hadoop allows us to store and process large amounts of data on commodity hardware. In the past you would spend large amounts of money on very specialized hardware. Today you can do this with off-the-shelf hardware running Hadoop. Now, Hadoop doesn’t have a monopoly on “big”, “real time” or “unstructured”, but it does provide some unique capabilities.
I’d like to share my experience with an internal Microsoft service: Halo 4. We launched Halo 4 recently; players are playing over 25 million games per day. Each of those games uploads many metrics, which come in every minute.
Something amazing happened when we moved the Halo 4 event stream into Hadoop/Hive: we noticed a change in how we thought about data. We were freed from the constant anxiety of wondering how we were going to handle an ever-increasing amount of data, and we shifted from trying to store only what we really needed to storing everything. It’s a digital shoebox of information. Then the questions started to shift from how and what to store to how to gain breakthrough insights from the data.
In a traditional database or data warehouse you have to define the structure of the data, or schema, up front. With Hadoop you define the structure of the data when you use it. It’s schema on read versus schema on write.
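The schema-on-read idea in this note can be sketched as follows. This is an illustration only: the tab-separated event layout (player, map, score) and every name in it are hypothetical, since the real Halo 4 telemetry format is not described here. The point is that the raw lines are stored untouched, and each question applies its own structure at read time.

```javascript
// Schema on read: raw lines are stored as-is; structure is applied only
// when a question is asked. The tab-separated layout below is hypothetical.
const rawEvents = [
  "p1\tmapA\t25",
  "p2\tmapA\t18",
  "p1\tmapB\t30",
];

// A "schema" here is just a parsing function applied at read time.
const asScoreEvent = (line) => {
  const [player, map, score] = line.split("\t");
  return { player, map, score: Number(score) };
};

// A different question can impose a different schema on the same raw lines.
const asMapVisit = (line) => ({ map: line.split("\t")[1] });

// Question 1: total score for player p1.
const totalScoreP1 = rawEvents
  .map(asScoreEvent)
  .filter((e) => e.player === "p1")
  .reduce((sum, e) => sum + e.score, 0);

// Question 2: which maps were played at all?
const distinctMaps = [...new Set(rawEvents.map(asMapVisit).map((e) => e.map))];

console.log(totalScoreP1, distinctMaps);
```

With schema on write, both questions would have had to be anticipated when the table was designed; here a new question only needs a new parsing function.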
Manage data of any type or size
To gain the full value of Big Data you need a modern data platform that manages data of any type, whether structured or unstructured, and of any size, from gigabytes to petabytes. Your Big Data solution should also manage data at rest or in motion. Leverage the power of HDInsight on Windows Server or as a Windows Azure service. HDInsight provides simplicity, ease of management, and an open, enterprise-ready Hadoop service that runs on premise or in the cloud.
Enrich your data with the world's data
HDInsight enables you to realize new value in the data you have and to combine these new insights with 3rd-party datasets simply and elegantly. The time your data analysts spend trying to surface the right data and source for your precise needs is costly. By connecting to external data sources you can begin to answer new types of questions and deliver new value in ways that previously were not possible.
Gain insight from any data
You cannot begin to realize the value of Big Data until you can deliver new insights from all types of data: structured, unstructured, previously archived or discarded. The benefits of Big Data are not limited to business intelligence experts or data scientists.
Nearly everyone in your organization can analyze data and make more informed decisions with the right tools, including Microsoft Office Excel.
Key Capabilities
Any Data, Any Size, Anywhere
Microsoft Big Data offers an integrated platform for managing data of any shape or size, whether it’s structured data in relational databases, unstructured data with Hadoop, or streaming data.
Enterprise-ready Hadoop
• Seamlessly extend access privileges across HDInsight with Active Directory.
• Manage your HDInsight clusters easily with System Center 2012.
• Enjoy the reliability and high availability of 100% Apache Hadoop compatible HDInsight.
Gain Windows Simplicity and Manageability for Hadoop
• Simplicity on premise with a virtualized deployment model.
• Consistent platform on Windows or on Windows Azure with shared codebase.
• Deploy Hadoop easily thanks to smart packaging and cloud optimization from Microsoft.
Scale on Demand in the Cloud
• Benefit from deployment options for Big Data on both Windows Server and Windows Azure.
• Enjoy elastic scalability in the cloud.
• Gain better control of your data and costs.
Open Big Data Platform
• Gain from the strategic Microsoft and Hortonworks partnership.
• Leverage the benefits of Microsoft HDInsight, which offers 100% compatibility with Apache Hadoop.
Connecting with the World’s Data
Microsoft offers unparalleled opportunities for discovery and enrichment by enabling end users to connect to the world’s data and services.
Connect Hadoop to the World via Windows Azure Marketplace
• Access a wide variety of data from reliable providers such as the U.S. Census Bureau, United Nations, and Dun & Bradstreet, to name a few.
• Take advantage of hundreds of applications built on the Windows Azure platform.
• Integrate smart data mining algorithms, such as Microsoft Translator, which uses machine learning for automated text translation.
Enrich Your Data with External Information Services
• Convert raw data into useful information through data transformation, advanced analytics, and mashups with external data.
• Utilize out-of-the-box tools, such as SQL Server Integration Services (SSIS) and Data Quality Services, for data transformation and cleansing.
• Enrich your raw data using smart analytical algorithms. (For instance, you can use a segmentation model to enhance targeting.)
Access Predictive Analytics on Hadoop
Gain new insights through predictive analytics, the process of inferring relationships and predictions from huge quantities of data.
• Unlock new insights from all of your data using smart data-mining tools in SQL Server Analysis Services.
• Simplify the data mining process using the Data Mining Add-in for Excel.
• Integrate a range of data mining tools from the open source community, such as Mahout and R.
Immersive Insights, Wherever You Are
Microsoft Big Data empowers end users to gain insights from any data, whether structured or unstructured, with the familiar tools they use every day. Developers can build Big Data applications with tools for simplified Hadoop programming.
I see the real breakthrough insights coming through when you take traditional "Business Intelligence" and add more capabilities like machine learning, predictive analysis, statistical analysis, large-scale graph processing, pattern mining, trend analysis, and economic modeling, all of which are a reality in Hadoop today. The implications of this are quite astounding when you think about it. This is huge.
Big Data, in terms of data volume, variety and velocity at scale, is the first problem. But Big Data solutions and technology by themselves don't solve business objectives. Customers don't have a Hadoop problem; they have analytics, pattern mining, trend analysis, statistical inference, economic modeling, and market regression problems.
Data science starts where utility-class services like Big Data Hadoop end. The real opportunity is to expose data science to everyone.
As powerful as Hadoop is, today it’s still more of a computer scientist’s or academically trained analyst’s tool than an enterprise analytics product. Hadoop itself is controlled through programming code rather than anything that looks like it was designed for business unit personnel. Hadoop data is often more “raw” and “wild” than data typically fed to data warehouse and OLAP (Online Analytical Processing) systems. This is where Microsoft and I see opportunity. Essentially: wouldn't it be cool if mere mortals could use this stuff and consume insights coming directly from Hadoop?
Microsoft HDInsight enables you to gain insight from virtually any data, connect with the world of data, improve decision making, and enhance the development of the next generation of products and services.
• Nearly everyone in your organization can analyze data and make more informed decisions with the right tools.
• PowerPivot for Microsoft Excel and Power View for SharePoint give nearly all users a view into structured and unstructured data.
• With the Hive Add-in for Excel and the Hive ODBC Driver, almost anyone in your organization can directly access Hadoop data from end-user tools.
• Hadoop simplifies programming for developers with JavaScript for MapReduce jobs. The JavaScript implementation can also reduce your code by up to 10 times compared to Java.
• Front End: Security/Auth and scaled-out request handler
• Partition Layer: Object Layer, mapping of objects such as Tables, Blobs, Queues to streams (cached in Front End), CC
• Stream Layer: 3-Node HA, scale-out stream store