HDInsight Server
HDInsight is based on the Hortonworks Data Platform, a 100% open source distribution of Apache™ Hadoop™. HDInsight provides a software framework designed to manage, analyze, and report on data. The Apache Hadoop core provides reliable data storage through the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model for processing and analyzing, in parallel, the data stored in this distributed system. To simplify configuring, managing, and running MapReduce jobs, HDInsight provides interactive, web-based consoles that can be used to perform tasks using JavaScript or HiveQL.
The HDInsight Server Developer Preview makes Apache Hadoop available as a service on Windows Server, and provides a streamlined deployment and configuration process. The HDInsight Service on Windows Azure provides Hadoop as a scalable, on-demand service as part of the Windows Azure Platform.
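To make the MapReduce programming model mentioned above more concrete, the following is a minimal, single-process sketch of the classic word-count job in Python. It is an illustration only: it does not use Hadoop or HDInsight, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example. In a real cluster, the framework runs many mappers and reducers in parallel and performs the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Hadoop stores big data"]
    print(word_count(sample))
```

Because each mapper call depends only on its own input line, and each reducer call only on the values for its own key, both phases can be distributed across many nodes, which is the property the model relies on for parallelism.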
This section contains links to the download page, and to the official documentation.
HDInsight Server
The HDInsight Server is currently available as a Developer Preview. It provides only a single-node deployment, and can be used as a local development environment for the Windows Azure HDInsight Service.
Getting Started with the HDInsight Server Developer Preview
Windows Azure HDInsight Service Documentation
The Windows Azure HDInsight Service is available as a preview feature of the Windows Azure Platform.
Windows Azure Services: HDInsight
Hadoop for Windows (Hortonworks)
The new Hortonworks Data Platform (HDP) for Windows extends the availability of Apache Hadoop to the Windows operating system, and allows the creation of multi-node Hadoop deployments.
This section contains links to other resources for planning and operating Hadoop deployments using HDInsight.
The Microsoft IT Big Data Program is the result of collaboration between multiple enterprise services organizations working in partnership with Microsoft’s HDInsight product development team. This group has been tasked with assessing the usefulness of Hadoop for corporate and client applications, and with providing a reliable, cost-effective platform for future Big Data solutions within Microsoft. To this end, the group has extensively researched the Hadoop offerings from Apache, and has implemented Hadoop production and test environments of all sizes, using the HDInsight platform for Windows Azure and for on-premises Windows.
The Microsoft IT Big Data Program is publishing its research to the community at large so that others can benefit from this experience and make more efficient use of Hadoop for data-intensive applications.
Compression in Hadoop
When using Hadoop, there are many challenges in dealing with large data sets. The goal of this document is to describe compression methods that you can use to optimize your Hadoop jobs, and reduce bottlenecks associated with moving and processing large data sets. In this paper, we describe the problem of data volumes in different phases of a Hadoop job, and explain why compression can mitigate performance problems in large Hadoop jobs. We review the compression tools and techniques that are available, and present a summary of our tests of each tool.
Downloadable white paper: Compression in Hadoop
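As a rough illustration of why compression reduces the volume of data that a job has to move, the sketch below compares the compressed size of a repetitive, log-like sample under three codecs from the Python standard library. This is an assumption-laden example: the codecs (`zlib`, `bz2`, `lzma`) were chosen because they ship with Python, not because they are the tools evaluated in the white paper, and the helper name `compression_ratios` and the sample data are hypothetical. Real ratios depend heavily on the data being compressed.

```python
import bz2
import lzma
import zlib


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size for each codec."""
    codecs = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(compress(data)) / len(data) for name, compress in codecs.items()}


if __name__ == "__main__":
    # Hypothetical, highly repetitive log-like input; real data compresses less well.
    sample = b"2013-01-01 INFO request served in 12 ms\n" * 10_000
    for name, ratio in compression_ratios(sample).items():
        print(f"{name}: {ratio:.4f} of original size")
```

A smaller ratio means fewer bytes to write to disk or send over the network, at the cost of extra CPU time to compress and decompress.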
Performance of Hadoop on Windows Hyper-V
The goal of this paper is to answer the question of whether Hadoop clusters on virtual machines hosted in Microsoft Hyper-V can be as efficient as physical clusters. The creation of such virtualized “private clouds” might answer the Big Data needs of applications that cannot be hosted in public cloud services. This paper presents the results of internal benchmarks by Microsoft IT that compared the performance of Hadoop jobs on comparable physical and virtual clusters.
Downloadable white paper: Performance of Hadoop on Windows Hyper-V
Hadoop Job Optimization
Understanding how to analyze, fix, and fine-tune the performance of jobs is an extremely important skill for Hadoop developers. This paper provides a set of tuning techniques gathered or developed by the SES Big Data team in Microsoft IT. The paper describes the principal bottlenecks that occur in Hadoop jobs, and presents a selection of techniques for resolving each issue.
The paper provides specific suggestions for configuring Hadoop jobs to mitigate performance problems on different workloads, but also explains the interaction of disk I/O, CPU, RAM and other resources, and demonstrates why efforts to tune performance should adopt a balanced approach. It includes the results of experiments with performance tuning by MSIT, which resulted in significant differences in the speed of the same MapReduce job before and after tuning.
Downloadable white paper: Hadoop Job Optimization
HDInsight team blog
See the HDInsight development team blogs for the latest information about releases, tips, and features.
Getting started with HDInsight
Other Microsoft blogs on Big Data
Denny Lee is a Program Manager at Microsoft who works with Big Data and BI applications.
Carl Nolan is a Microsoft consultant in the United Kingdom who focuses on enterprise applications using C# and SQL Server, with interests in F# and MapReduce.
Cindy Gross is a member of the SQL Server Customer Advisory team who has made Hadoop her hobby.
Avkash Chauhan is a support escalation engineer for Microsoft Azure who frequently blogs about Big Data.
Avkash Chauhan blog: Azure, Big Data and Hadoop, All Together
The Microsoft Patterns and Practices Group has published a guide containing best practices for Big Data solutions. If you are new to Big Data, we recommend that you review this guide first to get an idea of where and how Big Data fits in your business plans. The guide includes these topics:
Patterns: Data loading patterns, query patterns, common methods for analysis and visualization
Sample applications: Walkthrough of sample applications for Twitter analysis and BI integration
Azure guidance: Broad guidance and technical advice on working with HDInsight in Azure