Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)
SQL Server Technical Article
Published: October 2012
Authors: Benjamin Guinebertière (Microsoft France), Philippe Beraud (Microsoft France), Rémi Olivier (Microsoft France)
Technical Reviewers/Contributors: Carla Sabotta (Microsoft Corporation), Steve Howard (Microsoft Corporation), Debarchan Sarkar (Microsoft India GTSC), Jennifer Hubbard (Microsoft Corporation).
Summary: With the explosion of data, the open source Apache™ Hadoop™ framework is gaining traction thanks to the huge ecosystem that has arisen around the core functionalities of the Hadoop Distributed File System (HDFS™) and Hadoop MapReduce. Today, enabling SQL Server to work with Hadoop™ is increasingly important because the two are complementary. For instance, while petabytes of data can be stored unstructured in Hadoop and take hours to query, terabytes of data can be stored in a structured way in the SQL Server platform and queried in seconds. This complementarity leads to the need to transfer data between Hadoop and SQL Server.
This white paper explores how SQL Server Integration Services (SSIS), the SQL Server Extract, Transform, and Load (ETL) tool, can be used to automate the execution of Hadoop and non-Hadoop jobs, and to manage data transfers between Hadoop and other sources and destinations.