One of the features I’m most excited about is the inclusion of PolyBase, which will bridge the gap between data warehouses stored in traditional databases and big data stored in Hadoop.
What is PolyBase in SQL Server?
PolyBase has been around for several years in Microsoft’s Analytics Platform System (APS) data warehouse appliance and, more recently, as part of Azure SQL Data Warehouse. Now it is built into SQL Server itself. PolyBase lets you create an external table in the database whose data is physically stored in Hadoop or Azure Blob Storage. To an end user, the table appears to be part of the database like any other table, and it can be used in Transact-SQL (T-SQL) queries and stored procedures.
The difference with PolyBase is where the data is stored and how it’s processed. The data can live in Hadoop or Azure Blob Storage, both of which are far more cost-effective storage options. When a query is executed, the database engine determines the most efficient way to process it: either locally in SQL Server, or by pushing the work down to Hadoop’s massively parallel processing.
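To make this concrete, here is a minimal sketch of what defining an external table over files in Hadoop looks like in T-SQL on versions that include the Hadoop connector. The data source name, HDFS address, file layout, and table definition below are placeholders for illustration, not a recipe for a specific environment:

```sql
-- Point SQL Server at the Hadoop cluster (address and port are placeholders).
CREATE EXTERNAL DATA SOURCE MyHadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020'
);

-- Describe how the files are laid out (comma-delimited text in this sketch).
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- The external table: metadata lives in SQL Server, data stays in HDFS.
CREATE EXTERNAL TABLE dbo.SensorReadings (
    SensorId    INT,
    ReadingTime DATETIME2,
    Reading     FLOAT
)
WITH (
    LOCATION = '/data/sensors/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = CsvFormat
);

-- From here it queries like any other table.
SELECT TOP (10) * FROM dbo.SensorReadings;
```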
Big data sources that PolyBase can connect to
- SQL Server
- Oracle
- Teradata
- MongoDB
- Hadoop
- S3-compatible object storage
- CSV files in Azure Blob Storage
PolyBase blends the capabilities of traditional databases and Hadoop like never before. Here are just a few of the powerful use cases now available to us:
Hot/cold data
In the past, we’ve been faced with the problem of what to do with the massive amounts of data stored in our warehouse tables. We often keep a lot of older data that is accessed infrequently but takes up valuable space. Many organizations aggregate this data, or archive and delete it, as a cost-saving measure. This infrequently accessed data is referred to as “cold” data, while data in high demand is considered “hot.”
A more recent approach is to remove the cold data from database storage and archive it in Hadoop. The problem with that approach is that the data is now spread across two platforms: to query the entire data set, you either have to move data around or stitch together the results of multiple queries. With PolyBase, all of the data stays accessible from the SQL Server database, and combining the two sets in a single query becomes a trivial task.
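As a sketch of how the two tiers come back together, assume dbo.Sales is a local table holding the hot rows and dbo.SalesArchive is a hypothetical PolyBase external table over the archived files in Hadoop:

```sql
-- One view that spans hot (local) and cold (Hadoop) data.
CREATE VIEW dbo.SalesAll
AS
SELECT SaleId, SaleDate, Amount FROM dbo.Sales          -- hot: local SQL Server table
UNION ALL
SELECT SaleId, SaleDate, Amount FROM dbo.SalesArchive;  -- cold: external table in Hadoop
GO

-- Queries span the full history without caring where the rows live.
SELECT YEAR(SaleDate) AS SaleYear, SUM(Amount) AS Total
FROM dbo.SalesAll
GROUP BY YEAR(SaleDate)
ORDER BY SaleYear;
```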
Use Hadoop for Extract, Transform & Load (ETL)
Fairly recent advances in SQL Server, such as columnstore indexes, have opened the door to fantastic performance against very large data sets. That performance comes after the data has been transformed and loaded into a model designed for query performance. This is great for the queries themselves, but getting the data into that model can be tedious and processing-intensive. Wouldn’t it be great to transform the data with the parallel processing power of Hadoop, yet query it from a well-defined columnar table in SQL Server?
PolyBase makes this easy: the target table can be loaded with standard T-SQL that selects from external tables sourced from Hadoop. Hadoop takes on the heavy work of transforming the data, and the results are loaded into a high-performance SQL table for analytics.
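A minimal sketch of such a load, reusing the hypothetical dbo.SensorReadings external table from earlier and a local columnstore table invented for the example:

```sql
-- Local analytics table, stored as a clustered columnstore index.
CREATE TABLE dbo.FactSensorReadings (
    SensorId    INT,
    ReadingDate DATE,
    AvgReading  FLOAT
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSensorReadings
    ON dbo.FactSensorReadings;

-- Plain T-SQL pulls the transformed data out of Hadoop into the columnstore table.
INSERT INTO dbo.FactSensorReadings (SensorId, ReadingDate, AvgReading)
SELECT SensorId, CAST(ReadingTime AS DATE), AVG(Reading)
FROM dbo.SensorReadings
GROUP BY SensorId, CAST(ReadingTime AS DATE);
```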
Data movement from Hadoop to SQL
There are various ways to move data from Hadoop into a SQL Server database. Some of them, such as Sqoop, become overly complicated once you start to consider file formats and scheduling. Others, such as using an ETL tool with an Open Database Connectivity (ODBC) connection to query a Hive table, can perform poorly. Moving data with PolyBase is simple and fast, and performance can be increased further by setting up a PolyBase Scale-Out Group, which allows a cluster of SQL Server instances to move data from Hadoop to SQL Server in parallel.
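In its simplest form, the move is just a query. The table and column names below are the hypothetical ones from earlier, and the FORCE EXTERNALPUSHDOWN hint only applies where pushdown computation is configured for the Hadoop data source:

```sql
-- Land a slice of the Hadoop data in a new local table in one statement.
SELECT SensorId, ReadingTime, Reading
INTO dbo.SensorReadings_Recent
FROM dbo.SensorReadings
WHERE ReadingTime >= DATEADD(DAY, -30, SYSUTCDATETIME())
OPTION (FORCE EXTERNALPUSHDOWN);  -- ask PolyBase to evaluate the filter in Hadoop
```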
Hadoop streaming
The various components and flexibility of Hadoop make it a compelling platform for consuming live streams of information, from sensor feeds to messages from an enterprise messaging solution. Those live streams can then be accessed from SQL Server simply by creating a PolyBase connection to that data.
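For instance, if a streaming job continuously lands files under the HDFS path behind the hypothetical dbo.SensorReadings table from earlier, the newest events are visible to the next T-SQL query with no reload step:

```sql
-- Files written by the streaming job appear in query results as they land.
SELECT SensorId, ReadingTime, Reading
FROM dbo.SensorReadings
WHERE ReadingTime >= DATEADD(HOUR, -1, SYSUTCDATETIME())
ORDER BY ReadingTime DESC;
```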
BI tool integration
Many organizations store massive amounts of raw data in Hadoop or Azure Blob Storage, ready for analysts to perform data analysis and discovery with BI tools. Although most new BI tools interact well with Hadoop, not all of them have good integration. In addition, analysts may be stuck on older versions of a tool or have to jump through administrative hurdles to get the latest drivers for Hadoop components installed and configured.
Finally, managing permissions in various tools across the organization can be complicated and costly. Letting users reach these important data sources with the ease of connecting to SQL Server will open new possibilities and streamline processes.
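Because an external table is just another object in the database, access control collapses into ordinary SQL Server permissions. A sketch, with hypothetical role, group, and object names:

```sql
-- Analysts connect with their usual SQL Server credentials and drivers.
CREATE ROLE analyst_reader;
GRANT SELECT ON dbo.SensorReadings TO analyst_reader;  -- external table over Hadoop
GRANT SELECT ON dbo.SalesAll TO analyst_reader;        -- view spanning hot and cold data
ALTER ROLE analyst_reader ADD MEMBER [CONTOSO\bi_analysts];
```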