One of the features I’m most excited about is the inclusion of PolyBase, which will bridge the gap between data warehouses stored in traditional databases and big data stored in Hadoop.
What is PolyBase in SQL Server?
PolyBase has been around for several years in Microsoft’s Analytics Platform System (APS) data warehouse appliance and, more recently, as part of Azure SQL Data Warehouse. Now it is built into SQL Server itself. PolyBase lets you create an external table in the database whose data is physically stored in Hadoop or Azure Blob Storage. To an end user, the table appears to be part of the database like any other table, and it can be used in Transact-SQL (T-SQL) queries and stored procedures.
The difference with PolyBase is where the data is stored and how it’s processed. The data can live in Hadoop or Azure Blob Storage, both of which are far more cost-effective storage options. When a query is executed, the database engine determines the most efficient way to process it: either locally in SQL Server, or by pushing the work down to Hadoop’s massively parallel processing.
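To make this concrete, here is a minimal sketch of what defining an external table over files in Hadoop looks like in T-SQL on versions that include the Hadoop connector. The data source name, HDFS address, file layout, and table definition below are placeholders for illustration, not a recipe for a specific environment:

```sql
-- Point SQL Server at the Hadoop cluster (address and port are placeholders).
CREATE EXTERNAL DATA SOURCE MyHadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020'
);

-- Describe how the files are laid out (comma-delimited text in this sketch).
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- The external table: metadata lives in SQL Server, data stays in HDFS.
CREATE EXTERNAL TABLE dbo.SensorReadings (
    SensorId    INT,
    ReadingTime DATETIME2,
    Reading     FLOAT
)
WITH (
    LOCATION = '/data/sensors/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = CsvFormat
);

-- From here it queries like any other table.
SELECT TOP (10) * FROM dbo.SensorReadings;
```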
Big data sources that PolyBase can connect to
- SQL Server
- Oracle
- Teradata
- MongoDB
- Hadoop
- S3-compatible object storage
- CSV files in Azure Blob Storage
PolyBase blends the capabilities of traditional databases and Hadoop like never before. Here are just a few of the powerful use cases now available to us:
Hot/cold data
In the past, we’ve been faced with the problem of what to do with the massive amounts of data stored in our warehouse tables. We often keep a lot of older data that is accessed infrequently but takes up valuable space. Many organizations aggregate this data, or archive and delete it, as a cost-saving measure. This infrequently accessed data is referred to as “cold” data, while data in high demand is considered “hot.”
A more recent approach is to remove the cold data from database storage and archive it in Hadoop. The problem with that approach is that the data is now spread across two platforms: to query the entire data set, you either have to move data around or stitch together the results of multiple queries. With PolyBase, all of the data stays accessible from the SQL Server database, and combining the two sets in a single query becomes a trivial task.
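As a sketch of how the two tiers come back together, assume dbo.Sales is a local table holding the hot rows and dbo.SalesArchive is a hypothetical PolyBase external table over the archived files in Hadoop:

```sql
-- One view that spans hot (local) and cold (Hadoop) data.
CREATE VIEW dbo.SalesAll
AS
SELECT SaleId, SaleDate, Amount FROM dbo.Sales          -- hot: local SQL Server table
UNION ALL
SELECT SaleId, SaleDate, Amount FROM dbo.SalesArchive;  -- cold: external table in Hadoop
GO

-- Queries span the full history without caring where the rows live.
SELECT YEAR(SaleDate) AS SaleYear, SUM(Amount) AS Total
FROM dbo.SalesAll
GROUP BY YEAR(SaleDate)
ORDER BY SaleYear;
```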
Use Hadoop for Extract, Transform & Load (ETL)
Fairly recent advances in SQL Server, such as columnstore indexes, have opened the door to fantastic performance against very large data sets. That performance comes after the data has been transformed and loaded into a model designed for query performance. This is great for the queries themselves, but getting the data into that model can be tedious and processing-intensive. Wouldn’t it be great to transform the data with the parallel processing power of Hadoop, yet query it from a well-defined columnar table in SQL Server?
PolyBase makes this easy: the target table can be loaded with standard T-SQL that selects from external tables sourced from Hadoop. Hadoop takes on the heavy work of transforming the data, and the results are loaded into a high-performance SQL table for analytics.
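A minimal sketch of such a load, reusing the hypothetical dbo.SensorReadings external table from earlier and a local columnstore table invented for the example:

```sql
-- Local analytics table, stored as a clustered columnstore index.
CREATE TABLE dbo.FactSensorReadings (
    SensorId    INT,
    ReadingDate DATE,
    AvgReading  FLOAT
);
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSensorReadings
    ON dbo.FactSensorReadings;

-- Plain T-SQL pulls the transformed data out of Hadoop into the columnstore table.
INSERT INTO dbo.FactSensorReadings (SensorId, ReadingDate, AvgReading)
SELECT SensorId, CAST(ReadingTime AS DATE), AVG(Reading)
FROM dbo.SensorReadings
GROUP BY SensorId, CAST(ReadingTime AS DATE);
```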
Data movement from Hadoop to SQL
There are various ways to move data from Hadoop into a SQL Server database. Some of them, such as Sqoop, become overly complicated once you start to consider file formats and scheduling. Others, such as using an ETL tool with an Open Database Connectivity (ODBC) connection to query a Hive table, can perform poorly. Moving data with PolyBase is simple and fast, and performance can be increased further by setting up a PolyBase Scale-Out Group, which allows a cluster of SQL Server instances to move data from Hadoop to SQL Server in parallel.
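In its simplest form, the move is just a query. The table and column names below are the hypothetical ones from earlier, and the FORCE EXTERNALPUSHDOWN hint only applies where pushdown computation is configured for the Hadoop data source:

```sql
-- Land a slice of the Hadoop data in a new local table in one statement.
SELECT SensorId, ReadingTime, Reading
INTO dbo.SensorReadings_Recent
FROM dbo.SensorReadings
WHERE ReadingTime >= DATEADD(DAY, -30, SYSUTCDATETIME())
OPTION (FORCE EXTERNALPUSHDOWN);  -- ask PolyBase to evaluate the filter in Hadoop
```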
Hadoop streaming
The various components and flexibility of Hadoop make it a compelling platform for consuming live streams of information, from sensor feeds to messages from an enterprise messaging solution. Those live streams can then be accessed from SQL Server simply by creating a PolyBase connection to that data.
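For instance, if a streaming job continuously lands files under the HDFS path behind the hypothetical dbo.SensorReadings table from earlier, the newest events are visible to the next T-SQL query with no reload step:

```sql
-- Files written by the streaming job appear in query results as they land.
SELECT SensorId, ReadingTime, Reading
FROM dbo.SensorReadings
WHERE ReadingTime >= DATEADD(HOUR, -1, SYSUTCDATETIME())
ORDER BY ReadingTime DESC;
```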
BI tool integration
Many organizations store massive amounts of raw data in Hadoop or Azure Blob Storage, ready for analysts to perform data analysis and discovery with BI tools. Although most new BI tools interact well with Hadoop, not all of them have good integration. In addition, analysts may be stuck on older versions of a tool or have to jump through administrative hurdles to get the latest drivers for Hadoop components installed and configured.
Finally, managing permissions in various tools across the organization can be complicated and costly. Letting users reach these important data sources with the ease of connecting to SQL Server will open new possibilities and streamline processes.
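Because an external table is just another object in the database, access control collapses into ordinary SQL Server permissions. A sketch, with hypothetical role, group, and object names:

```sql
-- Analysts connect with their usual SQL Server credentials and drivers.
CREATE ROLE analyst_reader;
GRANT SELECT ON dbo.SensorReadings TO analyst_reader;  -- external table over Hadoop
GRANT SELECT ON dbo.SalesAll TO analyst_reader;        -- view spanning hot and cold data
ALTER ROLE analyst_reader ADD MEMBER [CONTOSO\bi_analysts];
```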