Saturday, April 9, 2016

Introduction to Azure Big Data Analytics Suite

At Build 2015, Microsoft announced Azure Data Lake Analytics, Data Lake Store, Data Factory, and Data Catalog to address the analytics and big data challenges of enterprises. Many are wondering why there is “yet another big data service” when there are already plenty of big data technologies and products on the market, both open source and premium. Microsoft also has HDInsight in its marketplace, an Apache Hadoop distribution in the cloud powered by Hortonworks.

The primary reason is the complexity and learning curve involved in getting started with the existing big data tools and technologies. Below is a partial list of the most popular Apache open source technologies (for the complete list of Apache big data open source projects, click here). Each one addresses certain challenges within the larger big data problem, and on top of these there are plenty of industry-specific, vertical, and domain-based big data solutions from various vendors.

At first sight this is of course confusing, and if you are coming from the Microsoft world you might simply give up and drop the big data project.

The second major reason is that Microsoft currently has no big data solution of its own to cater to its enterprise audience. As the provider of the massively successful SQL Server, Microsoft doesn’t want to be left behind on big data offerings, and this makes a lot of sense as well.

Finally, Microsoft doesn’t want its Visual Studio, .NET, C#, and SQL developers to look to another vendor or learn new technologies to deliver big data projects. The Azure Data Lake suite is the answer for bridging the gap between the .NET world and big data.

However, the real challenge for Microsoft was deciding between building a big data product from scratch and using existing, established open source projects. In the end, Microsoft took the latter route, i.e. embracing open source projects to power the Azure big data suite.

The Data Lake Analytics service is a new distributed analytics service built on Apache YARN. It scales dynamically, so you can focus on your business goals rather than on distributed infrastructure, as you would have to with IaaS-based solutions. Azure Data Lake Analytics works natively with data sources such as Azure SQL, SQL Server on a VM, Azure Blobs, and Azure SQL Data Warehouse, whether the data lives in the cloud or elsewhere. Similarly, Data Lake Store, which exposes a WebHDFS-compatible interface, can ingest data from any HDFS endpoint, whether it’s coming from Azure or anywhere else.

Is the Azure Data Lake Query engine powered by Hadoop?

No, Microsoft didn’t build the Data Lake suite on top of Hadoop MapReduce, partly because of the overhead involved in integrating it with the vast set of existing Microsoft products. Secondly, Microsoft wanted its developers to use the same SQL and C# even for big data projects, rather than burden them with another language exclusively for big data. Both C# and SQL score well in their own areas, and each has advantages and disadvantages when it comes to big data processing. So Microsoft combined the power of C# expressions with the elegance of SQL and formulated Unified SQL (U-SQL), an all-new query language for big data processing.

Welcome to U-SQL

U-SQL is the big data query language from Microsoft for its Data Lake suite. If you are a SQL developer or a SQL DBA, you’ll notice that U-SQL queries look a lot like SQL queries. Many fundamental concepts and syntactic expressions will be very familiar to those with a background in SQL.

However, U-SQL is a language of its own, and some of the expectations you might have from the SQL world do not carry over into U-SQL. For example, keywords such as SELECT are not case sensitive in SQL, but in U-SQL they have to be written in uppercase.
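
To give a first taste of this, here is a minimal sketch of what a U-SQL script can look like. The input path, output path, and column names are assumptions made up for illustration, not a real dataset, so adjust them to your own Data Lake Store account. Notice the uppercase keywords and the C# parts: the C# types in the schema, the == comparison, and the ToUpper() string method.

```
// Assumed input: a tab-separated file with three columns (hypothetical path and schema).
@searchlog =
    EXTRACT UserId int,
            Region string,
            Query string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// Keywords are uppercase; the expressions inside the query are C#.
@result =
    SELECT UserId,
           Region.ToUpper() AS RegionUpper,
           Query
    FROM @searchlog
    WHERE Region == "en-gb";

// Assumed output location (hypothetical).
OUTPUT @result
TO "/output/result.csv"
USING Outputters.Csv();
```

If you write select or from in lowercase in a script like this, the U-SQL compiler should reject it with a syntax error instead of treating it as a keyword.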

In the upcoming blog posts, I’ll dig deep into the various U-SQL concepts with examples. Please come back for more on Azure Big Data Analytics and U-SQL.

Good Luck!