Tuesday, April 19, 2016

Summary of Chennai Azure Global Bootcamp 2016

The Chennai Global Azure Bootcamp 2016, held on April 16th, 2016, went really well, with plenty of technical sessions and hands-on labs. We received more than 1000 registrations from people with a variety of backgrounds: practicing professionals, a few from an IT Pro background, and lots of students aspiring to become cloud professionals soon.

The event started at 10.00 am with a keynote delivered by Balaji Uppili, Chief Delivery Officer, GAVS. He gave a lightning talk on the current cloud landscape and how Azure is leading the game, and also touched briefly on how developers must equip themselves to stay relevant in an ever-changing IT industry. Soon after the keynote, the presenters started offering sessions in two tracks: 1. Developer and 2. IT Pro.

We had 10 fantastic speakers from Chennai and Bangalore who delivered sessions on various topics, including Azure App Service, open source on Azure, and big data. I presented a session on Azure Data Lake Analytics and Data Lake Store, services which are currently in preview; the attendees were able to recognize the value of these services and how developers can leverage them to exploit big data.

Event Highlights


  • Total registrations for the event: 1000+
  • Attendees who joined the sessions: 450
  • Dev/IT Pros: 75%
  • Student partners: 25%
  • Total technical tracks: 10
  • Hands-on labs conducted: 1

Tracks



09:00 - 09:30: Reception and registration
09:30 - 10:00: Chennai Global Azure Bootcamp keynote by Balaji

Dev Track / IT Pro Track
10:15 - 11:00: Building Mobile Apps with Visual Studio and Azure Services / Power of Open Source on Azure (Ruby on Rails)
11:00 - 11:15: Café
11:30 - 12:15: Deep into Azure Machine Learning & Predictive Analytics / Running Linux Workloads on Azure
12:30 - 13:15: DevOps in Azure / Kick Off Your Project for Success!
13:15 - 14:00: Lunch
14:00 - 14:45: Deep Dive into Azure SQL / Introduction to Data Lake Analytics & Store
15:00 - 15:45: IoT on Azure (Part 1) / Azure AD with Office 365
15:45 - 16:00: Café
16:15 - 17:00: IoT on Azure (Part 2) / Azure Hands-on Labs
17:00 - 17:30: Azure Global Bootcamp 2016 closing


Local Sponsors

GAVS Technologies and 8KMiles Software Services were the key local sponsors who helped us execute such a large event in Chennai. In fact, this is one of the largest community-driven Azure events conducted in the city in the recent past. I'm very thankful to all the sponsors for helping us run the event.



Conclusion

Overall, the event went really well, and a lot of great content was delivered by our awesome expert and MVP speakers. Thanks to all the presenters as well as all the attendees! Without you, there wouldn't have been an event. Also, special thanks go to the Global Azure Bootcamp team for organizing the global event and putting together the prizes from all the global sponsors.

I had a great time presenting and helping people with the hands-on labs, launching Windows and Linux VMs towards the end of the day. It was a great learning and fun experience. I'm planning to help coordinate the Chennai Global Azure Bootcamp again next year, God willing.

Until next year, adios amigos!

P.S. Please feel free to contact me with any questions about Azure or general feedback on the event. You can post them in the comments on this post, reach me via Twitter @ilyas_tweets, or drop me an email at ilyas@bornoncloud.com.

Thanks!

Thursday, April 14, 2016

Chennai Azure Global Bootcamp 2016 – By Chennai Azure User Group

Azure Global Bootcamp, as the name suggests, is a global event organized by the community at various places all around the globe. If you have a few minutes to learn the history of this event, I recommend you head over to http://global.azurebootcamp.net/. The motto of the event is to help developers and IT Pros get started with the Azure cloud and learn the various services offered by Microsoft Azure. The event happens in many cities, and this year I grabbed the opportunity to host it and present a session on Azure Data Lake Analytics and Data Lake Store.

Chennai, compared to the rest of the cities in India, is somewhat offbeat when it comes to technical events and communities. That doesn't mean Chennaiites are not enthusiastic about technology or programming, just that the activity is lower compared to other Indian cities.

The Azure Bootcamp has been conducted in Chennai ever since the event started in 2013, but for various reasons it has been held at universities and colleges, mainly because of the overhead involved in arranging venues, food and the other expenses of running a community event that has no source of funding!

For 2016, I took it as a challenge and wanted to conduct the event targeting mainly the IT community rather than the student community. It's not that we don't want students to participate, but the content demanded IT professionals more than students. However, we did receive reservations from students of various universities and colleges who wanted to participate and learn Azure.

For the 2016 Chennai edition we received more than 1000 registrations, thanks to the work put into marketing the event and reaching out to all my connects through various social media platforms. I wanted to make it a larger event with multiple tracks on different topics to cater to various kinds of audiences, including developers, IT Pros, big data enthusiasts, architects and mid-level managers.


As it was a full-day event with multiple sessions and such a crowd, I wanted a venue large enough to accommodate that size, but unfortunately hiring an auditorium for such a large event was outside our budget. So I reached out to many IT and ITES service partners in and around Chennai, especially along the IT Expressway where the majority of the IT community is located. Many partners came forward, but some of them had smaller venues or were far from the location we were looking for.

At last, GAVS Technologies, a leading IT & ITES provider located at Sholinganallur, OMR, Chennai, sponsored the venue, food and other key assistance for us to host the bootcamp at their premises. Similarly, the audio/video, mic and speaker arrangements were sponsored by 8KMiles, my current employer; thanks to my boss Mr. Harish Ganesh, the CTO of the company.

Getting speakers for a community event like this is always challenging; however, the Chennai Microsoft user group and the right connects within the community helped me find experts in various areas to offer sessions at the Azure bootcamp!




We are very excited to conduct and host the sessions and are looking forward to seeing you there!



Tuesday, April 12, 2016

Sorting Operations in U-SQL

Just like in SQL, we can sort rowsets using the ORDER BY operator in U-SQL; the ASC and DESC keywords control whether the sort is ascending or descending, respectively. In a U-SQL SELECT statement you can order the output rows only if you also specify how many rows to fetch back: failing to use the FETCH FIRST clause will result in an error. The reason for this behavior is that, when running U-SQL scripts in the distributed Azure Data Lake Analytics environment, sorting is a very expensive operation, so the ability to order rows has been deliberately restricted to the cases shown below.
DECLARE @Out_File string = "/Samples/Output.tsv";

@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// List the sessions in increasing order of Duration
@output = 
    SELECT * 
    FROM @searchlog
    ORDER BY Duration ASC
    FETCH FIRST 3 ROWS;

OUTPUT @output
    TO @Out_File
    USING Outputters.Tsv();

Alternatively, you can specify the ordering in the OUTPUT statement, which does not require a FETCH FIRST clause.
@output = 
    SELECT * 
    FROM @searchlog;

OUTPUT @output 
    TO @"/Samples/Output/SearchLog_output.tsv"
    ORDER BY Duration ASC
    USING Outputters.Tsv();

Declaring Parameters in U-SQL

Declaring re-usable parameters is always handy when writing lengthy U-SQL scripts. U-SQL supports the DECLARE statement for this purpose: you can use DECLARE to store values of the native U-SQL data types.
DECLARE @Out_File string = "/Output/sample.txt";

@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT 
        Region,  
         (Duration>300 ? "long" :"short") AS DwellType
    FROM @searchlog;

OUTPUT @output
    TO @Out_File
    USING Outputters.Tsv();
U-SQL Data Types

DECLARE can be used with all the U-SQL native data types.

String types
DECLARE @text1 string = "Hello World";
DECLARE @text2 string = @"Hello World";
DECLARE @text3 char = 'a';
You can use other variable declarations in a DECLARE statement to combine strings.
DECLARE @text4 string = "BEGIN" + @text1 + "END";
DECLARE @text5 string = string.Format("BEGIN{0}END", @text1);
Numeric signed
DECLARE @numeric1 sbyte = 0;
DECLARE @numeric2 short = 1;
DECLARE @numeric3 int = 2;
DECLARE @numeric4 long = 3L;
DECLARE @numeric5 float = 4.0f;
DECLARE @numeric6 double = 5.0;
Numeric unsigned
DECLARE @unumeric1 byte = 0;
DECLARE @unumeric2 ushort = 1;
DECLARE @unumeric3 uint = 2;
DECLARE @unumeric4 ulong = 3L;
Miscellaneous
DECLARE @misc1 bool = true;
DECLARE @misc2 Guid = System.Guid.Parse("BEF7A4E8-F583-4804-9711-7E608215EBA6");
DECLARE @misc4 byte [] = new byte[] { 0, 1, 2, 3, 4};
DateTime
DECLARE @d1 DateTime = System.DateTime.Parse("1979/03/31");
DECLARE @d2 DateTime = DateTime.Now;
Using a user-defined function in DECLARE is currently not supported. For example, the statement below will result in an error:
DECLARE myName string = MyHelper.GetMyName();
Similarly, values cannot be assigned from a rowset. For example, the statement below will also result in an error:
DECLARE @maxval int = SELECT MAX(value) FROM data;
These statements are not supported by U-SQL; an alternative is to produce a single-row, single-column rowset and then CROSS JOIN that tiny rowset with the other rowset to get what you need.
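As a minimal sketch of that workaround (reusing the hypothetical @data rowset and value column from the unsupported example above):

// Produce a single-row, single-column rowset holding the aggregate.
@maxval =
    SELECT MAX(value) AS MaxValue
    FROM @data;

// CROSS JOIN the one-row rowset with the original rowset so every row can reference MaxValue.
@result =
    SELECT d.value,
           m.MaxValue
    FROM @data AS d
         CROSS JOIN @maxval AS m;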

Introduction to Azure API Management

APIs have been around for many decades and are an integral part of many applications, including desktop, web and mobile solutions. Recently, the momentum behind APIs has increased tremendously: businesses want to integrate their applications with their vendors and partners, web companies want to enable newer business models, and governments want to expose their data to citizen developers so they can build new insights and solutions for the betterment of society. A secondary reason for the enormous adoption of API applications is technical evolution such as serverless application development, microservices, containerization and so on.

API applications are just like any other web applications, only without a user interface, and the requirements of an API solution are very similar to those of web applications: authentication and authorization, discoverability, ease of access, logging, monitoring and cross-platform support. Any typical solution has two key parts: 1. functional components and 2. non-functional components. Functional components are the core of the problem you are solving; for example, if you are a developer at a pharma company, you might build an API app that exposes the list of medicines your company has developed. Examples of non-functional components would be an authentication module that secures your API app against malicious users or systems, or analytics for the API.

Businesses have to make significant investments in building the non-functional components required for a successful API strategy. Similarly, if an enterprise built internal APIs long ago using legacy technologies and now wants to expose those APIs to the external world, doing so would be extremely difficult. So Microsoft has come up with an offering that bridges the gap between API developers and consumers.

Building Blocks of Azure API Management


Azure API Management provides powerful capabilities for businesses to offload the complex management and delivery of APIs. A business can bring in APIs built in any language and hosted anywhere, and let API Management take care of the complete lifecycle of the API. Below are the key benefits of using Azure API Management.


Benefits of the API Gateway


  • Accepts API requests from clients and routes them to APIs hosted on Azure or anywhere else, including on-premises
  • Validates requests for security
  • Applies quotas and rate limits per client or customer
  • Converts the API response to the desired format; for example, if your API outputs XML and the desired format is JSON, Azure API Management does the conversion on the fly
  • Caches responses
  • Logging and monitoring

 Publisher Portal


  • Define or import API schemas.
  • Create packages or subscriptions by bundling one or more APIs.
  • Set up configuration policies, such as restricting the number of API requests from one customer in a given time window.
  • Get insights from analytics.
  • User management.

 Developer Portal

  • Automatic generation of API documentation and a developer website using Swagger/Swashbuckle
  • Web console for trying APIs on the fly
  • Customer onboarding and security token issuance
  • Access to analytics on customer API usage

Monday, April 11, 2016

EXTRACTing data from Multiple Files in U-SQL

We've seen the EXTRACT operator used to extract data from a single file, but what if we want to extract data from multiple files, identical or not, located in the same or different folders?

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
The below example shows how to read data from multiple files from different locations.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM 
        @"/Samples/Data/SearchLog1.tsv",
        @"/Samples/Data/SearchLog2.tsv",
        @"/Samples/Data/Innerfolder/SearchLog3.tsv"
    USING Extractors.Tsv();

Important pointers to note: 

  • Notice that the name of each input file must be specified
  • If one of the specified input files does not exist, the script will fail to compile
  • Although not obvious in this example, the files don't need to share a common naming convention or location

Transforming Rows into Columns in U-SQL

The PIVOT operator in SQL helps turn the unique values of a specified column from one or more rows into multiple columns in the output. Additionally, we can perform aggregations, wherever required, on column values that are needed in the final output.

If you want similar functionality in a U-SQL query, you can use the CROSS APPLY operator to do essentially the same job.

@output = 
    SELECT 
        Region, 
        Urls
    FROM @searchlog;

@output =
    SELECT 
        Region, 
        SqlArray.Create(Urls.Split(';')) AS UrlTokens
    FROM @output;

@output = 
    SELECT 
        Region, 
        Token AS Url
    FROM  @output
    CROSS APPLY EXPLODE (UrlTokens) AS r(Token);

Cross Apply Output (Region, Url)

"en-us"    "www.nachos.com"
"en-us"    "www.wikipedia.com"
"en-gb"    "skiresorts.com"
"en-gb"    "ski-europe.com"
"en-gb"    "www.travelersdigest.com/ski_resorts.htm"
"en-gb"    "mayoclinic.com/health"
"en-gb"    "webmd.com/a-to-z-guides"
"en-gb"    "mybrokenleg.com"
"en-gb"    "wikipedia.com/Bone_fracture"
"en-ca"    "southparkstudios.com"
"en-ca"    "wikipedia.org/wiki/Sout_Park"
"en-ca"    "imdb.com/title/tt0121955"
"en-ca"    "simon.com/mall"
"en-us"    "cosmos.com"
"en-us"    "wikipedia.org/wiki/Cosmos:_A_Personal_Voyage"
"en-us"    "hulu.com/cosmos"
"en-fr"    "microsoft.com"
"en-fr"    "wikipedia.org/wiki/Microsoft"
"en-fr"    "xbox.com"
"en-us"    "www.amazon.com"
"en-us"    "reviews.cnet.com/wireless-headphones"
"en-us"    "store.apple.com"
"en-us"    "dominos.com"
"en-us"    "wikipedia.org/wiki/Domino's_Pizza"
"en-us"    "facebook.com/dominos"

Converting columns into Rows


Similarly, if we want to convert a set of columns into rows, or revert back to the previous state (refer to the situation above), we can use the ARRAY_AGG aggregate. In the example below, the rowset is taken apart by CROSS APPLY and then reconstructed via ARRAY_AGG.

@a = SELECT Region, Urls FROM @searchlog;

@b =
    SELECT 
        Region, 
        SqlArray.Create(Urls.Split(';')) AS UrlTokens
    FROM @a;
@c = 
    SELECT 
        Region, 
        Token AS Url
    FROM  @b
    CROSS APPLY EXPLODE (UrlTokens) AS r(Token);
@d =
    SELECT Region,
           string.Join(";", ARRAY_AGG(Url).ToArray()) AS Urls
    FROM @c
    GROUP BY Region;

Grouping and Aggregation in U-SQL

The GROUP BY operator collapses multiple rows into a single row based on some criteria. When using GROUP BY, the other fields in the output rowset must be aggregated into some meaningful value (or discarded if no meaningful aggregation is possible). Let's apply GROUP BY in an example.
@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// find the total Duration by Region
@output = 
    SELECT 
        Region, 
        SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region; 
Now let's look at the common use of the HAVING operator. We can use HAVING to restrict the output rowset to those rows that have aggregate values we are interested in. For example, we may want to find all the Regions where the total dwell time is above some value.
@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// find all the Regions where the total dwell time is > 200
@output = 
    SELECT 
        Region, 
        SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region
    HAVING TotalDuration > 200;
Samples for the COUNT function (counting the total number of sessions, overall and by Region):
// Count the number of total sessions.
@output =
    SELECT 
        COUNT() AS NumSessions
    FROM @searchlog;
 

@output = 
    SELECT 
        COUNT() AS NumSessions, 
        Region
    FROM @searchlog
    GROUP BY Region;
Count the number of sessions by Region and include the total, average, maximum and minimum duration for each Region.
@output = 
    SELECT 
        COUNT() AS NumSessions, 
        Region, 
        SUM(Duration) AS TotalDuration, 
        AVG(Duration) AS AvgDwellTime, 
        MAX(Duration) AS MaxDuration, 
        MIN(Duration) AS MinDuration
    FROM @searchlog
    GROUP BY Region;
Note: dealing with aggregations by data type. If the input data type is double, the output data types of the aggregate functions are as follows:
SUM(double) -> double
COUNT(double) -> long(int64)
If the input data type is numeric (long/int/short/byte, etc.):
SUM(type) -> long(int64)
COUNT(type) -> long(int64)
Tip: aggregate functions can ONLY be used in a SELECT clause, not in an EXTRACT statement.

Using the DISTINCT Operator with Aggregates

Every aggregate function can take a DISTINCT qualifier. For example:
COUNT(DISTINCT x) 
DISTINCT also works for user-defined aggregates.
MyAggregator(DISTINCT x,y,z)
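For instance, a minimal sketch (assuming the same @searchlog rowset used throughout these posts) that counts the distinct queries per Region:

// Count only the unique Query values within each Region.
@output =
    SELECT
        Region,
        COUNT(DISTINCT Query) AS NumUniqueQueries
    FROM @searchlog
    GROUP BY Region;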

Sunday, April 10, 2016

Ternary Operators in U-SQL

In C#, the ternary operator tests a condition and produces one of two values depending on the result of the comparison. In U-SQL the same thing can be achieved using the conditional operator as shown below, not with an IF statement, since IF is not supported here. Let's apply this to our query sample.
@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT 
        Region,  
         (Duration>300 ? "long" :"short") AS DwellType
    FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();
Understand the importance of using parentheses with the conditional operator: always enclose the conditional operator in parentheses because of C# precedence rules. For example, take the statement below, written without parentheses:
c1 ? true : false AND c2 ? true : false
Without parentheses it will be evaluated as shown below, which is not what we want:
c1 ? true : (false AND c2 ? true : false)
To prevent this, get into the habit of always putting parentheses around each conditional operator, as shown below:
(c1 ? true : false) AND (c2 ? true : false)
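As a minimal illustration (reusing the @searchlog rowset from the earlier examples), parenthesized conditional operators can be combined safely inside a predicate:

// Each conditional operator is wrapped in parentheses before being combined with AND.
@output = 
    SELECT 
        Region,
        (Duration > 300 ? "long" : "short") AS DwellType
    FROM @searchlog
    WHERE (Duration > 60 ? true : false) AND (Region == "en-gb" ? true : false);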

How to use C# Code behind in U-SQL

The role of a C# code-behind file is very similar to the code-behind in an ASP.NET Web Forms application. The purpose of the code-behind file in a U-SQL project is to store C# code that can be used in your U-SQL script. Every U-SQL script in Visual Studio has a code-behind file associated with it. In the case of the Script.usql file we've created so far, the code-behind looks like this.


Clicking on the code behind file will show the following code.

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace MyUSQLApp
{
}

Now let's add a simple Helpers class with a static method to the code-behind file, so that it looks like this:
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace MyUSQLApp
{
    public static class Helpers
    {
        public static Double AddDuration(int duration)
        {
            return duration + 1.0;
        }
    }
}


To use the helper method in the script, follow the code snippet below. Please note that you have to use the fully qualified name when referring to the function in the script; failing to do so will cause a run-time error.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT 
        Region,
        MyUSQLApp.Helpers.AddDuration(Duration) AS Duration2
    FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

U-SQL Data Transformations - Part2

Create new columns

In certain scenarios, we may need to create a new column as part of the data transformation. The below sample shows how we can add a new column into the rowsets.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT 
        Start,
        Region,
        Duration + 5.0 AS Duration2
    FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

Find Distinct records

The below snippet shows the application of "DISTINCT" to fetch only unique values.
@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT DISTINCT Region FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

Using the IN Operator in U-SQL

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT Region FROM @searchlog WHERE Region IN ("USA","India");

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

Numbering Rows

We can use the ROW_NUMBER windowing function to assign row numbers to the rows in a rowset.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT ROW_NUMBER() OVER (ORDER BY Start) AS RowNumber, 
           Start, 
           Region 
       FROM @searchlog;


OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

Saturday, April 9, 2016

Introduction to Azure Big Data Analytics Suite

At Build 2015, Microsoft announced Azure Data Lake Analytics, Data Lake Store, Data Factory and Data Catalog to address the analytics and big data challenges of enterprises. Many wondered why we need "yet another big data service" when there are plenty of big data technologies and products in the market, including open source and premium products. Microsoft also has HDInsight in its marketplace, a cloud-hosted Apache Hadoop distribution offered in partnership with Hortonworks.

The primary reason is the complexity and learning curve involved in getting started with the existing big data tools and technologies. Below is a partial list of the most popular Apache open source technologies (for the complete list of Apache big data open source projects, click here), each of which addresses certain parts of the greater big data problem; in addition, there are plenty of industry-specific, vertical and domain-based big data solutions from various vendors.


At first sight this is of course confusing, especially if you are coming from the Microsoft world; you might simply skip or drop the big data project.


The second major reason is that Microsoft currently doesn't have a big data solution catering to its enterprise audience; as the provider of the massively successful SQL Server, Microsoft doesn't want to be left behind on big data offerings, and this makes a lot of sense as well.

Finally, Microsoft doesn't want its Visual Studio, .NET/C#/SQL developers to look to another vendor or learn new technologies to deliver big data projects. The Azure Data Lake suite is the answer for bridging the gap between the .NET world and big data.

However, the real challenge for Microsoft was deciding between building a big data product from scratch or using existing, established open source projects. Microsoft took the latter route, embracing open source projects to power the Azure big data suite.


The Data Lake Analytics service is a new distributed analytics service built on Apache YARN that dynamically scales so you can focus on your business goals, not on distributed infrastructure as with IaaS-based solutions. Azure Data Lake Analytics natively works with data sources that live in the cloud, including Azure SQL Database, SQL Server on VMs, Azure Blobs and Azure SQL Data Warehouse, as well as data from elsewhere; similarly, Data Lake Store, built on top of WebHDFS, can ingest and receive data from any HDFS endpoint, whether it's coming from Azure or anywhere else.



Is the Azure Data Lake Query engine powered by Hadoop?

No, Microsoft didn't build the Data Lake suite on top of Hadoop MapReduce, partly because of the overhead involved in integrating it with the vast set of existing Microsoft products. Secondly, Microsoft wanted its developers to use the same SQL + C# skills even for big data projects, rather than burdening them with another language exclusively for big data. C# and SQL each score well in their own areas and have their own advantages and disadvantages when it comes to big data processing, so Microsoft intelligently combined the power of C# expressions with the elegance of SQL and formulated Unified SQL (U-SQL), an all-new query language for big data processing.


Welcome to U-SQL

U-SQL is the big data query language from Microsoft for its Data Lake suite. If you are a SQL developer or SQL DBA, you'll notice that U-SQL queries look a lot like SQL queries; many fundamental concepts and syntactic expressions will be very familiar to anyone with a background in SQL.

However, U-SQL is a unique language, and some of the expectations you might have from the SQL world do not carry over into U-SQL. For example, select is not case sensitive in SQL, but it has to be written as SELECT (in upper case) in U-SQL.

In the upcoming blog posts, I'll dig deep into the various U-SQL concepts with examples. Please come back for more on Azure big data analytics and U-SQL.

Good Luck!

U-SQL Data Transformations - Part1

In the last post, we looked at the detailed, step-by-step process of getting started with the U-SQL Hello World project. Now that we are comfortable with basic querying and rowset creation, let's look at how we can selectively filter the dataset by applying conditions.

Selection and Filtering

The SELECT clause, as in SQL, allows you to pick the columns of interest.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = SELECT Start, Region, Duration
    FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();


Notice that we create another rowset called @output by selecting only the Start, Region and Duration columns from @searchlog. Similarly, we can create as many filtered rowsets as we want from other rowsets.

Now let's apply a WHERE clause to further filter the @output rowset, just as we do in SQL. Logical operators such as AND and OR are supported.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "en-gb";

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();



Note: Notice the use of "==" in the example above instead of "=". This is because expressions in the SELECT statement are C# expressions.

Refining RowSets

A rowset can be created from itself – this allows you to refine a rowset one statement at a time which may be useful for debugging or simply to make your script easier to use.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog;

@output = 
    SELECT *
    FROM @output
    WHERE Region == "en-gb";

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Filter Rowsets by Dates

Because column values are regular .NET types, a rowset can also be filtered by a date range using DateTime comparisons, as shown below.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog;

@output = 
    SELECT *
    FROM @output
    WHERE  Start >= DateTime.Parse("2012/03/01") AND Start <= DateTime.Parse("2012/03/30");

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Logical Operators

The AND/OR/NOT operators can be combined with parentheses for more complex expressions

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog 
    WHERE (Duration >= 2*60 AND Duration <= 5*60) OR NOT (Region == "en-gb");


OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

In addition to AND/OR/NOT, U-SQL also supports the C#-style logical operators, which are equivalent to their AND/OR/NOT counterparts. However, || and && perform short-circuiting, which AND and OR do not.


C# operator syntax:
Logical AND with short-circuiting: (a && b && c)
Logical OR with short-circuiting: (a || b || c)
Logical NOT: !a
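As a minimal sketch (reusing the @searchlog rowset), the earlier logical-operator example can be rewritten with the C#-style operators instead:

// Same predicate as before, expressed with &&, || and ! rather than AND, OR and NOT.
@output = 
    SELECT Start, Region, Duration
    FROM @searchlog 
    WHERE (Duration >= 2*60 && Duration <= 5*60) || !(Region == "en-gb");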





Writing your first U-SQL Hello World


To get started with writing your first U-SQL query, open Visual Studio, click on New Project → U-SQL project (if you don't see the U-SQL project type, you have to install the U-SQL tools for Visual Studio; click here for instructions).

Create new U-SQL Project

Sample data and script


Sample datasets are available in the Azure Data Lake Analytics account to help us get started with ADL. To get the sample datasets, click on Explore Sample Jobs and then Copy Samples; this copies the entire set of samples to our ADL Store account. Alternatively, you can download the sample datasets and keep them in your local environment.

Copy Sample Datasets

Sample Data sets in your Azure Data Analytics Store

Now that we have the data, let's go ahead and execute the sample script below. This script is a very basic query: it essentially just copies the contents of one TSV file into another TSV file.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int,
            Urls            string,
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Sample Script in the Solution Explorer

Compile and run the script

Click on the Start button; this compiles the script and runs it against the Azure Data Lake Analytics account. After successful completion, the output file will be created in the Output folder inside the Samples folder in the Azure Data Lake Store.

Key aspects to note

Extracting data from a File

You may have noticed that there is no SELECT statement in this first U-SQL script. That's because the input data has no pre-defined schema, so the EXTRACT statement is used to read data from files. Once the data is extracted and created as a rowset (i.e. schematized), we can apply SELECT on top of it. When using EXTRACT you must specify both the column name and the column type.
To perform the extraction, we must use an extractor – a built-in function of Azure Data Lake that knows how to parse the file and produce rowsets. U-SQL comes with several such built-in extractors for reading data from files. The one used in this script is Extractors.Tsv ("TSV" stands for tab-separated values).

Rowsets

The rows produced by the EXTRACT statement are defined by a rowset named @searchlog. A rowset is essentially an intermediate table structure that U-SQL creates to pass data internally during script execution. A rowset can be further transformed into one or more other rowsets.

Outputters

The @searchlog rowset is written to an external output file using the OUTPUT statement. A built-in outputter called Outputters.Tsv is available to write a file in TSV format; similarly, Outputters.Csv outputs the data in CSV format.
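For example, a minimal sketch (assuming the same @searchlog rowset shown above) that writes the output as CSV instead of TSV:

// Write the rowset out in CSV format using the built-in Csv outputter.
OUTPUT @searchlog
    TO @"/Samples/Output/SearchLog_output.csv"
    USING Outputters.Csv();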

Query Execution Environment

An ADL query can be executed locally or against the Azure ADL account. The input and output data can live locally as well as in the Azure ADL account; however, when you execute the query locally it may not store the output in the defined location (a known bug) and will instead store it in the path below.

“C:\Users\UserAccount<>\AppData\Local\USQLDataRoot”

Note: To check the exact path, follow the below screenshots.

Step -1


Step 2

Key Things Learned

Always use “/” as the path Separator

Always stick to the "/" separator, as it works both for local and remote execution, unlike "\", which only works for local execution.

U-SQL Keywords are Upper-Case

U-SQL is case sensitive; for example, writing extract instead of EXTRACT will cause a compilation error, so make sure to write keywords such as EXTRACT, FROM and USING in upper case.

About Header rows

Azure ADL doesn't understand a first row containing column headers, so make sure the input data doesn't contain a header row. Similarly, the output data produced by ADL will not contain any column headers. However, this feature is expected to be added by the time ADL Analytics and Store become generally available.