Saturday, April 9, 2016

U-SQL Data Transformations - Part1

In the last post, we looked at the detailed step by step process in getting started with the U-SQL Hello world project. Now that we are comfortable in the basic querying and rowset creation process, lets look how we can selectively filter the dataset by applying conditions.

Selection and Filtering

The SELECT clause, as in SQL, allows you to pick the columns of interest.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = SELECT Start, Region, Duration
    FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();


If you notice the line no 12, we are creating another rowset called @output by selecting only “Start,Region,Duration” columns from the @Searchlog. Similarly we can create as many filtered rowsets as we want from other Rowsets

Now let’s apply WHERE and HAVING Clauses to further filter @Output rowset as we do in SQL. Logical operators such as AND and OR are supported.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "en-gb";

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();



Note: Notice the use of "==" in the example above instead of "=". This is because expressions in the SELECT statement are C# expressions.

Refining RowSets

A rowset can be created from itself – this allows you to refine a rowset one statement at a time which may be useful for debugging or simply to make your script easier to use.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog;

@output = 
    SELECT *
    FROM @output
    WHERE Region == "en-gb";

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Filter Rowsets by Dates

A rowset can be created from itself – this allows you to refine a rowset one statement at a time which may be useful for debugging or simply to make your script easier to use.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog;

@output = 
    SELECT *
    FROM @output
    WHERE  Start >= DateTime.Parse("2012/03/01") AND Start <= DateTime.Parse("2012/03/30");

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Logical Operators

The AND/OR/NOT operators can be combined with parentheses for more complex expressions

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT Start, Region, Duration
    FROM @searchlog 
    WHERE (Duration >= 2*60 AND Duration <= 5*60) OR NOT (Region == "en-gb");


OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

In addition to AND/OR/NOT, U-SQL also supports C#-style logical operators. These are equivalent to their AND/OR/NOT. However, || and && perform short-circuiting which AND/OR do not.


C#
Logical AND with short-circuiting
(a && b && c)
Logical OR with short-circuiting
(a || b || c)
Logical NOT
!a