Sunday, April 10, 2016

U-SQL Data Transformations - Part2

Create new columns

In certain scenarios, we may need to create a new column as part of the data transformation. The below sample shows how we can add a new column into the rowsets.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT 
        Start,
        Region,
        Duration + 5.0 AS Duration2
    FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

Find Distinct records

The below snippet shows the application of "DISTINCT" to fetch only unique values.
@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT DISTINCT Region FROM @searchlog;

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

using IN Operator in U-SQL

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT Region FROM @searchlog where Region in ("USA","India");

OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv(); 

Numbering Rows

We can use use ROW_NUMBER windowing function aggregate to assign row numbers to the dataset.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();


@output = 
    SELECT ROW_NUMBER() OVER ( ) AS RowNumber, 
           Start, 
           Region 
       FROM @searchlog
       ORDER BY Start;


OUTPUT @output
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();