Tuesday, April 12, 2016

Sorting Operations in U-SQL

Just like SQL we can sort rowsets by using the ORDER BY operator in U-SQL by specifying the ASC and DESC keyword controls whether the sort is ascending or descending, respectively. In the below U-SQL SELECT * FROM statement you can order the output rows IF and ONLY you also specify how many rows you want to fetch back, failing to using FETCH First clause will result in error. The reason for this behavior is, when running U-SQL scripts in the Azure Datalake analytics distributed environment, performing any sorting will be a very expensive operation. It is for this reason that the ability to order rows has been deliberately restricted to the cases shown above.
DECLARE Out_File string = "/Smples/Output.tsv";

@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// List the sessions in increasing order of Duration
@output = 
    SELECT * 
    FROM @searchlog
    ORDER BY Duration ASC
    FETCH FIRST 3 ROWS;

OUTPUT @output
    TO @Out_File
    USING Outputters.Tsv();

Alternatively, you can also specify ordering when using an OUTPUT statement which does not require a FETCH FIRST.
@output = 
    SELECT * 
    FROM @searchlog;

OUTPUT @output 
    TO @"/Samples/Output/SearchLog_output.tsv"
    ORDER BY Duration ASC
    USING Outputters.Tsv();