Monday, April 11, 2016

Grouping and Aggregation in U-SQL

Group By function helps in collapsing multiple rows into single rows based on a criteria. By using Group By operation, some fields in the output rowset must be aggregated into some meaningful value (or discarded if no possible or meaningful aggregation can be done). Let’s apply the group by operation in an example
@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// find the total Duration by Region
@output = 
    SELECT 
        Region, 
        SUM(Duration) AS TotalDuration
    FROM searchlog
    GROUP BY Region; 
Now lest look at the common use of HAVING operator. We can use HAVING to restrict the output rowset to those rows that have aggregate values we are interested in. For example, we want to find all the Regions where total dwell time is above some value.
@searchlog = 
EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// find all the Regions where the total dwell time is > 200
@output = 
    SELECT 
        Region, 
        SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region
    HAVING TotalDuration > 200;
Sample for Count Function (Count the number of total sessions by Region)
// Count the number of total sessions.
@output =
    SELECT 
        COUNT() AS NumSessions
    FROM @searchlog;
 

@output = 
    SELECT 
        COUNT() AS NumSessions, 
        Region
    FROM @searchlog
    GROUP BY Region;
Count the number of total sessions by Region and include total duration for that language.
@output = 
    SELECT 
COUNT() AS NumSessions, 
Region, 
SUM(Duration) AS TotalDuration, 
AVG(Duration) AS AvgDwellTtime, 
MAX(Duration) AS MaxDuration, 
MIN(Duration) AS MinDuration
    FROM @searchlog
    GROUP BY Region;
Note: Dealing with Aggregations by Data Types If the input data type is double, the output datatype while applying the Aggregate functions will be as follows
SUM(double) -> double
COUNT(double) -> long(int64)
If the input data type is numeric (long/int/short/byte, etc.):
SUM(type) -> long(int64)
COUNT(type) -> long(int64)
Tip: Aggregate functions can ONLY be used in a SELECT clause and not in Extract statement. Usage of DISTINCT Operator with Aggregates Every aggregate function can take a DISTINCT qualifier. For example
COUNT(DISTINCT x) 
DISTINCT also works for user-defined aggregates.
MyAggregator(DISTINCT x,y,z)