Saturday, April 9, 2016

Writing your first U-SQL Hello World

Writing your first U-SQL Hello World

To get started with writing your first U-SQL query, open visual studio and click on new project à and click on U-SQL (If you don’t see the U-SQL project, you have to install U-SQL tools for Visual Studio, click here for instructions)

Create new U-SQL Project

Sample data and script


The sample datasets are available in the Azure Data lake analytics account for us to get started with ADL. To get the sample datasets, click on Explore Sample Jobs and click on copy samples, this process will copy the entire dataset samples to our ADL store account. Additionally you can also download the sample datasets and keep it in your local environment.

Copy Sample Datasets

Sample Data sets in your Azure Data Analytics Store

Now that we have the data, let’s go ahead and execute the below sample script. This script is a very basic query and it essentially just copies the contents of one TSV file into another TSV file.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int,
            Urls            string,
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Sample Script in the Solution Explorer

Compile and run the script

Click on start button, this will compile the script and run it against the Azure ADL account. After successful completion of the execution, the output file will be created in the output folder inside the samples folder in the Azure Data Analytics Store.

Key aspects to note

Extracting data from a File

You must have noticed that there is no SELECT statement in the first U-SQL, it’s because the input data has no pre-defined schema and hence EXTRACT statement is used to read data from files. However once the data is extracted and created as a Rowset (Schematized), we can apply SELECT on top of it. While using EXTRACT you must specify both the column name and column type.
To perform the EXTRACT function, we must use an Extractor – it’s an inbuilt function of Azure ADL that knows how to parse the file and produce rowsets. U-SQL comes with several such built-in extractors for reading data from files. The one used in this script is Extractors.Tsv. (“TSV” stands for tab-separated value).

Rowsets

The rows derived from the EXTRACT statement are defined by a RowSet named @searchlog. Rowsets – is essentially an intermediate table structure created by U-SQL to internally pass data during script execution. The rowsets could be further transformed to one or more rowsets.

Outputters

The @searchlog RowSet is written to an externa output file using the OUTPUT statement. A built-in Outputter called Outputters.Tsv is available to write a file in TSV format, similarly we also have Outputters.Csv for outputting the data into CSV format.

Query Execution Environment

The ADL query can be executed locally or in the Azure ADL Account. The input data and output data can be placed locally as well as in the Azure ADL account, but while you execute the query locally it may not store (Known bug) the output in the defined location and instead it will store in the  below path.

“C:\Users\UserAccount<>\AppData\Local\USQLDataRoot”

Note: To check the exact path, follow the below screenshots.

Step -1


Step 2

Key Things Learned

Always use “/” as the path Separator

Likewise, stick the "/" separator as it works both locally and for remote execution unlike "\" which only works on local execution.

U-SQL Keywords are Upper-Case

U-SQL is very case sensitive, for e.g. writing extract instead of EXTRACT will cause a compilation error, so make sure to use EXTRACT, FROM, and USING in upper case.

About Header rows

Azure ADL doesn’t understand the first row that contains column headers, so make sure the input data doesn’t contain any header column. Similarly the output data produced by ADL will also not contain any column header.  However, this feature is expected to be added while the ADL and store generally available.