Monday, April 11, 2016

EXTRACTing data from Multiple Files in U-SQL

We’ve seen the EXTRACT operator used to read data from a single file, as in the example below. But what if we want to extract data from multiple files, whether or not they are similarly named, and whether they sit in the same location or in different ones?

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

The example below shows how to read data from multiple files in different locations.

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM 
        @"/Samples/Data/SearchLog1.tsv",
        @"/Samples/Data/SearchLog2.tsv",
        @"/Samples/Data/Innerfolder/SearchLog3.tsv"
    USING Extractors.Tsv();

Important pointers to note: 

  • The name of each input file must be specified explicitly.
  • If any of the specified input files does not exist, the script will fail to compile.
  • Although it is not obvious from this example, the files do not need to share a naming pattern or a location.
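
To make the last point concrete, here is a minimal sketch of a complete script that reads three unrelated files and writes the combined rowset back out. Apart from /Samples/Data/SearchLog.tsv, the input paths and the output path are hypothetical and are there only to show that the names and folders need not match. The sketch assumes every file is tab-separated and matches the schema declared in the EXTRACT.

```usql
// Input paths other than SearchLog.tsv are hypothetical; they only
// illustrate that file names and folders need not be related.
@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int,
            Urls            string,
            ClickedUrls     string
    FROM
        @"/Samples/Data/SearchLog.tsv",
        @"/Archive/2015/old_queries.tsv",
        @"/Users/Shared/export.tsv"
    USING Extractors.Tsv();

// Write the combined rows to a single file (hypothetical output path).
OUTPUT @searchlog
    TO @"/Samples/Output/SearchLogCombined.tsv"
    USING Outputters.Tsv();
```

Because a single EXTRACT applies one schema to all of its inputs, rows from every listed file end up in the same @searchlog rowset.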