Analytics Workshop Sofia2. Data Ingestion. (Part 1/4)

The goal of this workshop is to create a recommendation system based on user ratings. The workshop is based on one of the exercises proposed at the Spark Summit.

We’ll use one of the Movielens datasets that already reside on the platform. We’ll do it in four steps:

  • Ingestion and data preparation using Pipelines.
  • Creating the model using a Notebook.
  • Ontology Generation.
  • Creating a simple display.

Data Ingestion

Pipeline Creation

We’ll perform the ingestion of the movies data with the Dataflow. The first thing to do is to create a Pipeline from scratch. Go to the ‘Analytics’ menu and click on ‘My Pipelines’. Click on ‘Create’, then write the name of the Pipeline and a description. The timer does not apply to this exercise:


After creating the pipeline, you can navigate to the workspace, where you will build the information flow.

Source Component Definition

The data is already downloaded on the Sofia2 machine. Depending on the environment, it may sit under one of two directories: the path is “/datadrive/movielens” in one environment, while “/datadrive/ftp/movielens” is used on sofia2-analytic. In this directory there should be two files: movies.dat and ratings.dat. For this pipeline we are interested in the movies data.

If the files are missing, you will have to download them for this workshop.

First of all, you will need to create the Data Source. Since the files already reside on the Sofia2 machine, the required component is ‘Directory’. Click on the component and it will appear in the workspace. You’ll see error alerts; don’t worry, a newly created component has all of its required configuration parameters empty. The next step is to fill them in.


Click on the component and you will access its configuration. For the local directory source, the required configuration parameters are:

  • Files Data Format: Represents the input data format. There are different options, but the one needed in this example is ‘Text’.
  • Files Directory: The input directory where the files to read are located. In our case this path is /datadrive/ftp/movielens (in the other environment, it is /datadrive/movielens).
  • Files Name Pattern: The regular expression used to find the files to load within the directory configured in the previous parameter. We are interested in reading a single file, so set this field to ‘movies.dat’.

Depending on the chosen input format, the corresponding tab is activated in the configuration window. You will see that, in this case, the active tab is ‘Text’. It only has one parameter (‘Max Line Length’), which has a default value that we won’t modify.

Now you have the source configured. Before going further, it is highly recommended to take a look at the data to be read. To do this, we can configure a “dummy” destination and preview the information. Go to the ‘Destinations’ components and choose ‘Trash’. As before, when you click on the icon, the component appears in the workspace. Connect the origin and the destination, and this flow is almost ready.

As you’ll notice, there are still configuration errors. This is because the handling of erroneous records must be defined in the general configuration. Just click anywhere other than on the components, and you will see the alert in the ‘Error Records’ tab, located in the bottom window.


Choose ‘Discard’ among these options. After this operation there should be no errors, but we will validate the flow anyway. Click on the ‘Validate’ button, in the top menu:


If everything is correct, it will display an OK message.

Now we can do the preview. The button just to the left of ‘Validate’ is ‘Preview’. Click on it and a window will appear with some configuration options. Be aware of the ‘Write to destination’ checkbox: if it is checked, in addition to previewing the data, it will write the data to the destination. For now just uncheck it and click on ‘Run Preview’:


You can now see in ‘Input Data’ what each component reads in each record. If you click on the ‘Directory’ component, you will see its generated output, and if you click on ‘Trash’, what it receives. In this case they are the same.

Data Processing

Now we are going to prepare the data. As you could see in the preview, fields are separated by ‘::’. The Dataflow interprets separators as a single character, so ‘::’ cannot be defined as a delimiter. Dealing with that will be the next step.

Before the delimiter change, we will include a field renamer. In the preview, when displaying each record, the defined fields appeared. Since we read in ‘Text’ format, a default field called ‘text’ is generated for each line. This is the field we are going to rename. To do this, within ‘Processors’, click on ‘Field Renamer’.


Create a stream like this:


We are going to configure it. This component is very simple. Click on it, and in the configuration go to the ‘Rename’ tab. In ‘Fields to Rename’ you have to enter the source field and its new name: write ‘/text’ in the ‘From Field’ section and ‘/data’ in the ‘To Field’ one.


You can try previewing to verify you are effectively renaming the field.

Now we can create the component that replaces the delimiter. To carry out this task you can use different processors, specifically all those called ‘Evaluators’. We’ll do it with JavaScript.


Click on the component and create a flow like the following:


Go to the component configuration and open the ‘JavaScript’ tab. You will see a text editor called ‘Script’, which already contains predefined code. This is the template on which we will define our changes. Inside the ‘for’ loop, add the following line of code:

records[i].value['data'] = records[i].value['data'].replace(/::/g, "%");

This line replaces ‘::’ with ‘%’. We have chosen this delimiter because the typical ones (‘;’, ‘,’ and ‘|’) appear in the dataset as part of the fields. Launch the preview again and verify the change has been made correctly.
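Outside the pipeline, the effect of the evaluator can be sketched with plain JavaScript. The record structure below is a minimal stand-in for the batch the evaluator receives, and the sample lines follow the Movielens movies.dat layout:

```javascript
// Minimal stand-in for the batch the JavaScript Evaluator receives:
// each record exposes its fields under `value`, and our renamed field is 'data'.
var records = [
  { value: { data: "1::Toy Story (1995)::Animation|Children's|Comedy" } },
  { value: { data: "2::Jumanji (1995)::Adventure|Children's|Fantasy" } }
];

// Same loop body as in the evaluator template: replace every '::' with '%'.
for (var i = 0; i < records.length; i++) {
  records[i].value['data'] = records[i].value['data'].replace(/::/g, "%");
}

console.log(records[0].value['data']);
// → 1%Toy Story (1995)%Animation|Children's|Comedy
```

Note that the regular expression uses the `g` flag, so every occurrence of ‘::’ in the line is replaced, not just the first one.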

Destination Component Definition


The destination will be ‘Hadoop FS’. Within ‘Destinations’, click on the component and create a flow like the following:

Go to the destination’s settings area. You have to modify three tabs:

Hadoop FS: Corresponds to HDFS connections and routes.

  • Hadoop FS URI: hdfs://
  • HDFS User: cloudera-scm

Output Files: This is the definition of the output files, routes, format, etc.

  • File Type: Text Files
  • Data Format: Text
  • Files Prefix: movie
  • Directory Template: /user/cloudera-scm/movielens/student_alias/

Text: It is the configuration of the format chosen in the previous tab.

  • Text Field Path: /data

Launch the ‘preview’ again and verify that the data arrives correctly to the destination.


If everything seems correct, click on the ‘Start’ button, to the right of the validation button you used previously. You will see that another window opens with the statistics of the data being read, processing times of each component, etc. This window will stop showing new data when the process is finished and all input files are processed. As we don’t need more data, we can stop the pipeline.

Are you now able to do the same for the ‘Ratings’ file?

And what about generating the file in HDFS in delimited format, defining the field names and using ‘;’ as the separator?
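One possible starting point for that second exercise: instead of replacing the delimiter, split the text line into named fields inside the JavaScript Evaluator, so the destination can write them out in ‘Delimited’ format with a custom ‘;’ separator. This is only a sketch, and the field names (movieId, title, genres) are an assumption based on the movies.dat layout, not something defined by the workshop:

```javascript
// Hypothetical evaluator body: turn the raw '::'-separated line into
// named fields instead of one text blob.
var records = [
  { value: { data: "1::Toy Story (1995)::Animation|Children's|Comedy" } }
];

for (var i = 0; i < records.length; i++) {
  var parts = records[i].value['data'].split('::');
  records[i].value = {
    movieId: parts[0],   // assumed field names; adjust as needed
    title: parts[1],
    genres: parts[2]
  };
}

// A ';'-joined preview of what one delimited output record would look like:
var line = [records[0].value.movieId,
            records[0].value.title,
            records[0].value.genres].join(';');
console.log(line);
// → 1;Toy Story (1995);Animation|Children's|Comedy
```

With the record split into fields like this, the delimiter and header handling can be left to the destination’s data format configuration rather than to string manipulation.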

That’s it for the first part of this workshop. In upcoming posts we’ll continue with it, creating Notebooks to build the movie recommendation model using Spark and Scala.

We’ll wait for you!
