Analytics Workshop Sofia2. Notebook. (Part 2/4)

The goal of this workshop is to create a recommendation system based on user ratings. The workshop is based on one of the exercises proposed at the Spark Summit.

 

We’ll use one of the MovieLens datasets that already reside on the platform. We’ll do it in four steps:

 

  • Ingestion and data preparation using Pipelines.
  • Creating the model using a Notebook.
  • Ontology Generation.
  • Creating a simple display.

 

With the help of Sofia2 notebooks, we are going to build the movie recommendation model from the data we uploaded to the platform in the previous exercise. We propose to do it with Spark using Scala; more concretely, we’ll use the Alternating Least Squares (ALS) algorithm.

 

 

 

Defining the input data paths

 

The first step is to read the movie data and the ratings, and to do this you have to define the data paths. Define the ratings_path and movies_path variables with the paths where you loaded the data on the platform. For example:

 

Definition of the downloaded data paths

 

image340

 

Tip: If you do this workshop at Sofia2.com/console, replace ‘sofia2-analytic:8020’ with ‘localhost:8020’.

 

Structure the data

 

The next thing to do is to load the movie information and the ratings. We’ll read this information into Spark RDDs.

 

You need to define a specific format for each: movies as (movieId, movieName) and ratings as (timestamp % 10, Rating(userId, movieId, rating)).

 

We also take this opportunity to import the MLlib classes used in the example. In particular, you need ALS, Rating and MatrixFactorizationModel.

 

Save data in RDDs

image341
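
As a rough illustration of the two formats described above, here is a minimal sketch in plain Scala. It assumes the MovieLens "::"-separated layout (userId::movieId::rating::timestamp for ratings and movieId::title::genres for movies); the dataset on your platform may use a different delimiter. The Rating case class is a local stand-in that mirrors the fields of the MLlib Rating class, and parseRating and parseMovie are hypothetical helper names.

```scala
// Local stand-in for org.apache.spark.mllib.recommendation.Rating,
// so the sketch runs without Spark.
case class Rating(user: Int, product: Int, rating: Double)

// Assumed line layout: userId::movieId::rating::timestamp
def parseRating(line: String): (Long, Rating) = {
  val fields = line.split("::")
  // Key each rating by the last digit of its timestamp.
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

// Assumed line layout: movieId::title::genres
def parseMovie(line: String): (Int, String) = {
  val fields = line.split("::")
  (fields(0).toInt, fields(1))
}

val sampleRating = parseRating("1::1193::5::978300760")
val sampleMovie = parseMovie("1193::One Flew Over the Cuckoo's Nest (1975)::Drama")
```

In the notebook, functions like these would typically be passed to map on the RDDs obtained by reading ratings_path and movies_path.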

Data checks

 

Now check that the data has been read correctly. How many ratings did you load? How many movies are in the catalog? How many movies have been rated, and by how many users?

image342
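
The four questions above map to simple counts. The sketch below answers them on a small in-memory sample, using plain Scala collections and made-up data; on Spark RDDs the equivalent operations would be count() and distinct().count(). The Rating case class is again a local stand-in for the MLlib class.

```scala
// Local stand-in for the MLlib Rating class.
case class Rating(user: Int, product: Int, rating: Double)

// Made-up sample data, for illustration only.
val ratings = Seq(
  Rating(1, 10, 4.0), Rating(1, 20, 3.0), Rating(2, 10, 5.0), Rating(3, 30, 2.0)
)
val movies = Map(10 -> "Movie A", 20 -> "Movie B", 30 -> "Movie C", 40 -> "Movie D")

val numRatings = ratings.size                              // ratings loaded
val numMoviesInCatalog = movies.size                       // movies in the catalog
val numRatedMovies = ratings.map(_.product).distinct.size  // movies with at least one rating
val numUsers = ratings.map(_.user).distinct.size           // users who rated something
```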

 

Split Dataset

 

Before building the model, the dataset must be split into three parts: one for training (60%), one for validation (20%) and one for testing (20%).

image343
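
One way to obtain an approximate 60/20/20 split is to use the timestamp % 10 key attached to each rating. The sketch below assumes that approach (keys 0 to 5 for training, 6 and 7 for validation, 8 and 9 for testing); the thresholds are an assumption, and the proportions only hold if the last digit of the timestamps is roughly uniform.

```scala
// Local stand-in for the MLlib Rating class.
case class Rating(user: Int, product: Int, rating: Double)

// Made-up sample: one rating for each possible key 0..9.
val keyed: Seq[(Long, Rating)] =
  (0 until 10).map(k => (k.toLong, Rating(k, 100 + k, 3.0)))

// Keys 0-5 -> training, 6-7 -> validation, 8-9 -> test.
val training = keyed.filter { case (key, _) => key < 6 }.map(_._2)
val validation = keyed.filter { case (key, _) => key >= 6 && key < 8 }.map(_._2)
val test = keyed.filter { case (key, _) => key >= 8 }.map(_._2)
```

The same filter and map calls exist on Spark RDDs, so the logic carries over to the notebook unchanged.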

Function to evaluate the model

 

Once the data is split, define the function that will evaluate the performance of the model. In particular, we will use the Root Mean Squared Error (RMSE), and this is the version in Scala:

image344
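
For reference, RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The following is a minimal sketch of that formula on plain Scala collections, not the notebook's Spark version; computeRmse and its predict parameter are hypothetical names, with predict standing in for a trained model.

```scala
// Local stand-in for the MLlib Rating class.
case class Rating(user: Int, product: Int, rating: Double)

// predict stands in for a trained model: (userId, movieId) => predicted rating.
def computeRmse(predict: (Int, Int) => Double, data: Seq[Rating]): Double = {
  require(data.nonEmpty, "data must not be empty")
  val squaredErrors = data.map { r =>
    val error = predict(r.user, r.product) - r.rating
    error * error
  }
  math.sqrt(squaredErrors.sum / data.size)
}

// A constant predictor of 3.0 is off by exactly 1 on both sample ratings.
val sample = Seq(Rating(1, 10, 4.0), Rating(1, 20, 2.0))
val rmse = computeRmse((_, _) => 3.0, sample)
```

A lower RMSE on the validation set means the predicted ratings are closer to the ratings the users actually gave.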

Choice of model

 

Now you can use this function to choose the parameters for the training algorithm. The ALS algorithm requires three parameters: the rank of the factor matrices (the number of latent factors), the number of iterations, and the regularization parameter lambda. We will define several values for each parameter and try different combinations to determine which one gives the best model:

image345
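
The search over parameter combinations can be sketched as a small grid search. In the sketch below the candidate values are hypothetical, and validationRmse is a placeholder with a made-up formula so that the code runs without Spark; in the notebook it would train an ALS model with the given parameters and return its RMSE on the validation set.

```scala
// Hypothetical candidate values; the workshop's actual values are in the notebook.
val ranks = Seq(8, 12)
val lambdas = Seq(0.1, 10.0)
val numIters = Seq(10, 20)

// Placeholder for "train ALS with these parameters and return the validation RMSE".
// The formula is invented purely for illustration.
def validationRmse(rank: Int, lambda: Double, numIter: Int): Double =
  1.0 + lambda / 10.0 + 1.0 / rank + 1.0 / numIter

// Every combination of rank, lambda and number of iterations.
val candidates = for {
  rank <- ranks
  lambda <- lambdas
  numIter <- numIters
} yield (rank, lambda, numIter)

// Keep the combination with the lowest validation error.
val best = candidates.minBy { case (rank, lambda, numIter) =>
  validationRmse(rank, lambda, numIter)
}
```

Selecting on the validation set and keeping the test set aside is what makes the final test RMSE an honest estimate of the chosen model's quality.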

 

Which one do you think is the best model?
Now let’s run our evaluation function on the test data.

image346

Generating user recommendations

 

Once the best model is chosen, the next step is to obtain the movie recommendations for each user. The idea is to ask for a user ID, which in this dataset is numeric. We’ll do it with a form: first ask for the user, enter the ID in a text field, and finally produce the recommendations. To ask for the user:

image347

 

For this example, we configured it to show the 10 best recommendations for the user entered in the text field.

image348
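
Conceptually, the top 10 for a user is obtained by sorting that user's predicted ratings in descending order and keeping the first 10. The sketch below shows this on plain Scala collections with made-up predictions; topN is a hypothetical helper. With MLlib, the recommendProducts(user, num) method of MatrixFactorizationModel should serve the same purpose.

```scala
// Local stand-in for the MLlib Rating class.
case class Rating(user: Int, product: Int, rating: Double)

// Keep the n highest-scored predictions for one user.
def topN(predictions: Seq[Rating], user: Int, n: Int): Seq[Rating] =
  predictions.filter(_.user == user).sortBy(r => -r.rating).take(n)

// Made-up predictions: 15 movies for user 7 and one for user 8.
val predictions = (1 to 15).map(i => Rating(7, i, i / 3.0)) :+ Rating(8, 99, 5.0)
val top10 = topN(predictions, user = 7, n = 10)
```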

 

Persist the recommendations

 

Now we only have to save the best recommendations for each user in an ontology. The idea is to save records of the form: UserId, MovieName, MovieGenre.

image349
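
Building records of that form amounts to looking up each recommended movie ID in the catalog. Here is a minimal sketch on plain Scala collections; the Movie and Recommendation case classes, the toRecords helper and the sample data are all hypothetical, and the Rating case class is a local stand-in for the MLlib class.

```scala
// Local stand-in for the MLlib Rating class.
case class Rating(user: Int, product: Int, rating: Double)
case class Movie(id: Int, name: String, genre: String)
case class Recommendation(userId: Int, movieName: String, movieGenre: String)

// Look up each recommended movie in the catalog; unknown movie IDs are skipped.
def toRecords(recs: Seq[Rating], catalog: Map[Int, Movie]): Seq[Recommendation] =
  recs.flatMap { r =>
    catalog.get(r.product).map(m => Recommendation(r.user, m.name, m.genre))
  }

// Made-up catalog and recommendations; movie 30 is not in the catalog.
val catalog = Map(
  10 -> Movie(10, "Movie A", "Drama"),
  20 -> Movie(20, "Movie B", "Comedy")
)
val records = toRecords(
  Seq(Rating(1, 10, 4.5), Rating(1, 30, 4.0), Rating(2, 20, 3.9)),
  catalog
)
```

In the notebook, records like these would be turned into a DataFrame before being written to the table.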

 

Finally, we create the Hive table from the data stored in the DataFrame. Replace the table name shown in the image, “recomendaciones_arturo”, with a unique identifier, for example recomendaciones_yourname.

image350
