Analytics Workshop Sofia2. Notebook. (Part 2/4)

The goal of this workshop is to create a recommendation system based on user ratings. The workshop is based on one of the exercises proposed at the Spark Summit.


We’ll use one of the MovieLens datasets that already reside on the platform. We’ll do it in four steps:


  • Ingestion and data preparation using Pipelines.
  • Creating the model using a Notebook.
  • Ontology Generation.
  • Creating a simple display.


With the help of Sofia2 notebooks we are going to generate the movie recommendation model using the data we uploaded to the platform in the previous exercise. We propose to carry it out with Spark using Scala and, more specifically, with the ALS (Alternating Least Squares) algorithm.




Input data paths Definition


The first step is to read the movie data and the ratings. To do this, you have to define the data paths. Define the ratings_path and movies_path variables with the paths where you loaded the data onto the platform. For example:
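As a minimal sketch, the two variables could be defined as follows. The host and port are the ones mentioned in the tip in this section; the directory and file names are hypothetical placeholders, so replace them with the locations you used in the previous part.

```scala
// NameNode host and port, as mentioned in the tip in this section.
val hdfsHost = "sofia2-analytic:8020"

// Hypothetical directory: replace it with the folder you loaded the data into.
val baseDir = "/user/your_user/movielens"

// Hypothetical file names: adjust them to the files you actually uploaded.
val ratings_path = s"hdfs://$hdfsHost$baseDir/ratings.dat"
val movies_path  = s"hdfs://$hdfsHost$baseDir/movies.dat"
```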


Downloaded data paths Definition




Tip: depending on where you run this workshop, you may have to replace ‘sofia2-analytic:8020’ with ‘localhost:8020’.


Structure the data


The next thing to do is to load the movie information and the ratings. We’ll read this information through Spark RDDs.


You need to define a specific format for each of them: (movieId, movieName) for movies and (timestamp % 10, Rating(userId, movieId, rating)) for ratings.


We also take the opportunity to import the MLlib classes we will use in the example. In particular, you need ALS, Rating and MatrixFactorizationModel.
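The parsing step can be sketched as below. It assumes that both files use "::" as the field separator, with ratings laid out as userId::movieId::rating::timestamp and movies as movieId::title::genres; check this against the files you uploaded. The function names parseRating, parseMovie, loadRatings and loadMovies are illustrative, not taken from the workshop notebook.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating, MatrixFactorizationModel}

// Assumption: a ratings line looks like "userId::movieId::rating::timestamp".
// The key is the last digit of the timestamp (timestamp % 10).
def parseRating(line: String): (Long, Rating) = {
  val fields = line.split("::")
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

// Assumption: a movie line looks like "movieId::title::genres".
def parseMovie(line: String): (Int, String) = {
  val fields = line.split("::")
  (fields(0).toInt, fields(1))
}

// Ratings stay distributed as an RDD of (timestamp % 10, Rating).
def loadRatings(sc: SparkContext, path: String): RDD[(Long, Rating)] =
  sc.textFile(path).map(parseRating)

// The movie catalog is read through an RDD and then collected into a
// local Map, on the assumption that it is small enough to fit in memory.
def loadMovies(sc: SparkContext, path: String): Map[Int, String] =
  sc.textFile(path).map(parseMovie).collect().toMap
```

In the notebook, where a SparkContext named sc is normally already available, this would be used as `val ratings = loadRatings(sc, ratings_path)` and `val movies = loadMovies(sc, movies_path)`.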


Save data in RDD


Data checks


Now, check that the data has been read. How many ratings did you load? How many films are in the catalog? How many movies have been rated? And how many users have rated them?
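One possible way to answer these questions is sketched below. It assumes the ratings are held as an RDD of (timestamp % 10, Rating) pairs, as described above; the function name ratingStats is illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Returns (number of ratings, number of distinct users, number of distinct rated movies).
def ratingStats(ratings: RDD[(Long, Rating)]): (Long, Long, Long) = {
  val numRatings = ratings.count()
  val numUsers   = ratings.map(_._2.user).distinct().count()
  val numMovies  = ratings.map(_._2.product).distinct().count()
  (numRatings, numUsers, numMovies)
}
```

The size of the catalog itself comes from the movie data rather than from the ratings, for example by counting the entries of the movie collection you loaded.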



Split Dataset


Before building the model, the dataset must be split into three parts: one for training (60%), one for validation (20%) and one for testing (20%).
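A sketch of one way to do the split follows. It assumes that the timestamp % 10 key attached to each rating is meant for this purpose: keys 0 to 5 go to training, 6 and 7 to validation, and 8 and 9 to testing, which gives roughly 60/20/20 if the last digits of the timestamps are evenly spread. The function name splitRatings is illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Splits keyed ratings into (training, validation, test) using the key in [0, 9].
def splitRatings(ratings: RDD[(Long, Rating)]): (RDD[Rating], RDD[Rating], RDD[Rating]) = {
  val training   = ratings.filter { case (key, _) => key < 6 }.map(_._2).cache()
  val validation = ratings.filter { case (key, _) => key >= 6 && key < 8 }.map(_._2).cache()
  val test       = ratings.filter { case (key, _) => key >= 8 }.map(_._2).cache()
  (training, validation, test)
}
```

The three RDDs are cached because each of them is read several times while training and evaluating the candidate models.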


Function to evaluate the model


Once the data is split, define the function that will evaluate the performance of the model. In particular, we will use the Root Mean Squared Error (RMSE), and this is the version in Scala:
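A minimal sketch of such a function, written against MLlib's API rather than copied from the workshop notebook, might look like this. The name computeRmse is an assumption, and the error is averaged over the (user, movie) pairs that the model is able to score.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// RMSE = sqrt(mean((predicted - actual)^2)) over the pairs the model can score.
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  // Ask the model for a prediction for every (user, movie) pair in the data.
  val predictions = model.predict(data.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  // Pair each prediction with the actual rating given by the user.
  val actuals = data.map(r => ((r.user, r.product), r.rating))
  val squaredErrors = predictions.join(actuals).values
    .map { case (predicted, actual) => (predicted - actual) * (predicted - actual) }
  math.sqrt(squaredErrors.mean())
}
```

A lower value means the predicted ratings are closer to the real ones.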


Choice of model


Now you can use this function to choose the parameters for the training algorithm. The ALS algorithm requires three parameters: the rank of the factor matrices, the number of iterations and a regularization parameter, lambda. We will define several values for each parameter and try different combinations of them to determine which one works best:
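The search over parameter combinations can be sketched as follows. The names Candidate and selectBestModel are illustrative, and the evaluation function is passed in as a parameter so that any RMSE implementation can be plugged in. ALS.train is MLlib's explicit-feedback trainer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// One trained model together with its parameters and its validation error.
case class Candidate(model: MatrixFactorizationModel, rank: Int, lambda: Double,
                     numIter: Int, rmse: Double)

// Trains one model per parameter combination and keeps the one with the lowest error.
def selectBestModel(
    training: RDD[Rating],
    validation: RDD[Rating],
    ranks: Seq[Int],
    lambdas: Seq[Double],
    numIters: Seq[Int],
    evaluate: (MatrixFactorizationModel, RDD[Rating]) => Double): Candidate = {
  val candidates = for {
    rank    <- ranks
    lambda  <- lambdas
    numIter <- numIters
  } yield {
    val model = ALS.train(training, rank, numIter, lambda)
    val rmse  = evaluate(model, validation)
    println(s"rank=$rank lambda=$lambda iterations=$numIter validation RMSE=$rmse")
    Candidate(model, rank, lambda, numIter, rmse)
  }
  candidates.minBy(_.rmse)
}
```

The grid itself is up to you; values such as ranks Seq(8, 12), lambdas Seq(0.1, 10.0) and iterations Seq(10, 20) are only an illustration. Note that every combination trains a full model, so the running time grows with the product of the three list sizes.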



Which one do you think is the best model?
Now let’s run our function on the test data.


User recommendations Performance


Once the best model is chosen, the next step is to get the movie recommendations for a given user. In the dataset used here, users are identified by a numeric ID. We’ll do it as a form: first ask for the user ID through a text field, and then launch the recommendation. To ask for the user:



For this example, we configured it to show the 10 best recommendations for the user entered in the text field.
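Getting the top recommendations for one user can be sketched as below. It relies on MLlib's recommendProducts method and assumes the movie catalog is available as a local Map from movie ID to title; the name topMovies is illustrative.

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Returns the n best (movie title, predicted rating) pairs for a user, best first.
def topMovies(model: MatrixFactorizationModel, movies: Map[Int, String],
              userId: Int, n: Int = 10): List[(String, Double)] =
  model.recommendProducts(userId, n).toList.map { r =>
    // Fall back to a readable label if the ID is missing from the catalog.
    (movies.getOrElse(r.product, s"unknown movie ${r.product}"), r.rating)
  }
```

The userId argument would come from the text field of the form. In Zeppelin-style notebooks this is usually done with a dynamic form such as `z.input("userId")`; whether that applies to your Sofia2 notebook is an assumption to check.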



Persist the recommendations


Now we only have to save the best recommendations for each user in an ontology. The idea is to save records of the form: UserId, MovieName, MovieGenre.



We create the HIVE table from the data stored in the DataFrame. Replace the table name shown in the image, “recomendaciones_arturo”, with a unique identifier, for example recomendaciones_yourname.
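A sketch of this last step is shown below. It assumes the recommendations have already been arranged as an RDD of (userId, movieName, movieGenre) tuples and that the notebook provides a Hive-enabled SQLContext; the function name saveRecommendations and the column names are illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Turns (userId, movieName, movieGenre) tuples into a DataFrame and stores it as a table.
def saveRecommendations(sqlContext: SQLContext,
                        rows: RDD[(Int, String, String)],
                        tableName: String): Unit = {
  val df = sqlContext.createDataFrame(rows).toDF("userId", "movieName", "movieGenre")
  // "overwrite" replaces the table if it already exists, so the cell can be re-run.
  df.write.mode("overwrite").saveAsTable(tableName)
}
```

For example, `saveRecommendations(sqlContext, rows, "recomendaciones_yourname")`. The rows could be built from MLlib's recommendProductsForUsers, which returns the top recommendations for every user at once, joined with the movie names and genres.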

