The goal of this workshop is to create a recommendation system based on user ratings. The workshop is based on one of the exercises proposed at the Spark Summit.
We’ll use one of the MovieLens datasets that are already available on the platform. We’ll do it in four steps:
- Ingestion and data preparation using Pipelines.
- Creating the model using a Notebook.
- Ontology Generation.
- Creating a simple display.
Using Sofia2 notebooks, we are going to build the movie recommendation model from the data we uploaded to the platform in the previous exercise. We propose doing this with Spark and Scala. More specifically, we’ll use the ALS (alternating least squares) algorithm.
Input data paths Definition
The first step is to read the movie data and the ratings. To do this, you have to define the data paths: set the ratings_path and movies_path variables to the locations where you loaded the data onto the platform. For example:
Downloaded data paths Definition
Tip: If you do this workshop at Sofia2.com/console, replace ‘sofia2-analytic:8020’ with ‘localhost:8020’.
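As a minimal sketch of these definitions: only the host and port below come from the tip above. The `hdfs://` scheme, the directory, and the file names are hypothetical placeholders that you should replace with the locations you used during ingestion.

```scala
// Only "sofia2-analytic:8020" comes from the workshop text.
// The hdfs:// scheme, directory and file names are assumptions.
val hdfsHost = "sofia2-analytic:8020"       // use "localhost:8020" on Sofia2.com/console
val dataDir  = "/user/yourname/movielens"   // placeholder directory (assumption)

val ratings_path = s"hdfs://$hdfsHost$dataDir/ratings.dat" // placeholder file name
val movies_path  = s"hdfs://$hdfsHost$dataDir/movies.dat"  // placeholder file name
```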
Structure the data
The next thing to do is to load the movie information and the ratings. We’ll read this information into Spark RDDs.
You need to define a specific format for each of them: (movieId, movieName) for movies and (timestamp % 10, Rating(userId, movieId, rating)) for ratings.
Save data in RDDs
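The two formats above could be produced by parsing functions along these lines. This is a sketch, assuming the `::`-separated layout of the MovieLens `.dat` files. The `Rating` case class is a local stand-in for `org.apache.spark.mllib.recommendation.Rating`, so that the snippet runs without Spark, and the function names are illustrative.

```scala
// Local stand-in for MLlib's Rating(user, product, rating); assumption for a Spark-free sketch.
case class Rating(user: Int, product: Int, rating: Double)

// Assumes lines shaped like "userId::movieId::rating::timestamp".
// The key (timestamp % 10) is used later to split the dataset.
def parseRating(line: String): (Long, Rating) = {
  val fields = line.split("::")
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

// Assumes lines shaped like "movieId::title::genres".
def parseMovie(line: String): (Int, String) = {
  val fields = line.split("::")
  (fields(0).toInt, fields(1))
}

// Illustrative sample lines
val sampleRating = parseRating("1::1193::5::978300760")
val sampleMovie  = parseMovie("1193::One Flew Over the Cuckoo's Nest (1975)::Drama")
```

In the notebook, the same functions would typically be applied to the files with `sc.textFile(ratings_path).map(parseRating)` and `sc.textFile(movies_path).map(parseMovie)`, assuming `sc` is the SparkContext provided by the notebook.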
Now, check that the data has been read correctly. How many ratings did you load? How many movies are in the catalog? How many movies have been rated? And how many users rated them?
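These questions come down to a few counts. The sketch below answers them on a tiny made-up sample held in plain Scala collections. The data and the `Rating` stand-in are assumptions for illustration only.

```scala
// Local stand-in for MLlib's Rating; toy data, not taken from MovieLens.
case class Rating(user: Int, product: Int, rating: Double)

val ratings = Seq(
  Rating(1, 10, 4.0), Rating(1, 20, 3.0),
  Rating(2, 10, 5.0), Rating(3, 30, 2.0)
)
val movies = Map(10 -> "Movie A", 20 -> "Movie B", 30 -> "Movie C", 40 -> "Movie D")

val numRatings       = ratings.size                          // ratings loaded
val numCatalogMovies = movies.size                           // movies in the catalog
val numRatedMovies   = ratings.map(_.product).distinct.size  // movies with at least one rating
val numUsers         = ratings.map(_.user).distinct.size     // users who rated something

println(s"$numRatings ratings from $numUsers users on $numRatedMovies of $numCatalogMovies movies")
```

On RDDs the equivalent calls are `count()` and `distinct().count()`.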
Before building the model, the dataset must be split into three parts: one for training (60%), one for validation (20%) and one for testing (20%).
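One way to get this split is to use the key (timestamp % 10) attached to each rating. Keys 0 to 5 go to training, 6 and 7 to validation, 8 and 9 to testing. This is a sketch on toy data in plain collections, and the thresholds are an assumption consistent with the 60/20/20 proportions. On real data the proportions are only approximate, since they depend on how the last digit of the timestamps is distributed.

```scala
// Local stand-in for MLlib's Rating; toy data with one rating per key 0..9.
case class Rating(user: Int, product: Int, rating: Double)

val keyed: Seq[(Long, Rating)] =
  (0 until 10).map(k => (k.toLong, Rating(k, 100 + k, 3.0)))

// Keys 0-5: training, keys 6-7: validation, keys 8-9: test
val training   = keyed.filter { case (k, _) => k < 6 }.map(_._2)
val validation = keyed.filter { case (k, _) => k >= 6 && k < 8 }.map(_._2)
val test       = keyed.filter { case (k, _) => k >= 8 }.map(_._2)
```

The same `filter` and `map` (or `values`) calls work on an RDD of keyed ratings.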
Function to evaluate the model
Once the data is split, define the function that will evaluate the performance of the model. In particular, we will use the Root Mean Squared Error (RMSE), and this is the version in Scala:
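For reference, the arithmetic behind RMSE can be sketched without Spark as follows. The `predict` parameter is a hypothetical stand-in for the trained model, and `computeRmse` is an illustrative name. The notebook version would perform the same computation on RDDs, joining predictions and actual ratings on (user, movie).

```scala
// Local stand-in for MLlib's Rating.
case class Rating(user: Int, product: Int, rating: Double)

// RMSE = sqrt( mean( (predicted - actual)^2 ) )
// predict: hypothetical stand-in for the model's prediction for (user, movie).
def computeRmse(predict: (Int, Int) => Double, data: Seq[Rating]): Double = {
  val squaredErrors = data.map { r =>
    val err = predict(r.user, r.product) - r.rating
    err * err
  }
  math.sqrt(squaredErrors.sum / data.size)
}

// Toy check: a "model" that always predicts 3.0
val sample = Seq(Rating(1, 10, 4.0), Rating(2, 10, 2.0))
val rmse = computeRmse((user: Int, movie: Int) => 3.0, sample)
```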
Choice of model
Now you can use this function to choose the parameters for the training algorithm. The ALS algorithm requires three parameters: the rank of the factor matrices (the number of latent factors), the number of iterations and lambda (the regularization parameter). We will define several values for each of these parameters and try different combinations of them to determine which one gives the lowest RMSE on the validation set:
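The search over parameter combinations can be sketched as a simple grid. The candidate values below are illustrative assumptions, and `validationRmse` is a hypothetical stand-in with a made-up formula, used only so that the snippet runs on its own.

```scala
// Candidate values: illustrative assumptions, not necessarily the workshop's grid.
val ranks    = List(8, 12)
val lambdas  = List(0.1, 10.0)
val numIters = List(10, 20)

// Hypothetical stand-in. In the notebook this would train ALS on the training
// set and return the RMSE measured on the validation set.
def validationRmse(rank: Int, lambda: Double, numIter: Int): Double =
  0.85 + math.abs(rank - 12) * 0.01 + math.abs(lambda - 0.1) * 0.1 + math.abs(numIter - 20) * 0.001

// Evaluate every combination and keep the one with the lowest validation RMSE
val candidates = for {
  rank    <- ranks
  lambda  <- lambdas
  numIter <- numIters
} yield (rank, lambda, numIter, validationRmse(rank, lambda, numIter))

val best = candidates.minBy(_._4)
println(s"Best: rank=${best._1}, lambda=${best._2}, numIter=${best._3}, RMSE=${best._4}")
```

In the notebook, the body of the stand-in would call `ALS.train(training, rank, numIter, lambda)` from `org.apache.spark.mllib.recommendation` and then evaluate the resulting model on the validation set.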
Which one do you think is the best model?
Now let’s run our evaluation function on the test data.
User recommendations Performance
Once the best model is chosen, the next step is to obtain the movie recommendations for a given user. The user is identified by an ID, which in this dataset is numeric. We’ll do this with a form: the notebook first asks for the user ID in a text field and then produces the recommendations. To ask for the user:
For this example, we configured it to show the 10 best recommendations for the user entered in the text field.
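The selection of the best recommendations can be sketched as: drop the movies the user has already rated, sort by predicted rating, and keep the first n. The data and the helper name `topN` are assumptions for illustration. In the notebook the predictions would come from the trained model.

```scala
// Local stand-in for MLlib's Rating; here "rating" holds the predicted score.
case class Rating(user: Int, product: Int, rating: Double)

// Returns the names of the n best-scored movies the user has not rated yet.
def topN(predictions: Seq[Rating], alreadyRated: Set[Int],
         movies: Map[Int, String], n: Int): Seq[String] =
  predictions
    .filterNot(r => alreadyRated.contains(r.product)) // skip movies already rated
    .sortBy(r => -r.rating)                           // highest predicted rating first
    .take(n)
    .flatMap(r => movies.get(r.product))              // map ids to names, drop unknown ids

// Toy data (hypothetical)
val movies = Map(1 -> "A", 2 -> "B", 3 -> "C")
val predictions = Seq(Rating(7, 1, 4.5), Rating(7, 2, 4.9), Rating(7, 3, 3.0))
val topTwo = topN(predictions, alreadyRated = Set(2), movies = movies, n = 2)
```

MLlib's `MatrixFactorizationModel` also provides `recommendProducts(user, num)`, which returns the top `num` products for a user directly.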
Persist the recommendations
Now we only have to save the best recommendations for each user in an ontology. The idea is to save records of the form: UserId, MovieName, MovieGenre.
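A record of this form could be modeled with a case class, as in the sketch below. The names `Recommendation` and `toRecords`, and the catalog layout (movieId -> (name, genre)), are assumptions for illustration.

```scala
// One record per recommended movie: UserId, MovieName, MovieGenre
case class Recommendation(userId: Int, movieName: String, movieGenre: String)

// catalog: movieId -> (name, genre); ids missing from the catalog are skipped.
def toRecords(userId: Int, movieIds: Seq[Int],
              catalog: Map[Int, (String, String)]): Seq[Recommendation] =
  movieIds.flatMap { id =>
    catalog.get(id).map { case (name, genre) => Recommendation(userId, name, genre) }
  }

// Toy data (hypothetical)
val catalog = Map(1 -> ("Movie A", "Drama"), 2 -> ("Movie B", "Comedy"))
val records = toRecords(7, Seq(2, 1, 99), catalog)
```

A collection of such case-class records is a convenient starting point for building the DataFrame that is persisted in the next step.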
We create the Hive table from the data stored in the DataFrame. Replace the table name shown in the image, “recomendaciones_arturo”, with a unique identifier, for example recomendaciones_yourname.