Content Filtering Recommender System

Content-based filtering recommends items to a user based on the characteristics of the items and the preferences the user has expressed: by matching item features against the user's tastes, it surfaces suggestions similar to what the user has already shown interest in.
Author

Vraj Shah

Published

September 28, 2023
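
As a minimal sketch of the idea (the item features and preference weights below are invented purely for illustration), a content-based recommender scores each item by how well its feature vector matches a profile of the user's tastes and then ranks items by that score:

import numpy as np

# Hypothetical item feature matrix: rows = items, columns = genres (action, comedy, romance)
item_features = np.array([[0.9, 0.1, 0.0],   # item 0: mostly action
                          [0.1, 0.8, 0.1],   # item 1: mostly comedy
                          [0.0, 0.2, 0.9]])  # item 2: mostly romance

# Hypothetical user profile built from the genres of items the user has liked
user_profile = np.array([0.7, 0.3, 0.0])

# Score each item against the profile and rank from best to worst match
scores = item_features @ user_profile
print(np.argsort(-scores))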

Import Libraries

import numpy as np
import tensorflow as tf
from tensorflow import keras
from numpy import loadtxt
import pandas as pd

Dataset

# Load pre-computed parameters and the ratings data
X = loadtxt('data/small_movies_X.csv', delimiter=",")                  # movie feature matrix
W = loadtxt('data/small_movies_W.csv', delimiter=",")                  # user parameter matrix
b = loadtxt('data/small_movies_b.csv', delimiter=",").reshape(1, -1)   # user bias row vector
Y = loadtxt('data/small_movies_Y.csv', delimiter=",")                  # ratings matrix (movies x users)
R = loadtxt('data/small_movies_R.csv', delimiter=",")                  # R[i, j] = 1 if movie i was rated by user j

movieList_df = pd.read_csv('data/small_movie_list.csv',
                           header=0, index_col=0,  delimiter=',', quotechar='"')
movieList = movieList_df["title"].to_list()

num_movies, num_features = X.shape
num_users, _ = W.shape
print("Y", Y.shape)
print("R", R.shape)
print("X", X.shape)
print("W", W.shape)
print("b", b.shape)
print("num_features", num_features)
print("num_movies",   num_movies)
print("num_users",    num_users)
Y (4778, 443)
R (4778, 443)
X (4778, 10)
W (443, 10)
b (1, 443)
num_features 10
num_movies 4778
num_users 443
mean = np.mean(Y[152, R[152, :].astype(bool)])
print(f"Average rating for movie 153 : {mean:0.3f} / 5")
Average rating for movie 153 : 1.833 / 5
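
With these pre-computed parameters, a single rating prediction for user j and movie i is just the dot product of the user's parameter vector with the movie's feature vector plus the user's bias, the same linear form used in the cost function below. A quick sanity check (movie and user indices chosen arbitrarily for illustration):

i, j = 152, 0   # arbitrary movie and user, purely for illustration
pred = np.dot(W[j], X[i]) + b[0, j]
print(f"Predicted rating for movie {i+1}, user {j+1}: {pred:0.2f}")
print(f"Actual rating (0.0 means unrated): {Y[i, j]}")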

Cost Function

\[ J = \left[ \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right] + \underbrace{\left[ \frac{\lambda}{2} \sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2 + \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2 \right]}_{regularization} \]

\[ = \left[ \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1} r(i,j)\left(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)}\right)^2 \right] + \text{regularization} \]
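
In vectorized form, which is what the implementation below computes, the same cost can be written as

\[ J = \frac{1}{2}\left\lVert \mathbf{R} \odot \left( \mathbf{X}\mathbf{W}^{\top} + \mathbf{b} - \mathbf{Y} \right) \right\rVert_F^2 + \frac{\lambda}{2}\left( \lVert \mathbf{X} \rVert_F^2 + \lVert \mathbf{W} \rVert_F^2 \right) \]

where \(\odot\) is the element-wise product, \(\lVert\cdot\rVert_F\) is the Frobenius norm, and \(\mathbf{b}\) is broadcast across the movie rows.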

def cost_fxn(X, W, b, Y, R, lambda_):
    """
    Regularized squared-error cost over the rated entries.
    X: (num_movies, num_features), W: (num_users, num_features), b: (1, num_users),
    Y: (num_movies, num_users) ratings, R: indicator of which entries were rated.
    """
    # Equivalent NumPy version:
    #   diff = np.dot(X, W.T) + b - Y
    #   squared_error = np.square(diff) * R
    #   J = 0.5 * np.sum(squared_error) + (lambda_ / 2) * \
    #       (np.sum(np.square(W)) + np.sum(np.square(X)))

    # Prediction error on rated entries only, then squared error plus L2 penalty
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y) * R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * \
        (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
J = cost_fxn(X, W, b, Y, R, 0)
print(f"Cost: {J:0.2f}")

J = cost_fxn(X, W, b, Y, R, 1.5)
print(f"Cost (with regularization): {J:0.2f}")
Cost: 270821.25
Cost (with regularization): 306504.87

My Ratings

my_ratings = np.zeros(num_movies)

my_ratings[2700] = 5    # Toy Story 3 (2010)
my_ratings[2609] = 2    # Persuasion (2007)
my_ratings[929] = 5     # Lord of the Rings: The Return of the King, The (2003)
my_ratings[246] = 5     # Shrek (2001)
my_ratings[2716] = 3    # Inception (2010)
my_ratings[1150] = 5    # Incredibles, The (2004)
my_ratings[382] = 2     # Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
my_ratings[366] = 5     # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622] = 5     # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988] = 3     # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 1    # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1    # Nothing to Declare (Rien à déclarer) (2010)
my_ratings[793] = 5     # Pirates of the Caribbean: The Curse of the Black Pearl (2003)

my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]

print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Rated {my_ratings[i]} for {movieList_df.loc[i,"title"]}')

# Prepend the new user's ratings as column 0 of Y and mark them in R
Y = np.c_[my_ratings, Y]

R = np.c_[(my_ratings != 0).astype(int), R]

New user ratings:

Rated 5.0 for Shrek (2001)
Rated 5.0 for Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
Rated 2.0 for Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Rated 5.0 for Harry Potter and the Chamber of Secrets (2002)
Rated 5.0 for Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Rated 5.0 for Lord of the Rings: The Return of the King, The (2003)
Rated 3.0 for Eternal Sunshine of the Spotless Mind (2004)
Rated 5.0 for Incredibles, The (2004)
Rated 2.0 for Persuasion (2007)
Rated 5.0 for Toy Story 3 (2010)
Rated 3.0 for Inception (2010)
Rated 1.0 for Louis Theroux: Law & Disorder (2008)
Rated 1.0 for Nothing to Declare (Rien à déclarer) (2010)
# Mean-normalize: subtract each movie's average rating from its rated entries only
Ymean = (np.sum(Y*R, axis=1)/(np.sum(R, axis=1)+1e-12)).reshape(-1, 1)
Ynorm = Y - np.multiply(Ymean, R)
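
Mean normalization subtracts each movie's average rating from its rated entries only, so that predictions for a user with few ratings default to the movie mean once it is added back. A quick check that the rated entries of each row now average to roughly zero:

# Per-movie mean over the rated (R == 1) entries of Ynorm should be ~0
row_means = np.sum(Ynorm * R, axis=1) / (np.sum(R, axis=1) + 1e-12)
print(np.max(np.abs(row_means)))   # expect a value near 0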

Training the Model

num_movies, num_users = Y.shape
num_features = 100

tf.random.set_seed(1234)
W = tf.Variable(tf.random.normal(
    (num_users,  num_features), dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal(
    (num_movies, num_features), dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal(
    (1,          num_users),   dtype=tf.float64),  name='b')

optimizer = keras.optimizers.Adam(learning_rate=1e-1)
iterations = 201
lambda_ = 1

for iter in range(iterations):
    # Record the forward pass so the tape can compute gradients of the cost
    with tf.GradientTape() as tape:
        cost_value = cost_fxn(X, W, b, Ynorm, R, lambda_)

    # Gradients of the cost with respect to X, W and b
    grads = tape.gradient(cost_value, [X, W, b])

    # One Adam update step
    optimizer.apply_gradients(zip(grads, [X, W, b]))

    if iter % 20 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")
Training loss at iteration 0: 2321191.3
Training loss at iteration 20: 136169.3
Training loss at iteration 40: 51863.7
Training loss at iteration 60: 24599.0
Training loss at iteration 80: 13630.6
Training loss at iteration 100: 8487.7
Training loss at iteration 120: 5807.8
Training loss at iteration 140: 4311.6
Training loss at iteration 160: 3435.3
Training loss at iteration 180: 2902.1
Training loss at iteration 200: 2566.6

Predictions

# Predict all ratings, then add the per-movie mean back
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()

pm = p + Ymean

# The new user's predictions are in column 0
my_predictions = pm[:, 0]

# Sort predictions from highest to lowest
ix = tf.argsort(my_predictions, direction='DESCENDING')

# Print top predictions, skipping movies the new user already rated
for i in range(17):
    j = ix[i]
    if j not in my_rated:
        print(
            f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')

print('\n\nOriginal vs Predicted ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(
            f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')
Predicting rating 4.49 for movie My Sassy Girl (Yeopgijeogin geunyeo) (2001)
Predicting rating 4.48 for movie Martin Lawrence Live: Runteldat (2002)
Predicting rating 4.48 for movie Memento (2000)
Predicting rating 4.47 for movie Delirium (2014)
Predicting rating 4.47 for movie Laggies (2014)
Predicting rating 4.47 for movie One I Love, The (2014)
Predicting rating 4.47 for movie Particle Fever (2013)
Predicting rating 4.45 for movie Eichmann (2007)
Predicting rating 4.45 for movie Battle Royale 2: Requiem (Batoru rowaiaru II: Chinkonka) (2003)
Predicting rating 4.45 for movie Into the Abyss (2011)


Original vs Predicted ratings:

Original 5.0, Predicted 4.90 for Shrek (2001)
Original 5.0, Predicted 4.84 for Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
Original 2.0, Predicted 2.13 for Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Original 5.0, Predicted 4.88 for Harry Potter and the Chamber of Secrets (2002)
Original 5.0, Predicted 4.87 for Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Original 5.0, Predicted 4.89 for Lord of the Rings: The Return of the King, The (2003)
Original 3.0, Predicted 3.00 for Eternal Sunshine of the Spotless Mind (2004)
Original 5.0, Predicted 4.90 for Incredibles, The (2004)
Original 2.0, Predicted 2.11 for Persuasion (2007)
Original 5.0, Predicted 4.80 for Toy Story 3 (2010)
Original 3.0, Predicted 3.00 for Inception (2010)
Original 1.0, Predicted 1.41 for Louis Theroux: Law & Disorder (2008)
Original 1.0, Predicted 1.26 for Nothing to Declare (Rien à déclarer) (2010)
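
The top of the raw ranking can include obscure titles rated by only a handful of users. A common refinement, sketched below, is to restrict the list to movies with a minimum number of ratings. This assumes movieList_df exposes a "number of ratings" column, as the MovieLens-derived CSV used here typically does; adjust the column name if it differs.

# Popularity filter: assumes movieList_df has a "number of ratings" column
popular = movieList_df["number of ratings"].to_numpy() > 20

print('\nTop recommendations among frequently rated movies:\n')
shown = 0
for j in np.argsort(-my_predictions):   # indices from best to worst prediction
    if popular[j] and j not in my_rated:
        print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')
        shown += 1
        if shown == 10:
            break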