# Quickstart
In this example, we'll build an implicit feedback recommender using the MovieLens 100k dataset (http://grouplens.org/datasets/movielens/100k/).

The code behind this example is available as a [Jupyter notebook](https://github.com/lyst/lightfm/tree/master/examples/quickstart/quickstart.ipynb).

LightFM includes functions for getting and processing this dataset, so obtaining it is quite easy.

In [1]:
import numpy as np

from lightfm.datasets import fetch_movielens

data = fetch_movielens(min_rating=5.0)



This downloads the dataset and automatically pre-processes it into sparse matrices suitable for further calculation. In particular, it prepares sparse user-item matrices containing positive entries where a user interacted with a product and zeros otherwise. Because we pass `min_rating=5.0`, only ratings of 5 are kept as interactions.

We have two such matrices, a training set and a test set. Both cover the same 943 users and 1682 items. We'll fit the model on the training matrix and evaluate it on the test matrix.

In [2]:
print(repr(data['train']))
print(repr(data['test']))

<943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 19048 stored elements in COOrdinate format>
<943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 2153 stored elements in COOrdinate format>
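
The "COOrdinate format" in the output above means that only the nonzero entries are stored, as (row, column, value) triplets. The training matrix holds 19048 stored entries out of 943 x 1682 cells, or about 1.2% of them. As a rough illustration, the plain-Python sketch below expands a toy set of triplets into a dense matrix. The data and the helper name `coo_to_dense` are made up for this example and are not part of LightFM or SciPy.

```python
# Toy interaction data in coordinate (COO) form: three parallel lists.
# The values are invented for illustration and are not taken from MovieLens.
rows = [0, 0, 2]  # user index of each stored entry
cols = [1, 3, 0]  # item index of each stored entry
vals = [1, 1, 1]  # interaction value of each stored entry


def coo_to_dense(rows, cols, vals, shape):
    """Expand COO triplets into a dense list-of-lists matrix (hypothetical helper)."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in zip(rows, cols, vals):
        dense[r][c] += v  # entries sharing a coordinate are summed
    return dense


dense = coo_to_dense(rows, cols, vals, shape=(3, 4))
density = len(vals) / (3 * 4)  # fraction of cells that are actually stored
```

In practice you would keep the matrices sparse; densifying is only sensible for toy examples like this one.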


We need to import the model class to fit the model:

In [3]:
from lightfm import LightFM

We're going to use the WARP (Weighted Approximate-Rank Pairwise) model. WARP is an implicit feedback model: all interactions in the training matrix are treated as positive signals, and products that users did not interact with are treated as ones they implicitly do not like. The goal of the model is to score these implicit positives highly while assigning low scores to implicit negatives.
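
To give a feel for how WARP turns this goal into training signal, here is a minimal sketch of its negative-sampling step for a single user and a single positive item. It is an illustration under stated assumptions, not LightFM's implementation: the function name `warp_sample`, the margin of 1, the trial cap, and the harmonic-sum weighting are all choices made for this example, and a real implementation would also exclude every known positive from the candidate negatives.

```python
import random


def warp_sample(scores, positive, rng, margin=1.0, max_trials=50):
    """Sample a margin-violating negative for one (user, positive item) pair.

    Returns (negative_item, weight), or None if no violator is found
    within max_trials draws. Hypothetical helper for illustration only.
    """
    n_items = len(scores)
    # Simplification: every item other than `positive` counts as a negative.
    candidates = [i for i in range(n_items) if i != positive]
    for trials in range(1, max_trials + 1):
        negative = rng.choice(candidates)
        # Violation: the negative scores within `margin` of the positive, or above it.
        if scores[negative] > scores[positive] - margin:
            # Finding a violator quickly suggests the positive is ranked low,
            # so the estimated rank (and therefore the update weight) is larger.
            rank_estimate = max(1, (n_items - 1) // trials)
            weight = sum(1.0 / j for j in range(1, rank_estimate + 1))
            return negative, weight
    return None


# Item 3 outscores the positive item 0, so it is the only possible violator.
result = warp_sample([2.0, 0.0, 0.0, 5.0], positive=0, rng=random.Random(0))
```

When no violator turns up, the positive item is already ranked well for this user and the pair contributes no update.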

Model training is accomplished via SGD (stochastic gradient descent). With every pass through the data (an epoch), the model fits the training data more closely. We'll run it for 30 epochs in this example. Training can also run on multiple cores, so we'll set `num_threads` to 2. (The dataset in this example is too small for that to make a difference, but it will matter on bigger datasets.)
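
The effect of running several epochs can be seen on a much simpler problem. The sketch below fits a single weight with per-example SGD on a squared-error loss and records the loss after each pass. It is a toy stand-in, not the WARP update LightFM performs, and the function name, data, and learning rate are invented for this example.

```python
def sgd_epochs(xs, ys, epochs=5, learning_rate=0.05):
    """Fit y = w * x with per-example SGD; return w and the loss after each epoch."""
    w = 0.0
    losses = []
    for _ in range(epochs):
        # One epoch: a full pass over the data, updating after every example.
        for x, y in zip(xs, ys):
            error = w * x - y
            w -= learning_rate * 2 * error * x  # gradient of (w * x - y) ** 2
        mean_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        losses.append(mean_loss)
    return w, losses


xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]  # generated by y = 3 * x
w, losses = sgd_epochs(xs, ys)
```

Here `losses` shrinks from one epoch to the next as `w` approaches 3. On real data the training loss behaves similarly, while performance on held-out data can stop improving, which is why we evaluate on a separate test set.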

In [4]:
model = LightFM(loss='warp')
%time model.fit(data['train'], epochs=30, num_threads=2)

CPU times: user 441 ms, sys: 2.66 ms, total: 443 ms
Wall time: 446 ms


<lightfm.lightfm.LightFM at 0x114e13910>

Done! We should now evaluate the model to see how well it's doing. We're most interested in how good the ranking produced by the model is. Precision@k is one suitable metric: it expresses the fraction of the top k items in the ranking that the user has actually interacted with. `lightfm` implements a number of metrics in the `evaluation` module.
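
To make the metric concrete, here is a minimal plain-Python sketch of precision@k for a single user. It is independent of LightFM's own implementation. The function name `precision_at_k_for_user` and the item ids are hypothetical.

```python
def precision_at_k_for_user(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that appear in relevant_items."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


ranked = [10, 20, 30, 40, 50]  # item ids, highest score first
relevant = {20, 50, 99}        # items the user actually interacted with

p_at_5 = precision_at_k_for_user(ranked, relevant, k=5)  # 2 hits out of 5
p_at_2 = precision_at_k_for_user(ranked, relevant, k=2)  # 1 hit out of 2
```

A score for the whole model is then obtained by averaging these per-user values.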

In [5]:
from lightfm.evaluation import precision_at_k

We'll measure precision in both the train and the test set.

In [6]:
print("Train precision: %.2f" % precision_at_k(model, data['train'], k=5).mean())
print("Test precision: %.2f" % precision_at_k(model, data['test'], k=5).mean())

Unsurprisingly, the model fits the train set better than the test set.

For an alternative way of judging the model, we can sample a couple of users and get their recommendations. To make predictions for a given user, we pass the id of that user and the ids of all products we want predictions for into the `predict` method.

In [7]:
def sample_recommendation(model, data, user_ids):
    """Print known positives and top recommendations for each user."""

    n_users, n_items = data['train'].shape

    for user_id in user_ids:
        known_positives = data['item_labels'][data['train'].tocsr()[user_id].indices]
        
        scores = model.predict(user_id, np.arange(n_items))
        top_items = data['item_labels'][np.argsort(-scores)]
        
        print("User %s" % user_id)
        print("     Known positives:")
        
        for x in known_positives[:3]:
            print("        %s" % x)

        print("     Recommended:")
        
        for x in top_items[:3]:
            print("        %s" % x)
        
# sample_recommendation(model, data, [3, 25, 450]) 

In [8]:
import pandas as pd
import datetime

def all_recommendations(model, data, k=10):
    """Return a DataFrame with the top-k recommended item labels per user."""
    n_users, n_items = data['train'].shape
    rec_list = []
    for user_id in np.arange(n_users):
        # Score every item for this user, then keep the k highest-scoring labels.
        scores = model.predict(user_id, np.arange(n_items))
        top_k_items = data['item_labels'][np.argsort(-scores)][:k]
        rec_list.append((user_id, top_k_items))
    df = pd.DataFrame(data=rec_list, columns=['uid', 'rec'])
    # Stamp every row with the time the predictions were generated.
    df['pred_time'] = str(datetime.datetime.now())
    return df


df = all_recommendations(model, data)

In [9]:
df.head()

| | uid | rec | pred_time |
|---|---|---|---|
| 0 | 0 | [Fargo (1996), Star Wars (1977), Close Shave, ... | 2020-09-23 19:41:17.939803 |
| 1 | 1 | [Dead Man Walking (1995), Contact (1997), Good... | 2020-09-23 19:41:17.939803 |
| 2 | 2 | [Chasing Amy (1997), Fargo (1996), English Pat... | 2020-09-23 19:41:17.939803 |
| 3 | 3 | [G.I. Jane (1997), Evita (1996), Contact (1997... | 2020-09-23 19:41:17.939803 |
| 4 | 4 | [Blade Runner (1982), Casablanca (1942), Raide... | 2020-09-23 19:41:17.939803 |


In [10]:
# Send predictions to BigQuery
#this assumes the GOOGLE_
from google.oauth2 import service_account
import pandas_gbq

table_id = "movie_recommendation_service.predicted_movies"
project_id = "authentic-realm-276822"
# Load service account credentials from a local key file.
credentials = service_account.Credentials.from_service_account_file('model-user.json')
# if_exists='replace' overwrites the table if it already exists.
pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists='replace', credentials=credentials)

1it [00:04,  4.20s/it]
