The task is divided into 5 subtasks:
- Data Collection
- EDA
- Data Cleaning & Modelling
- Web Service
- Deployment
I used the `pushshift` API instead of `PRAW` for this task because it lets us scrape more, and older, submissions than `PRAW` does.
- Using `pushshift`, I scraped around 4 lakh (about 400,000) submissions posted between January 10, 2018 and April 10, 2020. More details are in the notebook.
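A minimal sketch of how a date-bounded scrape like this can be paginated. It assumes the historical public Pushshift submission-search endpoint and its `subreddit`, `after`, `before`, `size`, `sort`, and `sort_type` parameters, which may have changed since. The function names, the page size, and the `fetch` hook are illustrative and are not taken from the notebook.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Assumed endpoint of the public Pushshift API; availability may have changed.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def to_epoch(year, month, day):
    """Convert a UTC calendar date to a Unix timestamp."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_query(subreddit, after, before, size=100):
    """Build the query parameters for one page of submissions."""
    return {
        "subreddit": subreddit,
        "after": after,
        "before": before,
        "size": size,
        "sort": "asc",
        "sort_type": "created_utc",
    }


def fetch_page(params):
    """Fetch one page over HTTP and return the list of submissions."""
    url = PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["data"]


def scrape(subreddit, after, before, fetch=fetch_page):
    """Yield submissions oldest-first, moving `after` past each page."""
    while True:
        page = fetch(build_query(subreddit, after, before))
        if not page:
            break
        yield from page
        # Continue from the timestamp of the last submission seen.
        after = page[-1]["created_utc"]
```

With these helpers, the date range above would correspond to `scrape(subreddit, to_epoch(2018, 1, 10), to_epoch(2020, 4, 10))`. Passing a different `fetch` function makes the pagination easy to try without network access.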
This includes:
- Length analysis
- Time series analysis
- Selecting appropriate flairs out of the 214 collected flairs on the basis of their count and recent trend
- ngram analysis
- Viral posts
- Mods of subreddit
- WordClouds
- Class distribution
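Two of the steps above, n-gram analysis and count-based flair selection, can be sketched with the standard library alone. The tokenizer, the function names, and the `min_count` threshold are assumptions made for illustration; the notebook may do this differently, and the "recent trend" criterion is not covered here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(texts, n=2):
    """Count n-grams (as tuples of tokens) over an iterable of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def frequent_flairs(flairs, min_count):
    """Return the flairs seen at least `min_count` times, most common first."""
    counts = Counter(flairs)
    return [flair for flair, count in counts.most_common() if count >= min_count]
```

Calling `ngram_counts(titles, n=2).most_common(20)` would then list the twenty most frequent bigrams in a collection of titles.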
- Cleaned the text by handling contractions, punctuation, URLs, HTML, and emojis.
- Initially I thought of fine-tuning the BERT model, but later realised it's not possible on my GPU, so I'll train it on the cloud later
- Implemented LSTM model w/o pretrained embeddings: test classification report in notebook (used in webservice)
- Implemented LSTM model with pretrained embeddings
- Implemented LSTM model using only `titles`
- Implemented BERT (training in progress)
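The cleaning step in the first bullet could look roughly like the sketch below. The contraction map is a tiny illustrative sample, the emoji ranges are approximate, and the order of the steps is an assumption; none of it is taken from the notebook.

```python
import html
import re
import string

# Small illustrative contraction map; a real one would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Approximate emoji ranges: flags, pictographs and emoticons, misc symbols.
EMOJI_RE = re.compile("[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]")
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Normalise raw submission text before tokenisation."""
    text = html.unescape(text).lower()      # decode entities such as &amp;
    text = text.replace("\u2019", "'")      # curly apostrophe to straight
    text = URL_RE.sub(" ", text)            # drop URLs
    text = TAG_RE.sub(" ", text)            # drop HTML tags
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    text = EMOJI_RE.sub(" ", text)          # drop emojis
    text = text.translate(PUNCT_TABLE)      # punctuation to spaces
    return " ".join(text.split())           # collapse whitespace
```

URLs are removed before punctuation so that a link disappears as a whole rather than being broken into word-like fragments.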
Test accuracy (LSTM w/o pretrained embeddings)
Loss (LSTM w/o pretrained embeddings)
Classification Report (LSTM w/o pretrained embeddings)
Test accuracy (LSTM with pretrained embeddings)
Loss (LSTM with pretrained embeddings)
Classification Report (LSTM with pretrained embeddings)
Test accuracy (LSTM with only Title)
Loss (LSTM with only Title)
Classification Report (LSTM with only Title)
BERT finetune (in progress)
A web service that predicts the flair of a submission is built with Flask. It uses the saved model and weights to make the predictions and consists of 2 endpoints:
- `/`: visiting this page renders an HTML page where the user can post the link of a submission; on submitting, the predicted flair is displayed. Working: using the link entered, the backend searches for the post with the same URL using `PRAW`. After getting the submission, we take the `selftext` and `title` from it and use the concatenated text for the prediction.
- `/automated_testing`: this returns the predicted flairs for the links given in a txt file. The file is uploaded like this:

      files = {'upload_file': open('file.txt', 'rb')}
      r = requests.post(url, files=files)

  The output is in JSON format.
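The logic behind the two endpoints can be sketched independently of Flask. The sketch assumes `reddit` is a `praw.Reddit` instance (PRAW resolves a post from its link via `reddit.submission(url=...)`) and that `predict` is a hypothetical callable wrapping the saved model. The title-then-selftext order, the one-link-per-line file layout, and the link-to-flair JSON shape are also assumptions, not details confirmed by the project.

```python
import json


def submission_text(reddit, url):
    """Look up a submission by URL and join its title and selftext."""
    submission = reddit.submission(url=url)
    return f"{submission.title} {submission.selftext}".strip()


def predict_links(file_bytes, reddit, predict):
    """Map every non-empty line (assumed: one link per line) to a flair."""
    links = [line.strip() for line in file_bytes.decode("utf-8").splitlines()]
    return {
        link: predict(submission_text(reddit, link))
        for link in links
        if link
    }


def automated_testing_response(file_bytes, reddit, predict):
    """Serialise the predictions as a JSON string (assumed response shape)."""
    return json.dumps(predict_links(file_bytes, reddit, predict))
```

In a Flask view, `file_bytes` would come from reading the uploaded `upload_file` field of the request, and the returned string would be sent back as the response body.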
- Open the `pred_app` folder, install the dependencies in `requirements.txt` using `pip install -r requirements.txt`, and run `server.py`