
The task is divided into 5 subtasks:

  • Data Collection
  • EDA
  • Data Cleaning & Modelling
  • Web Service
  • Deployment

Data Collection

I used the Pushshift API instead of PRAW for this task, since Pushshift can retrieve more (and older) submissions than PRAW. Link

  • Using Pushshift, I scraped ~400,000 (4 lakh) submissions posted between January 10, 2018 and April 10, 2020 (a minimal fetching sketch follows this list)
  • More details in the notebook
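As a rough illustration, here is a minimal sketch of paginating through Pushshift's submission search endpoint with requests. The subreddit name, the delay, and the page size are assumptions for illustration; the actual query parameters are in the notebook.

```python
import time
from datetime import datetime, timezone

import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_submissions(subreddit, after, before, size=100):
    """Page backwards through Pushshift results using the created_utc cursor."""
    results = []
    while True:
        params = {
            "subreddit": subreddit,
            "after": after,
            "before": before,
            "size": size,
            "sort": "desc",
            "sort_type": "created_utc",
        }
        batch = requests.get(PUSHSHIFT_URL, params=params).json().get("data", [])
        if not batch:
            break
        results.extend(batch)
        before = batch[-1]["created_utc"]  # resume from the oldest item fetched
        time.sleep(1)  # stay polite to the API
    return results

# January 10, 2018 to April 10, 2020, as epoch seconds
after = int(datetime(2018, 1, 10, tzinfo=timezone.utc).timestamp())
before = int(datetime(2020, 4, 10, tzinfo=timezone.utc).timestamp())
submissions = fetch_submissions("india", after, before)  # "india" is a placeholder subreddit
```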

EDA

This includes:

  • Length analysis
  • Time series analysis
  • Selecting appropriate flairs out of the 214 collected flairs on the basis of their counts and recent trends (a selection sketch follows this list)
  • n-gram analysis
  • Viral posts
  • Moderators of the subreddit
  • WordClouds
  • Class distribution
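A minimal sketch of that flair selection step with pandas; the file name, column names, and thresholds here are assumptions for illustration, not the values used in the notebook.

```python
import pandas as pd

df = pd.read_csv("submissions.csv")  # hypothetical dump of the scraped data

# Overall popularity of each flair
counts = df["link_flair_text"].value_counts()

# Popularity within the most recent quarter of the data, as a proxy for trend
recent = df[df["created_utc"] >= df["created_utc"].quantile(0.75)]
recent_counts = recent["link_flair_text"].value_counts()

# Keep flairs that are common overall AND still in active use (thresholds are made up)
keep = counts[(counts >= 1000) & (recent_counts.reindex(counts.index, fill_value=0) >= 100)].index
df = df[df["link_flair_text"].isin(keep)]
```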

Data Cleaning and Modelling

  • Cleaned the text by handling contractions, punctuation, URLs, HTML tags, emojis, etc. (a minimal cleaning sketch follows this list)
  • Initially I planned to fine-tune a BERT model, but realised that was not feasible on my GPU, so I will train it on the cloud later
  • Implemented an LSTM model without pretrained embeddings: test classification report in the notebook (this is the model used in the web service; an architecture sketch follows this list)
  • Implemented an LSTM model with pretrained embeddings
  • Implemented an LSTM model using only titles
  • Implemented BERT (training in progress)
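A minimal sketch of the kind of cleaning described above, assuming the contractions package; the exact rules and their order in the notebook may differ.

```python
import re
import contractions  # pip install contractions

def clean_text(text):
    text = contractions.fix(text)                       # "don't" -> "do not"
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = text.encode("ascii", "ignore").decode()      # drop emojis / non-ASCII
    text = re.sub(r"[^\w\s]", " ", text)                # strip punctuation
    return re.sub(r"\s+", " ", text).strip().lower()
```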
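And a sketch of a plausible Keras LSTM classifier without pretrained embeddings; the vocabulary size, sequence length, layer sizes, and number of classes are illustrative guesses, not the hyperparameters from the notebook.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

VOCAB_SIZE = 20000  # illustrative
MAX_LEN = 300       # illustrative
NUM_FLAIRS = 11     # illustrative; the selected subset of the 214 flairs

model = Sequential([
    Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),  # embeddings learned from scratch
    LSTM(128),
    Dropout(0.3),
    Dense(NUM_FLAIRS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```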

Plots for each model:

  • LSTM w/o pretrained embeddings: test accuracy, loss, and classification report
  • LSTM with pretrained embeddings: test accuracy, loss, and classification report
  • LSTM with only titles: test accuracy, loss, and classification report

BERT fine-tuning (in progress)

Web Service

A web service for predicting the flair of a submission is built with Flask; it uses the saved model and weights to make predictions. It consists of 2 endpoints:

  • /: renders an HTML page where the user can paste the link of a submission; on submitting, the predicted flair is displayed. Working: using the entered link, the backend fetches the post with the same URL via PRAW, extracts the selftext and title from the submission, and uses the concatenated text for the prediction.

  • /automated_testing: returns the predicted flairs for the links given in a txt file. The file is uploaded like this:

    ```python
    files = {'upload_file': open('file.txt', 'rb')}
    r = requests.post(url, files=files)
    ```

  • Output is in the JSON format

  • Open the pred_app folder, install the dependencies with pip install -r requirements.txt, and run server.py (a minimal endpoint sketch follows this list)
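A minimal sketch of the two endpoints, assuming a predict_flair(text) helper that wraps the saved LSTM model, an index.html template, and placeholder PRAW credentials; the real server.py may be organised differently.

```python
from flask import Flask, request, render_template, jsonify
import praw

app = Flask(__name__)
reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="flair-predictor")

def predict_flair(text):
    """Placeholder: the real app tokenizes `text` and runs the saved LSTM model."""
    raise NotImplementedError

def get_text(url):
    """Fetch a submission by URL via PRAW and concatenate its title and selftext."""
    submission = reddit.submission(url=url)
    return submission.title + " " + submission.selftext

@app.route("/", methods=["GET", "POST"])
def index():
    flair = None
    if request.method == "POST":
        flair = predict_flair(get_text(request.form["link"]))
    return render_template("index.html", flair=flair)

@app.route("/automated_testing", methods=["POST"])
def automated_testing():
    links = request.files["upload_file"].read().decode().splitlines()
    return jsonify({link: predict_flair(get_text(link)) for link in links if link.strip()})
```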

Deployment

The service described above is also deployed on Heroku.
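For reference, a Heroku deployment of a Flask app typically needs a Procfile along these lines; that server.py exposes the app object as app, and that gunicorn is listed in requirements.txt, are my assumptions.

```
web: gunicorn server:app
```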