us241098/Reddit-Flair

The task is divided into 5 subtasks:

  • Data Collection
  • EDA
  • Data Cleaning & Modelling
  • Web Service
  • Deployment

Data Collection

I have used the Pushshift API instead of PRAW for this task, since it can scrape many more (and older) submissions than PRAW. Link

  • Using Pushshift, I scraped roughly 4 lakh (~400,000) submissions posted between January 10, 2018 and April 10, 2020
  • More details are in the notebook; a minimal scraping sketch is shown below
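A minimal sketch of the kind of paginated Pushshift scraping described above, assuming the public submission-search endpoint; the subreddit name and field list are illustrative placeholders, not necessarily what the notebook uses.

```python
import time
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"
SUBREDDIT = "india"       # placeholder subreddit, not confirmed by this README
AFTER = 1515542400        # 10 Jan 2018 00:00 UTC
BEFORE = 1586476800       # 10 Apr 2020 00:00 UTC

def fetch_submissions(after, before, size=500):
    """Yield submissions in chronological order, paging on created_utc."""
    while after < before:
        params = {
            "subreddit": SUBREDDIT,
            "after": after,
            "before": before,
            "size": size,
            "sort": "asc",
            "sort_type": "created_utc",
        }
        data = requests.get(PUSHSHIFT_URL, params=params).json().get("data", [])
        if not data:
            break
        yield from data
        after = data[-1]["created_utc"]  # resume after the last returned post
        time.sleep(1)                    # stay well under the rate limit

rows = [{"id": s.get("id"),
         "title": s.get("title", ""),
         "selftext": s.get("selftext", ""),
         "flair": s.get("link_flair_text")}
        for s in fetch_submissions(AFTER, BEFORE)]
```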

EDA

This includes:

  • Length analysis
  • Time series analysis
  • Selecting appropriate flairs out of the 214 collected flairs, based on their counts and recent trends
  • n-gram analysis (a small counting sketch follows this list)
  • Viral posts
  • Mods of subreddit
  • WordClouds
  • Class distribution
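As one example of the EDA steps above, a small n-gram counting sketch using scikit-learn's CountVectorizer; the DataFrame and its title/flair columns are assumptions about the scraped data, not code from the notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range=(2, 2), k=20):
    """Return the k most frequent n-grams across a list of documents."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1
    return sorted(zip(vec.get_feature_names_out(), counts),
                  key=lambda pair: -pair[1])[:k]

# e.g. most common bigrams per flair, assuming a DataFrame `df`
# with 'title' and 'flair' columns:
# for flair, group in df.groupby("flair"):
#     print(flair, top_ngrams(group["title"].tolist()))
```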

Data Cleaning and Modelling

  • Cleaned the text by handling contractions, punctuation, URLs, HTML tags, and emojis (see the cleaning sketch after this list)
  • Initially I planned to fine-tune a BERT model, but realised it was not feasible on my GPU, so I will train it on the cloud later
  • Implemented an LSTM model without pretrained embeddings: test classification report in the notebook (this is the model used in the web service)
  • Implemented an LSTM model with pretrained embeddings
  • Implemented an LSTM model using only titles
  • Implemented BERT (training in progress)
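The cleaning step above could look roughly like the sketch below; the contraction map and regexes are illustrative, not the exact rules used in the notebook.

```python
import re
from html import unescape

# Illustrative contraction map -- a real pipeline would use a fuller dictionary.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is",
                "don't": "do not", "i'm": "i am"}

URL_RE   = re.compile(r"https?://\S+|www\.\S+")
HTML_RE  = re.compile(r"<.*?>")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]")
PUNCT_RE = re.compile(r"[^\w\s]")

def clean_text(text: str) -> str:
    """Lowercase, expand contractions, and strip URLs, HTML, emojis, punctuation."""
    text = unescape(text).lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    for pattern in (URL_RE, HTML_RE, EMOJI_RE, PUNCT_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

And one possible shape for the LSTM classifiers listed above, written in Keras; the vocabulary size, sequence length, layer sizes, and number of classes are assumed values, not the ones used in this repository.

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 40000, 250, 128, 10  # assumed values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# For the pretrained-embedding variant, pass weights=[embedding_matrix] and
# trainable=False to the Embedding layer instead.
```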

Result plots (test accuracy, loss, and classification report) for each model:

  • LSTM w/o pretrained embeddings
  • LSTM with pretrained embeddings
  • LSTM with only titles
  • BERT fine-tuning (in progress)

Web Service

A web service to predict the flair of a submission is built with Flask; it loads the saved model and weights to make the predictions. It consists of two endpoints:

  • /: renders an HTML page where the user can paste the link of a submission; on submitting, the predicted flair is displayed. How it works: using the entered link, the backend looks up the post with that URL via PRAW, extracts the selftext and title from the submission, and uses the concatenated text for the prediction (a minimal sketch of the endpoints follows this list).

  • /automated_testing: returns the predicted flairs for the links listed in a .txt file. The file is uploaded like this: `files = {'upload_file': open('file.txt', 'rb')}; r = requests.post(url, files=files)`

  • Output is in the JSON format

  • To run locally: open the pred_app folder, install the dependencies with pip install -r requirements.txt, and run server.py
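A minimal sketch of the two endpoints described above; the PRAW credentials, template name, and the predict_flair helper are placeholders, not the actual code in pred_app/server.py.

```python
from flask import Flask, request, render_template, jsonify
import praw

app = Flask(__name__)
reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="flair-app")

def predict_flair(text: str) -> str:
    """Placeholder: clean `text` and run the saved LSTM model on it."""
    raise NotImplementedError

def flair_from_url(url: str) -> str:
    submission = reddit.submission(url=url)          # look up the post by its URL
    return predict_flair(submission.title + " " + submission.selftext)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        return render_template("index.html",
                               flair=flair_from_url(request.form["link"]))
    return render_template("index.html")

@app.route("/automated_testing", methods=["POST"])
def automated_testing():
    links = request.files["upload_file"].read().decode().splitlines()
    return jsonify({link: flair_from_url(link) for link in links if link.strip()})

if __name__ == "__main__":
    app.run()
```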

Deployment

The aforementioned service is deployed on Heroku too.
