
The task is divided into 5 subtasks:

  • Data Collection
  • EDA
  • Data Cleaning & Modelling
  • Web Service
  • Deployment

Data Collection

I used the Pushshift API instead of PRAW for this task, since Pushshift can retrieve more (and older) submissions than PRAW. Link

  • Using Pushshift, I scraped ~400,000 (4 lakh) submissions posted between January 10, 2018 and April 10, 2020 (a minimal fetching sketch follows this list)
  • More details in the notebook
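As a rough illustration, here is a minimal sketch of paginating through Pushshift's submission search endpoint with requests. The subreddit name, the delay, and the page size are assumptions for illustration; the actual query parameters are in the notebook.

```python
import time
from datetime import datetime, timezone

import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_submissions(subreddit, after, before, size=100):
    """Page backwards through Pushshift results using the created_utc cursor."""
    results = []
    while True:
        params = {
            "subreddit": subreddit,
            "after": after,
            "before": before,
            "size": size,
            "sort": "desc",
            "sort_type": "created_utc",
        }
        batch = requests.get(PUSHSHIFT_URL, params=params).json().get("data", [])
        if not batch:
            break
        results.extend(batch)
        before = batch[-1]["created_utc"]  # resume from the oldest item fetched
        time.sleep(1)  # stay polite to the API
    return results

# January 10, 2018 to April 10, 2020, as epoch seconds
after = int(datetime(2018, 1, 10, tzinfo=timezone.utc).timestamp())
before = int(datetime(2020, 4, 10, tzinfo=timezone.utc).timestamp())
submissions = fetch_submissions("india", after, before)  # "india" is a placeholder subreddit
```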

EDA

This includes:

  • Length analysis
  • Time series analysis
  • Selecting appropriate flairs out of the 214 collected flairs on the basis of their counts and recent trends (a selection sketch follows this list)
  • n-gram analysis
  • Viral posts
  • Moderators of the subreddit
  • WordClouds
  • Class distribution
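A minimal sketch of that flair selection step with pandas; the file name, column names, and thresholds here are assumptions for illustration, not the values used in the notebook.

```python
import pandas as pd

df = pd.read_csv("submissions.csv")  # hypothetical dump of the scraped data

# Overall popularity of each flair
counts = df["link_flair_text"].value_counts()

# Popularity within the most recent quarter of the data, as a proxy for trend
recent = df[df["created_utc"] >= df["created_utc"].quantile(0.75)]
recent_counts = recent["link_flair_text"].value_counts()

# Keep flairs that are common overall AND still in active use (thresholds are made up)
keep = counts[(counts >= 1000) & (recent_counts.reindex(counts.index, fill_value=0) >= 100)].index
df = df[df["link_flair_text"].isin(keep)]
```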

Data Cleaning and Modelling

  • Cleaned the text by handling contractions, punctuation, URLs, HTML tags, emojis, etc. (a minimal cleaning sketch follows this list)
  • Initially I planned to fine-tune a BERT model, but realised that was not feasible on my GPU, so I will train it on the cloud later
  • Implemented an LSTM model without pretrained embeddings: test classification report in the notebook (this is the model used in the web service; an architecture sketch follows this list)
  • Implemented an LSTM model with pretrained embeddings
  • Implemented an LSTM model using only titles
  • Implemented BERT (training in progress)
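A minimal sketch of the kind of cleaning described above, assuming the contractions package; the exact rules and their order in the notebook may differ.

```python
import re
import contractions  # pip install contractions

def clean_text(text):
    text = contractions.fix(text)                       # "don't" -> "do not"
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = text.encode("ascii", "ignore").decode()      # drop emojis / non-ASCII
    text = re.sub(r"[^\w\s]", " ", text)                # strip punctuation
    return re.sub(r"\s+", " ", text).strip().lower()
```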
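And a sketch of a plausible Keras LSTM classifier without pretrained embeddings; the vocabulary size, sequence length, layer sizes, and number of classes are illustrative guesses, not the hyperparameters from the notebook.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

VOCAB_SIZE = 20000  # illustrative
MAX_LEN = 300       # illustrative
NUM_FLAIRS = 11     # illustrative; the selected subset of the 214 flairs

model = Sequential([
    Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),  # embeddings learned from scratch
    LSTM(128),
    Dropout(0.3),
    Dense(NUM_FLAIRS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```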

Plots for each model:

  • LSTM w/o pretrained embeddings: test accuracy, loss, and classification report
  • LSTM with pretrained embeddings: test accuracy, loss, and classification report
  • LSTM with only titles: test accuracy, loss, and classification report

BERT fine-tuning (in progress)

Web Service

A web service for predicting the flair of a submission is built with Flask; it uses the saved model and weights to make predictions. It consists of 2 endpoints:

  • /: renders an HTML page where the user can paste the link of a submission; on submitting, the predicted flair is displayed. Working: using the entered link, the backend fetches the post with the same URL via PRAW, extracts the selftext and title from the submission, and uses the concatenated text for the prediction.

  • /automated_testing: returns the predicted flairs for the links given in a txt file. The file is uploaded like this:

    ```python
    files = {'upload_file': open('file.txt', 'rb')}
    r = requests.post(url, files=files)
    ```

  • Output is in the JSON format

  • Open the pred_app folder, install the dependencies with pip install -r requirements.txt, and run server.py (a minimal endpoint sketch follows this list)
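A minimal sketch of the two endpoints, assuming a predict_flair(text) helper that wraps the saved LSTM model, an index.html template, and placeholder PRAW credentials; the real server.py may be organised differently.

```python
from flask import Flask, request, render_template, jsonify
import praw

app = Flask(__name__)
reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="flair-predictor")

def predict_flair(text):
    """Placeholder: the real app tokenizes `text` and runs the saved LSTM model."""
    raise NotImplementedError

def get_text(url):
    """Fetch a submission by URL via PRAW and concatenate its title and selftext."""
    submission = reddit.submission(url=url)
    return submission.title + " " + submission.selftext

@app.route("/", methods=["GET", "POST"])
def index():
    flair = None
    if request.method == "POST":
        flair = predict_flair(get_text(request.form["link"]))
    return render_template("index.html", flair=flair)

@app.route("/automated_testing", methods=["POST"])
def automated_testing():
    links = request.files["upload_file"].read().decode().splitlines()
    return jsonify({link: predict_flair(get_text(link)) for link in links if link.strip()})
```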

Deployment

The service described above is also deployed on Heroku.
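For reference, a Heroku deployment of a Flask app typically needs a Procfile along these lines; that server.py exposes the app object as app, and that gunicorn is listed in requirements.txt, are my assumptions.

```
web: gunicorn server:app
```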