- Victor De Lima (vad49)
- Matt Moriarty (mdm341)
In the study, we use Natural Language Processing (NLP) to determine whether we can accurately classify Reddit posts into their respective subreddit by analyzing the language contained in the post's text. We conducted the research by employing Term Frequency methods and supervised learning algorithms. Our findings show that the language individuals use when participating in discussions within a community context provides sufficient information for models to make excellent predictions. We also offer an overview of the details related to model construction and explore the implications of the results.
Our repository is organized with the following structure:
Code/
: This folder contains all code associated with our analysis, including data collection, cleaning, and modelingData/
: This folder contains the data associated with our analysis, excluding the raw Reddit data, which totals 300 MBOutput/
: This folder contains images associated with the results of our analysis, grouped by the type of model usedPoster/
: This folder contains the poster associated with our analysisReport/
: This folder contains all files required for rendering our report in Quarto