Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, and then cleaned it. I documented my wrangling efforts in a Jupyter Notebook and showcased them through analyses and visualizations using Python and its libraries.
The dataset I wrangled (and analyzed and visualized) was the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.
WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for us to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.
This project was performed following the below process:
- Gathering Data
- Assessing Data
- Cleaning Data
- Storing Data
- Analyzing and Visualizing
- Reporting
I gathered data from three different sources and formats: a CSV file, a TSV file, and tweet data queried from the Twitter API.
- The WeRateDogs Twitter archive: a CSV file.
- The tweet image predictions: this file (image_predictions.tsv) contains the top image prediction for each tweet, produced by a neural network. It is hosted on Udacity's servers, and I downloaded it programmatically with the Requests library using the following URL: link
- Data from the Twitter API: each tweet's retweet count and favorite ("like") count, at a minimum. Using the tweet IDs in the WeRateDogs Twitter archive, I queried the Twitter API for each tweet's JSON data with Python's Tweepy library (see the sketch below).
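A minimal sketch of the three gathering steps. The file names, credentials, and URL are placeholders (the actual predictions URL is the one linked above), and the Tweepy calls follow the 3.x API:

```python
import json

import pandas as pd
import requests
import tweepy

# Source 1: the WeRateDogs Twitter archive CSV (file name is an assumption).
archive_df = pd.read_csv('twitter_archive_enhanced.csv')

# Source 2: the image predictions TSV, downloaded with Requests.
PREDICTIONS_URL = 'https://example.com/image_predictions.tsv'  # placeholder URL
response = requests.get(PREDICTIONS_URL)
with open('image_predictions.tsv', 'wb') as f:
    f.write(response.content)
image_df = pd.read_csv('image_predictions.tsv', sep='\t')

# Source 3: each tweet's JSON data via the Twitter API using Tweepy
# (Tweepy 3.x style; credentials are placeholders).
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

with open('tweet_json.txt', 'w') as f:
    for tweet_id in archive_df.tweet_id:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, f)
            f.write('\n')
        except tweepy.TweepError:
            pass  # skip deleted or protected tweets

# Read the stored JSON back, keeping only the fields needed here.
rows = []
with open('tweet_json.txt') as f:
    for line in f:
        t = json.loads(line)
        rows.append({'tweet_id': t['id'],
                     'retweet_count': t['retweet_count'],
                     'favorite_count': t['favorite_count']})
tweet_data_df = pd.DataFrame(rows)
```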
I visually and programmatically assessed the dataset and detected the following quality and tidiness issues (a few of the programmatic checks are sketched after the list):
- timestamp and retweeted_status_timestamp should have the datetime datatype
- tweet_id, retweeted_status_id, and retweeted_status_user_id should be object (string) rather than float, to avoid numeric operations being performed on identifiers
- floofer, pupper, puppo, and doggo have many missing values, indicated as the string "None" instead of NaN
- name contains invalid values such as "a", "not", "all", "by", "the", and "my"
- Some tweets have more than one dog stage, e.g. doggo and pupper (12), doggo and floofer (1), doggo and puppo (1)
- Duplicated values in expanded_url
- tweet_id and create_date have incorrect datatypes
- Duplicated values in jpg_url
- doggo, puppo, pupper, and floofer should be combined into one "stage" column (a tidiness issue)
- source wraps the device name (e.g. "Twitter for iPhone") in unnecessary HTML tags
- in_reply_to_status_id and in_reply_to_user_id identify replies, not original tweets
- Only original tweets are needed, so all retweet columns should be dropped, i.e. retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
- create_date already exists in the Twitter archive as timestamp, so create_date should be dropped
- Ensure that all rows contain at least one true prediction
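A handful of the programmatic checks behind the issues above, as a sketch; the DataFrame names follow the gathering sketch:

```python
archive_df.info()  # reveals timestamp stored as object and the ID columns as float

# Invalid "names" such as "a", "the", and "my" surface near the top here
print(archive_df['name'].value_counts().head(20))

# Count tweets tagged with more than one dog stage
stages = ['doggo', 'floofer', 'pupper', 'puppo']
print(((archive_df[stages] != 'None').sum(axis=1) > 1).sum())

# Duplicated image URLs in the predictions table
print(image_df.jpg_url.duplicated().sum())
```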
A copy of each DataFrame was created (Twitter_archive_df, Tweet_data_df, and dog_pred_df). To fix each quality/tidiness issue, I followed a three-stage model of programmatic data cleaning (illustrated after the list):
- Define
- Code
- Test
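A Define/Code/Test sketch for two of the issues above, assuming the DataFrame copies named in the text:

```python
import pandas as pd

# Define: convert timestamp to a datetime datatype.
# Code:
Twitter_archive_df['timestamp'] = pd.to_datetime(Twitter_archive_df['timestamp'])
# Test:
assert pd.api.types.is_datetime64_any_dtype(Twitter_archive_df['timestamp'])

# Define: collapse the four stage columns into a single "stage" column.
# Code:
stages = ['doggo', 'floofer', 'pupper', 'puppo']
Twitter_archive_df['stage'] = (
    Twitter_archive_df[stages]
    .replace('None', '')
    .apply(''.join, axis=1)   # multi-stage rows become e.g. 'doggopupper'
    .replace('', pd.NA)
)
Twitter_archive_df = Twitter_archive_df.drop(columns=stages)
# Test:
print(Twitter_archive_df['stage'].value_counts(dropna=False))
```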
Following the cleaning process, I merged the three cleaned DataFrames into one using pandas' merge (an inner join on tweet_id). Then, I saved the result as a CSV file.
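A sketch of the merge and store step; the inner joins keep only tweet IDs present in all three cleaned tables, and the output file name is an assumption:

```python
master_df = (
    Twitter_archive_df
    .merge(Tweet_data_df, on='tweet_id', how='inner')
    .merge(dog_pred_df, on='tweet_id', how='inner')
)
master_df.to_csv('twitter_archive_master.csv', index=False)  # hypothetical name
```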
I analyzed the merged dataset to answer the following questions (an example of the queries involved follows the list):
- Which device is used to tweet the most?
- Which tweet IDs have the highest average favorite ("like") counts?
- In which month did the most tweets occur?
- What are the most common dog names?
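Example queries behind these answers, as a sketch; the column names assume the cleaned master DataFrame from the merge step:

```python
print(master_df['source'].value_counts())         # most-used device for tweeting
print(master_df['name'].value_counts().head(10))  # most common dog names
print(master_df['timestamp'].dt.month.value_counts().idxmax())  # busiest month
print(master_df.nlargest(5, 'favorite_count')[['tweet_id', 'favorite_count']])
```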