Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, and then cleaned it. I documented my wrangling efforts in a Jupyter Notebook and showcased them through analyses and visualizations using Python and its libraries.
The dataset I wrangled (and analyzed and visualized) was the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.
WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for us to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.
This project was performed following the below process:
- Gathering Data
- Assessing Data
- Cleaning Data
- Storing Data
- Analyzing and Visualizing
- Reporting
I gathered data from three different sources and formats: a CSV file, a TSV file, and tweet data queried from the Twitter API.
- The WeRateDogs Twitter archive: a CSV file.
- The tweet image predictions: this file (image_predictions.tsv) contains the top image prediction for each tweet, produced by a neural network. It is hosted on Udacity's servers, and I downloaded it programmatically with the Requests library using the following URL: link
- Data from the Twitter API: each tweet's retweet count and favorite ("like") count, at a minimum. Using the tweet IDs in the WeRateDogs Twitter archive, I queried the Twitter API for each tweet's JSON data with Python's Tweepy library (see the sketch below).
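A minimal sketch of the three gathering steps. The file names, credentials, and URL are placeholders (the actual predictions URL is the one linked above), and the Tweepy calls follow the 3.x API:

```python
import json

import pandas as pd
import requests
import tweepy

# Source 1: the WeRateDogs Twitter archive CSV (file name is an assumption).
archive_df = pd.read_csv('twitter_archive_enhanced.csv')

# Source 2: the image predictions TSV, downloaded with Requests.
PREDICTIONS_URL = 'https://example.com/image_predictions.tsv'  # placeholder URL
response = requests.get(PREDICTIONS_URL)
with open('image_predictions.tsv', 'wb') as f:
    f.write(response.content)
image_df = pd.read_csv('image_predictions.tsv', sep='\t')

# Source 3: each tweet's JSON data via the Twitter API using Tweepy
# (Tweepy 3.x style; credentials are placeholders).
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

with open('tweet_json.txt', 'w') as f:
    for tweet_id in archive_df.tweet_id:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, f)
            f.write('\n')
        except tweepy.TweepError:
            pass  # skip deleted or protected tweets

# Read the stored JSON back, keeping only the fields needed here.
rows = []
with open('tweet_json.txt') as f:
    for line in f:
        t = json.loads(line)
        rows.append({'tweet_id': t['id'],
                     'retweet_count': t['retweet_count'],
                     'favorite_count': t['favorite_count']})
tweet_data_df = pd.DataFrame(rows)
```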
I visually and programmatically assessed the dataset and detected the following quality and tidiness issues (a few of the programmatic checks are sketched after the list):
- timestamp and retweeted_status_timestamp should have the datetime datatype
- tweet_id, retweeted_status_id, and retweeted_status_user_id should be object (string) rather than float, to avoid numeric operations being performed on identifiers
- floofer, pupper, puppo, and doggo have many missing values, indicated as the string "None" instead of NaN
- name contains invalid values such as "a", "not", "all", "by", "the", and "my"
- Some tweets have more than one dog stage, e.g. doggo and pupper (12), doggo and floofer (1), doggo and puppo (1)
- Duplicated values in expanded_url
- tweet_id and create_date have incorrect datatypes
- Duplicated values in jpg_url
- doggo, puppo, pupper, and floofer should be combined into one "stage" column (a tidiness issue)
- source wraps the device name (e.g. "Twitter for iPhone") in unnecessary HTML tags
- in_reply_to_status_id and in_reply_to_user_id identify replies, not original tweets
- Only original tweets are needed, so all retweet columns should be dropped, i.e. retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
- create_date already exists in the Twitter archive as timestamp, so create_date should be dropped
- Ensure that all rows contain at least one true prediction
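A handful of the programmatic checks behind the issues above, as a sketch; the DataFrame names follow the gathering sketch:

```python
archive_df.info()  # reveals timestamp stored as object and the ID columns as float

# Invalid "names" such as "a", "the", and "my" surface near the top here
print(archive_df['name'].value_counts().head(20))

# Count tweets tagged with more than one dog stage
stages = ['doggo', 'floofer', 'pupper', 'puppo']
print(((archive_df[stages] != 'None').sum(axis=1) > 1).sum())

# Duplicated image URLs in the predictions table
print(image_df.jpg_url.duplicated().sum())
```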
A copy of each DataFrame was created (Twitter_archive_df, Tweet_data_df, and dog_pred_df). To fix each quality/tidiness issue, I followed a three-stage model of programmatic data cleaning (illustrated after the list):
- Define
- Code
- Test
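A Define/Code/Test sketch for two of the issues above, assuming the DataFrame copies named in the text:

```python
import pandas as pd

# Define: convert timestamp to a datetime datatype.
# Code:
Twitter_archive_df['timestamp'] = pd.to_datetime(Twitter_archive_df['timestamp'])
# Test:
assert pd.api.types.is_datetime64_any_dtype(Twitter_archive_df['timestamp'])

# Define: collapse the four stage columns into a single "stage" column.
# Code:
stages = ['doggo', 'floofer', 'pupper', 'puppo']
Twitter_archive_df['stage'] = (
    Twitter_archive_df[stages]
    .replace('None', '')
    .apply(''.join, axis=1)   # multi-stage rows become e.g. 'doggopupper'
    .replace('', pd.NA)
)
Twitter_archive_df = Twitter_archive_df.drop(columns=stages)
# Test:
print(Twitter_archive_df['stage'].value_counts(dropna=False))
```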
Following the cleaning process, I merged the three cleaned DataFrames into one using pandas' merge (an inner join on tweet_id). Then, I saved the result as a CSV file.
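A sketch of the merge and store step; the inner joins keep only tweet IDs present in all three cleaned tables, and the output file name is an assumption:

```python
master_df = (
    Twitter_archive_df
    .merge(Tweet_data_df, on='tweet_id', how='inner')
    .merge(dog_pred_df, on='tweet_id', how='inner')
)
master_df.to_csv('twitter_archive_master.csv', index=False)  # hypothetical name
```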
I analyzed the merged dataset to answer the following questions (an example of the queries involved follows the list):
- Which device is used to tweet the most?
- Which tweet IDs have the highest average favorite ("like") counts?
- In which month did the most tweets occur?
- What are the most common dog names?
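Example queries behind these answers, as a sketch; the column names assume the cleaned master DataFrame from the merge step:

```python
print(master_df['source'].value_counts())         # most-used device for tweeting
print(master_df['name'].value_counts().head(10))  # most common dog names
print(master_df['timestamp'].dt.month.value_counts().idxmax())  # busiest month
print(master_df.nlargest(5, 'favorite_count')[['tweet_id', 'favorite_count']])
```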