RealPython Data Cleaning
Data sets here: https://github.com/realpython/python-data-cleaning
# %%
A '# %%' marker creates a code cell; cells run in an interactive Jupyter kernel shown to the right.
SHIFT + ENTER or CTRL + ENTER runs the current cell.
CTRL + SPACE brings up IntelliSense, which also shows info on the method under the cursor.
User Settings - settings that apply globally to any instance of VS Code you open.
Workspace Settings - settings stored inside your workspace that apply only when that workspace is opened.
type 'code .' in any folder to start editing files in that folder.
data.assign(day=lambda df: df.loc[:, "day"] > 9)
# replaces the day column with True/False: is day > 9?
data.drop(
    columns=[
        "col1",
        "date",
        "univ",
    ]
)
Method #1: a per-column count and percentage of missing values is the most basic way to detect missing data among columns.
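A minimal sketch of that count-and-percentage check (assuming pandas is imported as pd and the DataFrame is df):
missing_count = df.isna().sum()        # missing values per column
missing_pct = df.isna().mean() * 100   # percentage missing per column
print(pd.concat([missing_count, missing_pct], axis=1, keys=["count", "pct"]))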
df[non_numeric_cols].info()  # dtype and non-null counts for the non-numeric columns
To remove the first duplicate and keep the last occurrence, use data.drop_duplicates with keep="last":
data.drop_duplicates(subset=["Name"], keep="last", inplace=True)
data[data.isnull().any(axis=1)]  # show only the rows that contain at least one null
Correcting Misspelled Words (by hand or with spell-checker Python libraries)
To list all unique words in the Type column, use unique(); swap in other columns for Type as needed:
unique_type = list(data["Type"].str.split(", ", expand=True).stack().unique())
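As a sketch of the spell-checker-library route, the third-party pyspellchecker package (an assumption; the notes don't name a specific library) can flag which of those unique words look misspelled:
from spellchecker import SpellChecker  # third-party: pip install pyspellchecker

spell = SpellChecker()
for word in spell.unknown(unique_type):        # words not found in the dictionary
    print(word, "->", spell.correction(word))  # most likely correction
Legitimate domain words (e.g. fantasy type names) may be flagged too, so review the suggestions rather than applying them blindly.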
There are two misspelled words (Posion and Fie) that should be “Poison” and “Fire.” Go through the dataset, find which rows have the misspelled words, and fix them.
Finding Misspelled Words
To show all rows whose Type column contains the word Posion, use str.contains(). The regex argument is set to False (regex=False) to treat the string (Posion) as a literal string and not a regular expression.
data[data["Type"].str.contains("Posion", regex=False)]
In the following output, there are four rows (Arbok (24), Nidorina (30), Nidoqueen (30) and Nidoran (32)) that have the misspelled word Posion in the Type column.
Fixing Misspelled Words
To replace Posion with the word Poison for all entries in the Type column:
data["Type"] = data["Type"].apply(lambda x: x.replace("Posion", "Poison"))
numeric_cols = df.select_dtypes(include=['number']).columns
print(numeric_cols)
non_numeric_cols = df.select_dtypes(exclude=['number']).columns
print(non_numeric_cols)
Combining two datasets into one and lining up rows based on some common property is a frequent need in data analysis. We can do this by using the merge() function of the DataFrame.
DataFrame_name.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
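A minimal sketch of an inner merge on a shared key column (the frame contents and the id/name/type columns here are made-up illustrations):
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Bulbasaur", "Charmander", "Squirtle"]})
right = pd.DataFrame({"id": [2, 3, 4], "type": ["Fire", "Water", "Bug"]})
merged = left.merge(right, how="inner", on="id")  # keep only ids present in both
print(merged)
#    id        name   type
# 0   2  Charmander   Fire
# 1   3    Squirtle  Water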
Rebuild Missing Data
To find and fill the missing data in the dataset, we will use a few more functions. There are four ways to find null values in the dataset. Let's see them one by one:
Using isnull() function:
data.isnull()
This function returns a boolean for every cell of the complete dataset, showing whether or not that cell is null.
data.isna()
isna() is the same as the isnull() function and provides the same output.
Using isna().any()
data.isna().any()
This also reports whether any null value is present, but column-wise: one True/False per column rather than a full table.
data.isna().sum()
This gives the column-wise sum of the null values present in the dataset.
data.isna().any().sum()
This collapses everything to a single number: the count of columns that contain at least one null value.
There are no null values present in our dataset. But if there were any, we could fill those places with another value using the fillna() function of DataFrame. Following is the syntax of the fillna() function:
DataFrame_name.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
This function fills the null spaces (NA/NaN) with the value you supply, for example 0.
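A quick sketch of the two common call styles (the column names in the second call are hypothetical):
data.fillna(0)                               # fill every null with 0
data.fillna({"colB": 0, "colD": "unknown"})  # a different fill value per column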
Standardization and Normalization
Data standardization and normalization are common practices in machine learning.
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
This step is not needed for the dataset we are using. So, we will skip this step.
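Although we skip it here, a minimal sketch of both scalings in plain pandas (assuming a numeric column named "Weight", a hypothetical name):
col = df["Weight"]
standardized = (col - col.mean()) / col.std()              # mean 0, unit standard deviation
normalized = (col - col.min()) / (col.max() - col.min())   # rescaled to the [0, 1] range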
De-duplicate means remove all duplicate values. There is no need for duplicate values in data analysis; they only hurt the accuracy and efficiency of the result. To find duplicate values in the dataset we will use a simple DataFrame function, duplicated(). Let's see the example:
data.duplicated()
This function returns a boolean for each row flagging whether it duplicates an earlier one. As we can see, this dataset doesn't contain any duplicate values. If a dataset does contain duplicate values, they can be removed using the drop_duplicates() function. Following is the syntax of this function:
DataFrame_name.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
import csv  # import to use the csv module

with open('demo_csv.csv', mode="r") as csv_file:  # "r" represents the read mode
    reader = csv.reader(csv_file)  # this is the reader object
    for item in reader:  # loop through the document to get each row
        print(item)
str.replace returns a new Series, so assign the result back to keep the change:
# remove whitespaces from the Name column
data["Name"] = data["Name"].str.replace(' ', '')
# remove whitespaces from the Type column
data["Type"] = data["Type"].str.replace(' ', '')
# remove whitespaces from the Weaknesses column
data["Weaknesses"] = data["Weaknesses"].str.replace(' ', '')
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
(73, 15, 55, 33, 'foo'),
(63, 64, 11, 11, 'bar'),
(56, 72, 57, 55, 'foo'),
],
columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
# A B C D E
# 0 73 15 55 33 foo
# 1 63 64 11 11 bar
# 2 56 72 57 55 foo
(df.loc[df['B'] == 64])
A B C D E
1 63 64 11 11 bar
(df[df['B'] == 64])
A B C D E
1 63 64 11 11 bar
(df[df.B == 64])
mask = df.B == 64
df[mask]
A B C D E
1 63 64 11 11 bar
(df.loc[df['B'] != 64])
A B C D E
0 73 15 55 33 foo
2 56 72 57 55 foo
(df.loc[df['B'] >= 64])
A B C D E
1 63 64 11 11 bar
2 56 72 57 55 foo
To filter a DataFrame so that you only keep rows containing a specific (sub)string in a particular column:
(df.loc[df['E'].str.contains('oo')])
A B C D E
0 73 15 55 33 foo
2 56 72 57 55 foo
(df.loc[df['B'].isin([64, 15])])
A B C D E
0 73 15 55 33 foo
1 63 64 11 11 bar
To select rows whose value is not in an iterable (e.g. a list), you can simply use the negation operator ~:
(df.loc[~df['B'].isin([64, 15])])
A B C D E
2 56 72 57 55 foo
(df.loc[(df['A'] >= 59) & (df['E'].isin(['foo', 'boo']))])
A B C D E
0 73 15 55 33 foo
import pandas as pd
df = pd.DataFrame(
[
(1, 100, None, 'A'),
(2, None, True, 'B'),
(3, 150, None, None),
(4, 100, None, 'B'),
(5, None, False, 'B'),
(6, 120, False, 'A'),
(7, 45, True, 'C'),
],
columns=['colA', 'colB', 'colC', 'colD']
)
print(df)
colA colB colC colD
0 1 100.0 None A
1 2 NaN True B
2 3 150.0 None None
3 4 100.0 None B
4 5 NaN False B
5 6 120.0 False A
6 7 45.0 True C
Count the NaN values in a specific column
To count the number of rows containing NaN values in a specific column, you can use pandas.Series.isna
(df['colB'].isna().sum())
2
(df['colA'].isna().sum())
0
(df['colB'].isnull().sum())
2
To get the count of missing values for each individual column, you can use the pandas.DataFrame.isna() method followed by sum(). The output will be a Series object containing the counts for each column in the original DataFrame:
(df.isna().sum())
colA 0
colB 2
colC 3
colD 1
dtype: int64
If you want the counts for just a subset of the columns within the original DataFrame, you can perform the same operation as the one we used in the previous section, but this time slicing the DataFrame:
(df[['colA', 'colD']].isna().sum())
colA 0
colD 1
dtype: int64
To count the number of rows that have missing values in all of the specified columns, you can use the notation below. For example, to count how many rows have missing values in both colC and colD:
(df[['colC', 'colD']].isna().all(axis=1).sum())
1
To count the number of rows containing only missing values in every column across the whole DataFrame, you can use the expression shown below. Note that in our example DataFrame, no such row exists and thus the output will be 0.
(df.isnull().all(axis=1).sum())
0
To count how many missing values the whole DataFrame contains across every single column, you can use the following expression:
(df.isna().sum().sum())
6
import pandas as pd
import numpy as np
olympics = pd.read_csv("olympics.csv")
olympics.head()
olympics = pd.read_csv("olympics.csv", header=1)
olympics.head()
col_names = {'Unnamed: 0': 'Country',
'? Summer': 'Summer Olympics',
'01 !': 'Gold',
'02 !': 'Silver',
'03 !': 'Bronze',
'? Winter': 'Winter Olympics',
'01 !.1': 'Gold.1',
'02 !.1': 'Silver.1',
'03 !.1': 'Bronze.1',
'? Games': '# Games',
'01 !.2': 'Gold.2',
'02 !.2': 'Silver.2',
'03 !.2': 'Bronze.2'}
olympics.rename(columns=col_names, inplace=True)
The ‘applymap’ function applies a given function element-wise across the DataFrame. Here we define the function that will take a string as a parameter and remove everything after the first opening bracket ‘(‘:
def get_country(st):
    if ' (' in st:
        return st[:st.find(' (')]  # keep the string up to the char before ' ('
    else:
        return st
Use this function as a parameter in the ‘applymap’ function.
pd.DataFrame(olympics['Country']).applymap(get_country)
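Note that applymap returns a new DataFrame rather than modifying olympics in place; to keep the cleaned names, assign the result back (for a single column, map is equivalent):
olympics["Country"] = olympics["Country"].map(get_country)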
To deal with null values, one way is to drop them all, using this code that deletes all the rows with any null values:
df.dropna()
The problem is, in this way you may be left with only a few rows of data.
Or fill up all the null values with zeros like this:
df.fillna(0)
This will fill up all the null values with zeros.
The null values can also be filled with the value immediately before them (forward fill):
df.fillna(method="ffill")
Or you can fill the null values with the value exactly after them (backward fill):
df.fillna(method="bfill")
Or fill the null values with the median. It is also common to fill null values with the mean. We loop through the numeric columns and fill each one's nulls with that column's median (taking the median of a text column would raise an error):
for i in df.select_dtypes(include="number").columns:
    df[i].fillna(df[i].median(), inplace=True)
Get rid of duplicate rows
people = people.drop_duplicates(subset="Name")  # treat rows with the same Name as duplicates
The ‘Year’ column has values that are not clean 4-digit years; extract the leading 4-digit year:
year_ex = people["Year"].str.extract(r'^(\d{4})',expand=False)
year_ex
Output:
0 1980
1 1978
2 1982
3 1992
4 1987
Name: Year, dtype: object
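The extracted values are still strings (dtype: object); to do arithmetic on them, convert to a numeric dtype, e.g.:
people["Year"] = pd.to_numeric(year_ex)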
people["Income"].replace({"K": "*1e3", "M": "*1e6"}, regex=True).map(pd.eval).astype(int)
Output:
0 2000000
1 40000
2 120000
3 10000000
4 6000000
Name: Income, dtype: int32
In the edu column, the words Harvard, Stanford or Oxford come with some extra words or characters. If used as-is, two Harvard entries would be treated as two different values. So, we need to get rid of those extra characters and keep exactly the same string for the same university.
First, make three boolean variables for the three different universities.
edu = people["Education"]
harvard = edu.str.contains("Harvard")
stanford = edu.str.contains("Stanford")
oxford = edu.str.contains("Oxford")
Now, we will use “Harvard” where the string contains “Harvard”, “Stanford” where the string contains “Stanford”, and so on, with nested np.where() functions.
people["Education"] = np.where(harvard, "Harvard",
np.where(stanford, "Stanford",
np.where(oxford, "Oxford", people["Education"])))
Output:
0 Harvard
1 Oxford
2 Oxford
3 Stanford
4 Harvard
Name: Education, dtype: object
To analyze the time or date data in pandas, it is useful to have ‘datetime’ format.
pd.to_datetime(people['Graduation'])
Output:
0 2020-03-12
1 2020-04-14
2 2020-08-01
3 2020-05-18
4 2021-05-10
Name: Graduation, dtype: datetime64[ns]
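Once converted (and assigned back to the column), the .dt accessor exposes the date parts, e.g.:
people["Graduation"] = pd.to_datetime(people["Graduation"])
people["Graduation"].dt.year  # 2020, 2020, 2020, 2020, 2021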
Make all the strings lower case. That way ‘Low’ and ‘low’ will not be treated as different values.
people["Standing"] = people["Standing"].str.lower()
Get rid of all the numbers and other characters using a simple regular expression (this needs the re module):
import re
people["Standing"].map(lambda x: re.sub('([^a-z]+)', '', x))
Output:
0 medium
1 low
2 excellant
3 low
4 excellant
Name: Standing, dtype: object