Making environment more useful
Enable conda in cmd
- add this to the system environment PATH:
C:\ProgramData\Anaconda3\Scripts
Run Jupyter Notebook instantly from cmd
- run
jupyter notebook
List of running Jupyter Notebook
- in cmd run
jupyter notebook list
Stop running notebook
jupyter notebook stop
Update conda
- open anaconda shell as administrator (also in cmd if enabled) and run
conda update --all
If any error related to "Module not found"
- note the missing module's name (e.g. yapf), then run:
pip install yapf
or conda install yapf
Adding Anaconda in environment path
Note: my Anaconda setup was installed for my user only, not for all users on the system
- Go to
Edit the system environment variables
and add these to PATH:
C:\Users\Tansen\anaconda3
C:\Users\Tansen\anaconda3\Library\mingw-w64\bin
C:\Users\Tansen\anaconda3\Library\usr\bin
C:\Users\Tansen\anaconda3\Library\bin
C:\Users\Tansen\anaconda3\Scripts
Enable Jupyter Notebook extension (If not present in notebook)
If the two approaches below don't work, there is likely a version-related problem.
- Install with one command in
Anaconda Prompt
(open as admin for write permission):
conda install -c conda-forge jupyter_contrib_nbextensions
- Descriptive way:
pip install jupyter_contrib_nbextensions
- Now type and enter:
pip install jupyter_nbextensions_configurator
- Type and enter:
jupyter contrib nbextension install --user
- Type and enter:
jupyter nbextensions_configurator enable --user
- Enable/disable an extension by command (e.g. the highlighter extension):
jupyter nbextension enable highlighter/highlighter
jupyter nbextension disable highlighter/highlighter
- Now open Jupyter Notebook; the
Nbextensions
tab will be there. Enable the required extensions.
Disable warnings
- Create disable-warnings.py at location C:\Users\Tansen\.ipython\profile_default\startup
- In disable-warnings.py add:
import warnings
warnings.filterwarnings('ignore')
- These lines can also be written in a notebook cell to disable warnings.
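As a quick sanity check, the same two lines can be tried in plain Python outside of IPython; this sketch just verifies that a raised warning is swallowed by the filter:

```python
import warnings

# The same two lines as in disable-warnings.py: silence all warnings globally.
warnings.filterwarnings('ignore')

# Sanity check: a UserWarning raised now is suppressed instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.warn("this should be ignored", UserWarning)

print(len(caught))   # 0 -> nothing was recorded, the filter works
```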
Virtual Environment Setup and Package Installation:
- In the project directory open cmd, then type and enter:
conda create --prefix ./env jupyter
- Now activate the environment:
conda activate {"location path of ./env"}
- Now let's install a package called pyresparser. Install the dependencies one by one:
pip install nltk
pip install spacy==2.3.8 --no-cache-dir --only-binary :all:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install pyresparser
Machine Learning Study Roadmap
- Supervised Learning
- Supervised Learning Algorithm
- SVM
- RF (Random Forest)
- Decision Tree
- Linear Regression
- Naive Bayes
- Neural Network
- Unsupervised Learning
- Unsupervised Learning Algorithm
- K-means clustering
- K-nearest neighbor
- Semi-Supervised Learning
- Reinforcement
- Batch and online learning
- Dataset visualization, Analysis
- Data Cleaning, Preprocessing
- Feature Extraction
- Binary and Multilevel classification
- Confusion Matrix
- ROC and AUC Curve
- Errors
- Dimensionality Reduction
Machine Learning 101
A subset of AI that focuses on building systems that can learn from data and make predictions or decisions without being explicitly programmed.
- AI—>A machine that acts like a human
- Machine Learning—> a subset of AI
- An approach to achieve artificial intelligence through systems that can find patterns in a set of data.
- Stanford University describes machine learning as the science of getting computers to act without being explicitly programmed. This means we do not need to tell the machine: do this, do that, if this then that, etc.
- Deep Learning—>a subset of Machine Learning
- one of the techniques for implementing machine learning
- Data Science—>Analyzing data and then doing something. It include both Machine Learning & Deep Learning.
- Narrow AI—>A machine that acts like a human for a specific task
- Detecting heart disease from images
- Game of Go, chess, StarCraft, and other video games
- Only works on a single task
- General AI: A machine that acts like a human with multiple abilities
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[Spreadsheets] -->|Then we move to| B[Relational DB - MySQL]
B -->|Then we move to| C[Big Data - NoSQL]
C -->|Finally| D[Machine Learning]
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[Data Collection] --> B[Data Preprocessing]
B -->|What problem are we trying to solve?| C[Problem Definition]
C -->|What data do we have?| D[Data]
D -->|What Defines success?| E[Evaluation]
E -->|What features<br/>should we model?| F[Features]
F -->|What kind of<br/>model should we use?| G[Modelling]
G -->|What have we tried<br/>/what else can we try?| H[Experiments]
- Data collection (Hardest part)
- How to clean noisy data?
- What can we grab data from?
- How do we find data?
- How do we clean it so we can actually learn from it?
- How to turn data from useless to useful?
- Data modelling
- Problem definition: What problem are we trying to solve?
- Data: What data do we have?
- Evaluation: What defines success?
- Features: What features should we model?
- Modelling: What kind of model should we use?
- Experiments: What have we tried / What else can we try?
Playground
- Teachable Machine by Google (easy)
- ML-Playground (easy)
- ML Playground by Mrityunjay Bhardwaj (advanced)
- ML Playground by TensorFlow (More advanced)
- Classical Learning
- Supervised
- Classification
- K-NN
- Naive Bayes
- SVM
- Decision Trees
- Logistic Regression
- Regression
- Linear Regression
- Polynomial Regression
- Ridge/Lasso Regression
- Unsupervised
- Clustering
- Fuzzy C-Means
- Mean-Shift
- K-Means
- DBSCAN
- Agglomerative
- Pattern Search
- Eclat
- Apriori
- FP-Growth
- Dimension Reduction (Generalization)
- T-SNE
- PCA
- LSA
- SVD
- LDA
- Reinforcement Learning
- Genetic Algorithm
- A3C
- SARSA
- Q-Learning
- Deep Q-Network (DQN)
- Neural Networks and Deep Learning
- Convolutional Neural Networks (CNN)
- DCNN
- Recurrent Neural Networks (RNN)
- LSM
- LSTM
- GRU
- Generative Adversarial Networks (GAN)
- Vanilla GAN
- Super Resolution GAN (SRGAN)
- Autoencoders
- Seq2seq
- Perceptrons (MLP)
- Ensemble Methods
- Stacking
- Bagging
- Random Forest
- Boosting
- AdaBoost
- CatBoost
- LightGBM
- XGBoost
Types of Machine Learning Diagram
Types Simplified
- Predict results based on incoming data
- Supervised: Data are labeled into categories
- Classification: is this an apple or is this a pear?
- Regression: based on input to predict stock prices
- Unsupervised: Data don't have labels
- Clustering: the machine creates these groups
- Association rule learning: associate different things to predict what a customer might buy in the future
- Reinforcement: teach machines through trial and error
- Reinforcement: teach machines through rewards and punishment
- Skill acquisition
- Real time learning
What Is Machine Learning Part 02
- Now: Data -> machine learning algorithm -> pattern
- Future: New data -> Same algorithm (model) -> More patterns
- Normal algorithm: Starts with inputs and steps -> Makes output
- Machine learning algorithm
- Starts with inputs and output -> Figures out the steps
- Data analysis is looking at a set of data and gaining an understanding of it by comparing different examples and different features, and making visualizations like graphs
- Data science is running experiments on a set of data with the hopes of finding actionable insights within it
- One of these experiments is to build a machine learning model
- Data Science = Data analysis + Machine learning
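The contrast above (a normal algorithm takes inputs plus steps and makes an output; a machine learning algorithm takes inputs plus outputs and figures out the steps) can be sketched with a toy example. This is a minimal least-squares line fit on made-up data, not any specific library's method:

```python
# We know the inputs and the outputs; the algorithm figures out the "steps"
# (here, the slope and intercept of a line).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]            # secretly generated by y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(slope, intercept)          # 2.0 1.0 -> the learned "steps"
```

Given new inputs, the learned slope and intercept (the model) produce new predictions, which mirrors the Data -> algorithm -> pattern idea above.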
Section Review
- Machine Learning lets computers make decisions about data
- Machine Learning lets computers learn from data and they make predictions and decisions
- Machines can learn from big data to predict future trends and make decisions
Machine Learning and Data Science Framework
Steps to learn machine learning
- Create a framework
- Match it to data science and machine learning tools
- Learn by doing
6 Step Field Guide for Building Machine Learning Projects
- Problem definition — What business problem are we trying to solve? How can it be phrased as a machine learning problem?
- Supervised or Unsupervised ?
- Classification or Regression ?
- Data — What data do we have? How does it match the problem definition?
- Structured or Unstructured?
- Static or streaming?
- Evaluation — What defines success? Is a 95% accurate machine learning model good enough?
- Different types of metrics
- Predicted price vs Actual price
- Features — What parts of our data are we going to use for our model? How can what we already know influence this?
- Example: Heart disease
- Turn features such as weight, gender, BP, and chest pain into patterns to predict whether a patient has heart disease
- Modelling — Which model should you choose based on your problem and data? How can you improve it? How do you compare it with other models?
- Different problem —> Different type of model/algorithm
- Experimentation — What else could we try? Does our deployed model do as we expected? How do the other steps change based on what we’ve found? How could we improve it?
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[Data Collection] --> B[Data Preprocessing]
B -->|What problem are we trying to solve?| C[Problem Definition]
C -->|What data do we have?| D[Data]
D -->|What Defines success?| E[Evaluation]
E -->|What features<br/>should we model?| F[Features]
F -->|What kind of<br/>model should we use?| G[Modelling]
G -->|What have we tried<br/>/what else can we try?| H[Experiments]
When shouldn't you use machine learning?
- if a simple hand-coded instruction-based system works
Main types of machine learning
- Supervised Learning (Common)
- Unsupervised Learning (Common)
- Transfer Learning (Common)
- Reinforcement Learning (Uncommon)
Supervised Learning
- Data + labels —> make prediction
- Classification
- Both Binary & Multiclass Classification
- Binary Classification = 2 Options
- Based on 2 option choose 1
- Example : Heart disease or no heart disease?
- Multiclass Classification = More than 2 Options
- Example : Predict dog breed from photos
- Regression
- It is used to predict number
- It is also referred to as predicting a continuous number
- A number that goes up or down
- Classical Regression problem
- Predict the sale price of a house based on number of rooms, area, etc.
- How many people will buy this app, based on clicks?
Unsupervised Learning
- There is data but no labels
- Example Scenario
- The marketing team wants to send out a promotion for next summer
- Here is the table:

Customer ID | Purchase 1 | Purchase 2 |
---|---|---|
1 | Summer Cloth | Summer Cloth |
2 | Winter Cloth | Winter Cloth |
3 | Summer Cloth | Summer Cloth |

- Now you have to find out which customers are interested in summer cloth from this store
- To solve this, you make a group of customers who purchase only in summertime and a group of customers who purchase only in wintertime
- These are Cluster 1 (summer) and Cluster 2 (winter)
- Now label them in summer and winter lists
- This is called Clustering
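The scenario above can be sketched in plain Python. Note this uses a hand-written rule standing in for clustering (a real algorithm such as k-means would discover the groups without the rule), and the purchase records are the hypothetical table above:

```python
# Hypothetical purchase records mirroring the table above.
purchases = {
    1: ['Summer Cloth', 'Summer Cloth'],
    2: ['Winter Cloth', 'Winter Cloth'],
    3: ['Summer Cloth', 'Summer Cloth'],
}

# Group customers by what they buy: summer-only vs winter-only shoppers.
clusters = {'summer': [], 'winter': []}
for customer_id, items in purchases.items():
    season = 'summer' if all('Summer' in item for item in items) else 'winter'
    clusters[season].append(customer_id)

print(clusters)   # {'summer': [1, 3], 'winter': [2]}
```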
Transfer Learning
- It leverages what one machine learning model has learned in another machine learning model
- Example—>Predict what dog breed appears in a photo
- Start from an already created, existing model —> e.g. a car model identification model
- Now use those foundational patterns and apply them to the dog breed problem
Reinforcement Learning
A computer program performs some actions within a defined space; we reward it for doing well or punish it for doing poorly
- Example —> Teach a machine learning algorithm to play chess
Recap
Matching your problem
- Supervised Learning: I know my inputs and outputs
- Unsupervised Learning: I am not sure of the outputs but I have inputs
- Transfer Learning: I think my problem may be similar to something else
- Structured Data —> data organized in rows and columns
- Unstructured Data —> Videos, Photos, Audio files
- We have to structure this by converting it to numbers
- Static data —> doesn't change over time. e.g : csv file
- Streaming Data —> which is constantly changed over time
- Example : predict how a stock price will change based on news headlines
- News headlines are being updated constantly, and you'll want to see how they change stock prices
A data science workflow
- Open a csv file in Jupyter Notebook (a tool to build machine learning projects)
- Perform data analysis with pandas (a Python library for data analysis)
- Make visualizations such as graphs and compare different data points with Matplotlib
- Build a machine learning model on the data using scikit-learn to predict using these patterns
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[CSV file] -->|Open|B[Jupyter Notebook]
B -->|Data analysis|C[Pandas]
B -->|Visualization|D[Matplotlib]
C -->|Scikit Learn|E[Machine Learning Model]
D -->|Scikit Learn|E
E --> F[Heart Disease or not?]
Example
- If your problem is to use patient medical records to classify whether someone has heart disease or not, you might start by saying: for this project to be valuable, we need a machine learning model with over 99% accuracy
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[Data] -->B[Machine Learning Model]
B -->C[Heart Disease or not?]
C -->D[Accuracy]
- This type of problem requires a highly accurate model
Different types of metrics
Classification | Regression | Recommendation |
---|---|---|
Accuracy | Mean Absolute Error (MAE) | Precision at K |
Precision | Mean Squared Error (MSE) | |
Recall | Root Mean Squared Error (RMSE) | |
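These metrics are easy to compute by hand. A minimal sketch on made-up predictions (pure Python, no library assumed; the labels and prices are invented for illustration):

```python
import math

# Classification metrics on made-up labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

# Regression metrics on made-up predicted vs actual prices.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
errors = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error
mse = sum(e * e for e in errors) / len(errors)    # Mean Squared Error
rmse = math.sqrt(mse)                             # Root Mean Squared Error

print(precision, recall)   # 0.75 0.75
```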
Data types: another word for different forms of data
- Structured or Unstructured data
- Feature variables — Target variable
Example
ID | Weight | Gender | Heart Rate | Chest Pain | Heart Disease |
---|---|---|---|---|---|
1 | 110kg | M | 81 | 4 | Yes |
2 | 64kg | F | 61 | 1 | No |
3 | 51kg | M | 57 | 0 | No |
- Weight, Gender, Heart Rate, Chest Pain —>Feature variables
- Heart Disease —> Target variable
Different features of data
- Numerical features : number; like body weight
- Categorical features : One thing or another ; like gender or whether a patient is a smoker or not etc
- Derived features : look at different features of the data and create a new feature or alter an existing one
- Example: look at someone's hospital visit history timestamps; if they've had a visit in the last year, you could make a categorical feature called "visited in last year". If someone had visited in the last year, they would get True.
- Feature engineering: the process of deriving features

ID | Weight | Gender | Heart Rate | Chest Pain | Heart Disease | Visited Last Year |
---|---|---|---|---|---|---|
1 | 110kg | M | 81 | 4 | Yes | Yes |
2 | 64kg | F | 61 | 1 | No | Yes |
3 | 51kg | M | 57 | 0 | No | No |
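Deriving a "Visited Last Year" feature from visit timestamps can be sketched like this (pure Python; the dates are hypothetical, and a fixed "today" keeps the result reproducible):

```python
from datetime import datetime, timedelta

today = datetime(2023, 6, 1)          # fixed "today" for reproducibility
# Hypothetical last-visit timestamp per patient ID.
last_visit = {
    1: datetime(2023, 1, 15),
    2: datetime(2022, 11, 3),
    3: datetime(2021, 7, 20),
}

# Derived categorical feature: did the patient visit within the last 365 days?
visited_last_year = {
    pid: 'Yes' if (today - ts) <= timedelta(days=365) else 'No'
    for pid, ts in last_visit.items()
}

print(visited_last_year)   # {1: 'Yes', 2: 'Yes', 3: 'No'}
```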
- Unstructured data has features too
- images of dogs
- look for different shape in images
- look for similarity
- Eyes, leg, tail etc
A machine learning algorithm can figure out what features are there on its own
- What features should you use?
- A machine learning algorithm learns best when all samples have similar information
- Feature coverage: process of ensuring all samples have similar information
- Every field should have a value; at least want > 10% coverage
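Feature coverage can be measured directly. A minimal sketch over hypothetical records where None marks a missing value (with pandas the equivalent would be `df.notna().mean()`):

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {'weight': 110,  'heart_rate': 81,   'chest_pain': 4},
    {'weight': 64,   'heart_rate': None, 'chest_pain': 1},
    {'weight': None, 'heart_rate': 57,   'chest_pain': 0},
    {'weight': 51,   'heart_rate': 60,   'chest_pain': None},
]

# Coverage per feature: fraction of samples where the field has a value.
coverage = {
    field: sum(1 for r in records if r[field] is not None) / len(records)
    for field in records[0]
}

print(coverage)   # every feature here has 0.75 (75%) coverage
```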
Based on our problem and data, what model should we use?
3 parts to modelling
- Choosing and training a model
- Tuning a model
- Model comparison
The most important concept in machine learning
- The training, validation, and test sets (the 3 sets)
- Data is split into 3 sets
- Training set: train your model on this
- Validation set: tune your model on this
- Test set: test and compare on this
- Example (University)
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[Course Materials<->Training] -->B[Practice Exam<->Validation]
B -->C[Final Exam<->Test Set]
This process is referred to as Generalization in Machine Learning
Generalization: The ability for a machine learning model to perform well on data it has not seen before
When things go wrong
- Everyone participates in the practice exam
- Everyone does well
- Now for the final exam, the professor gives the same questions
- Everyone gets top marks
- This looks good, but did the students learn anything?
- This scenario needs to be avoided in machine learning by doing the following:
- Split 100 patient records
- training split: 70 patient records (70-80%)
- validation split: 15 patient records (10-15%)
- test split: 15 patient records (10-15%)
- After training on the training set
- The validation split is used to improve the model, which is called tuning
- Next, use the improved model with the test split
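The 70/15/15 split above can be sketched in a few lines of pure Python (scikit-learn's `train_test_split` does the same job; the record IDs here are placeholders for real patient records):

```python
import random

# 100 hypothetical patient record IDs.
records = list(range(100))

random.seed(42)        # reproducible shuffle
random.shuffle(records)

# 70 / 15 / 15 split, as in the notes above.
train = records[:70]
validation = records[70:85]
test = records[85:]

print(len(train), len(validation), len(test))   # 70 15 15
```

Shuffling first matters: it stops the split from accidentally following any ordering in the data (e.g. all sick patients listed first).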
- 3 parts to modelling
- Choosing and training a model —> training data
- Tuning a model —> validation data
- Model comparison —> test data
Choosing a model
- Based on the data, choose a model
- Structured Data
- Unstructured Data
- Deep Learning
- Transfer Learning
- Training a model
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
graph LR
A[X - Data] -->|Inputs|B[Model]
B -->|Predict|C[Y - label]
- X is Feature variable and Y is Target variable
- Different machine learning algorithms (models) have different ways of doing this
- Goal —> minimize time between experiments
- If the dataset includes 100 thousand examples
- Start with the first 10 thousand and see how it goes
- Start with a less complicated model (algorithm)

Experiment | Model | Accuracy | Training Time |
---|---|---|---|
1 | 1 | 87.5% | 3 min |
2 | 2 | 91.3% | 92 min |
3 | 3 | 94.7% | 176 min |
Things to remember
- Some models work better than others on different problems
- Don't be afraid to try things
- Start small and build up (add complexity) as you need.
Example
- Random Forest
- adjust number of trees
- Neural Networks
- adjust number of layers
Things to remember
- Machine learning models have hyper parameters you can adjust
- A model's first results are not its last
- Tuning can take place on training or validation data sets
If the data split is done well, it will indicate how well the model will perform on unseen data
Testing a model
This is alright : ✅
Data Set | Performance |
---|---|
Training | 98% |
Test | 96% |
- Balanced (Goldilocks zone)
Underfitting (Potential) ❌
Data Set | Performance |
---|---|
Training | 64% |
Test | 47% |
Overfitting (Potential)❌
Data Set | Performance |
---|---|
Training | 93% |
Test | 99% |
In simple way
Underfitting | Overfitting |
---|---|
Data mismatch | Data leakage |
Test data is different from training data | Training data overlaps test data |
Fixes for underfitting
- Try a more advanced model
- Increase model hyperparameters
- Reduce amount of features
- Train longer
Fixes for overfitting
- Collect more data
- Try a less advanced model
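The tables above suggest a simple rule of thumb, which this sketch encodes as a function. The thresholds (a 0.7 floor and a 0.05 train/test gap) are my assumptions for illustration, not fixed rules:

```python
def diagnose_fit(train_score, test_score, floor=0.7, gap=0.05):
    """Rough heuristic based on the tables above. The thresholds are
    arbitrary: both scores low -> underfitting; a large train/test gap
    (either direction) -> potential overfitting, data mismatch, or leakage."""
    if train_score < floor and test_score < floor:
        return 'underfitting (potential)'
    if abs(train_score - test_score) > gap:
        return 'overfitting (potential)'
    return 'balanced'

print(diagnose_fit(0.98, 0.96))   # balanced (Goldilocks zone)
print(diagnose_fit(0.64, 0.47))   # underfitting (potential)
print(diagnose_fit(0.93, 0.99))   # overfitting (potential)
```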
No model is perfect, so investigate good results as much as you investigate poor results
Experiment | Model | Accuracy | Training Time | Prediction Time |
---|---|---|---|---|
1 | 1 | 87.5% | 3 min | 0.5 sec |
2 | 2 | 91.3% | 92 min | 1 sec |
3 | 3 | 94.7% | 176 min | 4 sec |
Things to remember
- Want to avoid overfitting and underfitting (head towards generality)
- Keep the test set separate at all costs
- Compare apples to apples
- Model 1 on dataset 1
- Model 2 on dataset 1
- The best score on one performance metric does not equal the best model
How could we improve / what can we try next?
- Start with a problem
- Data Analysis: Data, Evaluation, Features
- Machine learning modelling: Model 1
- Experiments: Try model 2 then 3
6 Step Machine Learning Framework questions
- Problem definition: What kind of problem ?
- Data: What type of data ?
- Evaluation: What do you measure ?
- Features: What are features of your problems ?
- Modelling: What kind of model should you use?
- Experiments: What have you tried / what else can you try?
- Data Science: 6 Step Machine Learning Framework
- Data Science: Anaconda, Jupyter Notebook
- Data Analysis: Data, Evaluation and Features
- Data Analysis: pandas, Matplotlib, NumPy
- Machine Learning: Modelling
- Machine Learning: TensorFlow, PyTorch, scikit-learn, XGBoost, CatBoost
Data Science Environment Setup
Steps to learn machine learning
- Create a framework (we created one in the previous section)
- Match to data science and machine learning tools
- Learn by doing
Machine Learning Tools
- Anaconda: Hardware Store = 7.53GB (443 packages, v2022.10, updated 30 Jan 2023)
- Miniconda: Workbench = 200 MB
- Choosing Anaconda vs Miniconda

Anaconda | Miniconda |
---|---|
New to conda or Python | Familiar with conda and Python |
Preinstalled packages | Can install individual packages |
Have the time and disk space | Not enough disk space |

- Conda: a package manager used to set up the rest of the tools
- Data Analysis: pandas, Matplotlib, NumPy
- Machine Learning: TensorFlow, PyTorch, scikit-learn, XGBoost, CatBoost
- Note: Miniconda requires conda to install the tools. Anaconda comes with the full packages, but the packages need updating (mentioned earlier how to update). So use either Miniconda + conda, or Anaconda.
Command Mode (press Esc to enable) | Edit Mode (press Enter to enable) |
---|---|
H: get full list of shortcuts | Shift + Enter: run the current cell and move to the next one. |
Esc: enter command mode. | Ctrl + Enter: run the current cell and keep it selected. |
A: insert a new cell above the current cell. | Ctrl + ]: indent the current block. |
B: insert a new cell below the current cell. | Ctrl + [: un-indent the current block. |
C: copy the current cell. | Ctrl + A: select all text in the current cell. |
V: paste cells below the current cell. | Ctrl + Z: undo. |
D, D: delete the current cell. | Ctrl + Y: redo. |
Shift + J or Shift + Down: select the next cell in the same column. | Ctrl + Home: go to the beginning of the cell. |
Shift + K or Shift + Up: select the previous cell in the same column. | Ctrl + End: go to the end of the cell. |
Ctrl + Shift + -: split the current cell at the cursor. | Ctrl + Left: go one word to the left. |
Z: undo cell deletion. | Ctrl + Right: go one word to the right. |
X: cut the current cell. | Tab: indent the current line. |
Shift + M: merge selected cells. | Shift + Tab: un-indent the current line. |
M: markdown , Y: Code | More added manually by editing |
import pandas as pd
df = pd.read_csv("heart-disease.csv")
df.head(10)
df.target.value_counts().plot(kind="bar")
If the above code does not work we need to import matplotlib
import matplotlib.pyplot as plt
Opening a csv file
- pd.read_csv("file.csv")
Pandas DataFrame
- df is a pandas DataFrame
- In df.target.value_counts().plot(kind="bar"), target is a column name; value_counts() counts the values and plot(kind="bar") makes a bar graph
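The same pattern can be tried without the csv file. This sketch builds a stand-in DataFrame inline (the values only mimic the heart-disease.csv `target` column, which is assumed, not included):

```python
import pandas as pd

# Inline stand-in for the heart-disease.csv "target" column (1 = disease).
df = pd.DataFrame({'target': [1, 0, 1, 1, 0, 1, 0, 1]})

counts = df['target'].value_counts()   # how many rows of each label
print(counts.to_dict())                # {1: 5, 0: 3}
# counts.plot(kind="bar") would draw the same bar chart as in the notes
```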
Image in markdown
- ![](img location or img link)
Pandas Data Analysis
Why pandas?
- Simple to use
- Integrated with many other data science & ML Python Tools
- Helps you get your data ready for machine learning
Learning on this section
- Most useful functions
- pandas Datatypes
- Importing & exporting data
- Describing data
- Viewing & Selecting data
- Manipulating data
Two Main Datatype
- Series is 1D, similar to a list in Python
- DataFrame is 2D, similar to a dictionary in Python
# series = 1 dimensional
series = pd.Series(['BMW','Toyota','Honda'])
colours = pd.Series(['Red','Blue','White'])
# DataFrame = 2 dimensional
car_data = pd.DataFrame({'Car make':series, 'Colour':colours})
- Import & export to csv
# import data
car_sales = pd.read_csv('car-sales.csv')
# Exporting to csv
car_sales.to_csv('exported-car-sales.csv', index=False) # index won't be exported
- Import & export to Excel
# import data
car_sales = pd.read_csv("car-sales.csv")
# Exporting to excel
car_sales.to_excel("exported-car-sales.xlsx", index=False) # index won't be exported
export_car_sales = pd.read_excel("exported-car-sales.xlsx")
# An attribute doesn't have brackets "()"; only a function has brackets "()"
#Attribute -- dtypes
car_sales.dtypes
#Function -- to_csv()
car_sales.to_csv()
car_sales.dtypes #get data types
car_sales.columns #get columns names
car_sales.index #get index range start,stop,step
car_sales.describe() #get statistics info of numeric columns
car_sales.info() #get more details similar to .dtypes
car_sales.mean(numeric_only=True) #get mean
#custom created series mean
car_prices = pd.Series([300,1500,111250])
car_prices.mean()
car_sales.sum() #get all column sum
car_sales['Doors'].sum() #get Door column sum
len(car_sales) #get length
car_sales #display the whole DataFrame
car_sales.head() #get top 5 rows
car_sales.head(7) #get top 7 rows
car_sales.tail() #get bottom 5 rows
# .loc = index location & .iloc = position
animals = pd.Series(['cat','dog','bird','panda','snake'])
# Custom index
animals = pd.Series(['cat','dog','bird','panda','snake'],index=[9,3,6,2,3])
animals.loc[3] #index location
animals.iloc[3] #position
# Get first four row
car_sales.loc[:3]
car_sales.head(4)
# Selecting individual columns
car_sales['Make']
car_sales.Make
# If column name contain spaces it won't work in dot way
car_sales['Odometer (KM)']
# car_sales.Odometer (KM) <-- This will give error
# Filtering
car_sales[car_sales['Make']=='Toyota'] # This will show only Toyota data from Make column
car_sales[car_sales['Odometer (KM)']>100000]
# Crossover
pd.crosstab(car_sales['Make'], car_sales['Doors'])
# More flexible than crosstab is groupby
car_sales.groupby(['Make','Colour','Price']).mean()
# Fixing the Price column, e.g. "$4,000.00" -> 4000.0
car_sales["Price"] = car_sales["Price"].str.replace(r'[\$\,]', '', regex=True).astype(float)
# Every data on "Make" column to lowercase
car_sales['Make']=car_sales['Make'].str.lower()
# Working with Missing data filling Odometer missing value with mean of Odometer
car_sales_missing['Odometer'].fillna(car_sales_missing['Odometer'].mean())
# Filling value in datatset 2 ways
# assigning way:
car_sales_missing['Odometer']=car_sales_missing['Odometer'].fillna(car_sales_missing['Odometer'].mean())
# inplace way:
car_sales_missing['Odometer'].fillna(car_sales_missing['Odometer'].mean(),inplace = True)
# Dropping missing value
car_sales_missing_dropped = car_sales_missing.dropna()
# Capitalize car names
car_sales['Make']=car_sales['Make'].str.capitalize()
# Column from series
seats_column = pd.Series([5,5,5,5,5])
car_sales['Seats']=seats_column
# Column from python list
fuel_economy = [7.5,9.2,5.0,9.6,8.7,4.7,7.6,8.7,3.0,4.5]
car_sales['Fuel per 100KM']=fuel_economy
car_sales
# Calculation
car_sales['Total fuel used (L)']=car_sales['Odometer (KM)']/100 *car_sales['Fuel per 100KM']
car_sales
# Creating a boolean column
car_sales['Passed road safety']=True
# creating a column from single value
car_sales['Number of wheels']=4
car_sales
# creating a dummy column to drop
dump=pd.Series([5,3,3,4,5,3,7,3,2,4])
car_sales['Dump']=dump
# Droping the dummy column
car_sales=car_sales.drop('Dump',axis=1)
# Using lambda to convert KM to Miles (divide by ~1.6 km per mile)
car_sales['Odometer (KM)'] = car_sales['Odometer (KM)'].apply(lambda x: x/1.6)