Wrangle data and train a model using different libraries, like a real data scientist.
For this project there will be no "walkthrough": you are free to do your own machine learning, as long as you follow these guidelines:
- Choose any dataset from the UCI repository (http://archive.ics.uci.edu/ml/index.php), Kaggle Datasets, data.gouv, or anywhere else. The dataset doesn't have to be public, but I need access to it to validate the project.
- Download the dataset file and upload it to Databricks. If you want to use another dataset, that's okay as long as I can access it.
- Do a preliminary exploration using pandas on a sample of the dataset (optional but recommended; a minimal sketch follows this list):
- You might want to start making hypotheses.
- Please worry about the data quality and any kind of bias within it
- PLEASE WORRY ABOUT BIAS IN DATA
- Make a nice plot illustrating any intuition you have about the data.
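
A minimal sketch of what this preliminary exploration could look like; the file name `my_dataset.csv` and the column `some_column` are placeholders for your own data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read only a sample of rows to keep the exploration fast
sample = pd.read_csv("my_dataset.csv", nrows=10_000)

# Basic quality checks: types, missing values, duplicates
sample.info()
print(sample.isna().mean().sort_values(ascending=False))
print("duplicated rows:", sample.duplicated().sum())

# Quick look at distributions to spot obvious bias or outliers
print(sample.describe(include="all"))

# One simple plot to back up (or refute) an intuition
sample["some_column"].hist(bins=50)
plt.title("Distribution of some_column (sample)")
plt.show()
```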
- Explore it using Spark DataFrames, including at least the following steps (see the sketch after this list):
- Reading the file into a DataFrame
- Running some aggregations and explorations using DataFrame functions
- The fancier, the better
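
A sketch of what this Spark exploration could look like on Databricks; the file path, the column names (`some_category`, `some_numeric`), and the aggregations are placeholders, so adapt them to your dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession is already available as `spark`;
# this line just makes the snippet self-contained elsewhere.
spark = SparkSession.builder.getOrCreate()

# Reading the file into a DataFrame
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/FileStore/tables/my_dataset.csv"))

df.printSchema()
print("rows:", df.count())

# Some aggregations and explorations using DataFrame functions
(df.groupBy("some_category")
   .agg(F.count("*").alias("n"),
        F.avg("some_numeric").alias("avg_value"))
   .orderBy(F.desc("n"))
   .show(10))

# Null counts per column: a cheap data-quality check
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()
```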
- Your solution notebook must have a part using MLlib, including at least the following steps (a sketch follows this list):
- Converting the data into MLlib vectors/matrices
- Applying some statistics with MLlib's API
- Learning a classification or regression model
- Applying the model to the test data and computing the errors
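
A sketch of this MLlib part using the RDD-based `pyspark.mllib` API, assuming the DataFrame `df` from above with numeric feature columns and a binary label; all column names are placeholders:

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

feature_cols = ["f1", "f2", "f3"]   # placeholder feature columns
label_col = "label"                 # placeholder binary (0/1) label column

# Converting rows into MLlib labeled points (label + dense feature vector)
points = (df.select(label_col, *feature_cols)
            .rdd
            .map(lambda row: LabeledPoint(float(row[0]), Vectors.dense(list(row[1:])))))

# Applying some statistics with the MLlib API
summary = Statistics.colStats(points.map(lambda p: p.features))
print("means:", summary.mean())
print("variances:", summary.variance())

# Train / test split, then learn a classification model
train, test = points.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegressionWithLBFGS.train(train)

# Applying the model to the test data and computing the error rate
labels_and_preds = test.map(lambda p: (p.label, model.predict(p.features)))
test_error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("test error:", test_error)
```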
- You'll also train another model using Pipelines, including at least the following steps (sketched below):
- Creating a pipeline with at least one feature extraction/manipulation and one model estimator
- Fitting the pipeline to the training data
- Applying the model to the test data and computing the errors
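
A sketch of the Pipelines part with the DataFrame-based `pyspark.ml` API; again, `f1`, `f2`, `f3` and `label` are placeholder column names:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# One feature manipulation stage (assemble + scale) and one model estimator
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# Fitting the pipeline to the training data
pipeline_model = pipeline.fit(train_df)

# Applying the model to the test data and computing the errors
predictions = pipeline_model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="label")  # area under ROC by default
print("test AUC:", evaluator.evaluate(predictions))
```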
- (OPTIONAL) Finally, apply a model built with a third-party ML library such as TensorFlow or PyTorch. The goal here is to have you mix different frameworks in the same project: for instance, you can call one framework from another, like tuning scikit-learn hyper-parameters using Spark, or do something different with TensorFlow or PyTorch. You can use different notebook providers (Colab and Databricks); if you do, please prefix the name of each notebook with the name of the cloud provider.
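
One possible sketch of the "mix frameworks" idea mentioned above: using Spark to parallelise a scikit-learn hyper-parameter search over the pandas sample from the preliminary exploration. The column names and the parameter grid are placeholders, and scikit-learn must be available on the workers (it is on the Databricks ML runtime):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small pandas sample from the preliminary exploration (placeholder columns)
X = sample[["f1", "f2", "f3"]].values
y = sample["label"].values

# Ship the (small) training data to the workers once
data_bc = spark.sparkContext.broadcast((X, y))

def evaluate(n_estimators):
    """Train/score one scikit-learn model for a given hyper-parameter value."""
    X_, y_ = data_bc.value
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    return n_estimators, cross_val_score(clf, X_, y_, cv=3).mean()

# Evaluate the grid in parallel on the Spark cluster
grid = [10, 50, 100, 200]
results = spark.sparkContext.parallelize(grid).map(evaluate).collect()
print(sorted(results, key=lambda r: -r[1]))
```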
- No report needed!! Just add some comments as part of the notebook; it should be self-sufficient. I'm expecting a short written analysis (and you should know that "we found nothing after testing this and that" is already a valuable result), as well as some plots. You can use any plotting library (matplotlib, bokeh...). Just be sure to watch this video first.
To grade this, I will:
- run the notebooks
- run the other pieces of code
- read the code (so you can add meaningful comments even on parts that are not running well)
- grade the overall code quality as well as complexity
- grade the overall presentation of the repository
If nothing runs, I'll stop at step 1 or 2, which won't be good for you.
- Don't forget to split your data into training and test (and validation if you want) sets
- Your code should be readable. DO NOT FORGET PEP 8
- You can do this project in groups of 4 members
- Submissions after the deadline won't be considered; it's much better to submit an incomplete solution than nothing at all!
- If you have any question or problem, don't hesitate to send me an e-mail; I try to answer as quickly as possible (usually under 24h). If I don't, you can ask me on Gitter.
- If you want to use this project as a display of your skills, you can upload it to GitHub when you're done. It's a good opportunity (and for some tech companies, a GitHub profile can be better than a resume). If you're not sure how, let me know; I'll help you and give you advice on how to make it look good. This could be really important for you if you're targeting tech jobs.
- Even if it seems obvious, do not cheat. I won't be nice if I suspect it.