Wind

An AutoML library to predict the health of machinery used to generate renewable energy.

Overview

The Wind project is a collection of end-to-end solutions for machine learning tasks commonly found in monitoring wind energy production systems. Most tasks work with the sensor data emitted by those monitoring systems. We build on the foundational innovations in machine learning automation developed at the Data to AI Lab at MIT. This project is developed in close collaboration with Iberdrola, S.A.

The salient aspects of this customized project are:

  • A set of ready-to-use, well-tested pipelines for different machine learning tasks, vetted by testing against multiple publicly available datasets for the same task.
  • An easy interface to specify the task and pipeline, generate results, and summarize them.
  • A production-ready, deployable pipeline.
  • An easy interface to tune pipelines using the Bayesian Tuning and Bandits library.
  • A community-oriented infrastructure for incorporating new pipelines.
  • A robust continuous integration and testing infrastructure.
  • A learning database recording all past results: tasks, pipelines, and outcomes.

Concepts

Before diving into the software usage, we briefly explain some concepts and terminology.

Primitive

We call the smallest computational blocks used in a Machine Learning process primitives, which:

  • Can be either classes or functions.
  • Have some initialization arguments, which MLBlocks calls init_params.
  • Have some tunable hyperparameters, which have types and a list or range of valid values.

Template

Primitives can be combined to form what we call Templates, which:

  • Have a list of primitives.
  • Have some initialization arguments, which correspond to the initialization arguments of their primitives.
  • Have some tunable hyperparameters, which correspond to the tunable hyperparameters of their primitives.
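
As an illustrative sketch, a template could be stored as a JSON file like the one below, following the MLBlocks pipeline JSON format. The primitives and values shown are hypothetical examples, not one of the actual Wind templates:

{
    "primitives": [
        "sklearn.impute.SimpleImputer",
        "xgboost.XGBClassifier"
    ],
    "init_params": {
        "sklearn.impute.SimpleImputer#1": {
            "strategy": "mean"
        }
    }
}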

Pipeline

Templates can be used to build Pipelines by fixing a set of valid hyperparameter values for the Template. Hence, Pipelines:

  • Have a list of primitives, which corresponds to the list of primitives of their template.
  • Have some initialization arguments, which correspond to the initialization arguments of their template.
  • Have some hyperparameter values, which fall within the ranges of valid tunable hyperparameters of their template.

A pipeline can be fitted and evaluated using the MLPipeline API in MLBlocks.
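
For example, here is a minimal sketch of that API, assuming generic scikit-learn and xgboost primitives from MLPrimitives rather than one of the actual Wind templates:

from mlblocks import MLPipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build a pipeline from a list of primitives (a minimal template)
pipeline = MLPipeline(['sklearn.preprocessing.StandardScaler',
                       'xgboost.XGBClassifier'])

# Fix a set of valid values for its tunable hyperparameters
pipeline.set_hyperparameters({
    'xgboost.XGBClassifier#1': {'n_estimators': 100, 'max_depth': 3}
})

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)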

Current tasks and pipelines

In our current phase, we are addressing two tasks: time series classification and time series regression. To provide solutions for these two tasks we have two components.

WindPipeline

This class is in charge of learning from the data and making predictions by building MLBlocks pipelines and later tuning them using BTB.

WindLoader

A class responsible for loading the time series data from CSV files and returning it in a format ready to be used by the WindPipeline.

Wind Dataset

A dataset is a folder that contains time series data and information about a Machine Learning problem in the form of CSV and JSON files.

The expected contents of the dataset folder are 4 CSV files:

  • A Turbines table that contains:
    • turbine_id: column with the unique id of each turbine.
    • A number of additional columns with information about each turbine.
  • A Signals table that contains:
    • signal_id: column with the unique id of each signal.
    • A number of additional columns with information about each signal.
  • A Readings table that contains:
    • reading_id: Unique identifier of this reading.
    • turbine_id: Unique identifier of the turbine which this reading comes from.
    • signal_id: Unique identifier of the signal which this reading comes from.
    • timestamp: Time at which the reading took place, as an ISO formatted datetime.
    • value: Numeric value of this reading.
  • A Targets table that contains:
    • target_id: Unique identifier of this target.
    • turbine_id: Unique identifier of the turbine which this label corresponds to.
    • timestamp: Time associated with this target.
    • target: The value that we want to predict. This can either be a numerical value or a categorical label.
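
As a sketch of how this folder maps to code, the four tables could be loaded directly with pandas as shown below. The file names are assumptions derived from the table names above; the demo dataset may use different ones:

import os
import pandas as pd

dataset_path = 'examples/datasets/wind/'

# File names assumed to match the tables described above
turbines = pd.read_csv(os.path.join(dataset_path, 'turbines.csv'))
signals = pd.read_csv(os.path.join(dataset_path, 'signals.csv'))
readings = pd.read_csv(os.path.join(dataset_path, 'readings.csv'), parse_dates=['timestamp'])
targets = pd.read_csv(os.path.join(dataset_path, 'labels.csv'), parse_dates=['timestamp'])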

Tuning

We call tuning the process of, given a dataset and a template, finding the pipeline derived from the given template that gets the best possible score on the given dataset.

This process usually involves fitting and evaluating multiple pipelines with different hyperparameter values on the same data while using optimization algorithms to deduce which hyperparameters are more likely to get the best results in the next iterations.

We call each one of these tries a tuning iteration.
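
The sketch below illustrates the idea, using plain random search as a deliberately simplified stand-in for the Bayesian optimizer; the real process uses the BTB library to propose values informed by the scores observed so far:

import random

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

best_score, best_params = float('-inf'), None
for iteration in range(10):  # 10 tuning iterations
    # Propose hyperparameter values (BTB would use past scores to guide this)
    params = {
        'n_estimators': random.choice([10, 50, 100]),
        'max_depth': random.choice([3, 5, None]),
    }
    # Fit and evaluate a pipeline with these values using cross validation
    score = cross_val_score(RandomForestClassifier(**params), X, y, cv=3).mean()
    # Keep the best hyperparameters found so far
    if score > best_score:
        best_score, best_params = score, params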

Getting Started

Installation

The simplest and recommended way to install Wind is using pip:

pip install wind

For development, you can also clone the repository and install it from source:

git clone [email protected]:D3-AI/wind.git
cd wind
make install-develop

Usage Example

In this example we will load some demo data using the WindLoader and feed it to the WindPipeline, which will search for the best possible pipeline, fit it using the given data, and then make predictions with it.

Load and explore the data

We first create a loader instance passing:

  • The path to the dataset folder
  • The name of the target table
  • The name of the target column
  • Optionally, the names of the readings, turbines and signals tables, in case they are different from the default ones.

from wind.loader import WindLoader

loader = WindLoader('examples/datasets/wind/', 'labels', 'label')

Then we call the loader.load method, which will return three elements:

  • X: The contents of the target table, where the training examples can be found, without the target column.
  • y: The target column, as extracted from the target table.
  • tables: A dictionary containing the additional tables that the Pipeline will need to run: readings, turbines and signals.

X, y, tables = loader.load()
X.head(5)
   label_id  turbine_id   timestamp
0         0           0  2013-01-01
1         1           0  2013-01-02
2         2           0  2013-01-03
3         3           0  2013-01-04
4         4           0  2013-01-05
y.head(5)
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: label, dtype: float64
tables.keys()
dict_keys(['readings', 'signals', 'turbines'])
tables['turbines'].head()
   turbine_id       name
0           0  Turbine 0
tables['signals'].head()
   signal_id                                           name
0          0  WTG01_Grid Production PossiblePower Avg. (1)
1          1  WTG02_Grid Production PossiblePower Avg. (2)
2          2  WTG03_Grid Production PossiblePower Avg. (3)
3          3  WTG04_Grid Production PossiblePower Avg. (4)
4          4  WTG05_Grid Production PossiblePower Avg. (5)
tables['readings'].head()
   reading_id  turbine_id  signal_id   timestamp  value
0           0           0          0  2013-01-01  817.0
1           1           0          1  2013-01-01  805.0
2           2           0          2  2013-01-01  786.0
3           3           0          3  2013-01-01  809.0
4           4           0          4  2013-01-01  755.0

Split the data

If we want to split the data into train and test subsets, we can do so by splitting the X and y variables.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Finding the best Pipeline

Once we have loaded the data, we create a WindPipeline instance by passing:

  • template (string): the name of a template or the path to a template json file.
  • metric (string or function): The name of the metric to use, or a metric function.
  • cost (bool): Whether the metric is a cost function to be minimized or a score to be maximized.

Optionally, we can also pass details about the cross validation configuration:

  • stratify
  • cv_splits
  • shuffle
  • random_state

from wind.pipeline import WindPipeline

pipeline = WindPipeline('wind_classification', 'accuracy', cv_splits=2)
Using TensorFlow backend.

Once we have created the pipeline, we can call its tune method to find the best possible hyperparameters for our data, passing the X, y, and tables variables returned by the loader, as well as an indication of the number of tuning iterations that we want to perform.

pipeline.tune(X_train, y_train, tables, iterations=0)

After the tuning process has finished, the best hyperparameters found have already been set on the pipeline.

We can see these hyperparameters by calling the get_hyperparameters method.

import json

print(json.dumps(pipeline.get_hyperparameters(), indent=4))
{
    "pandas.DataFrame.resample#1": {
        "rule": "1D",
        "time_index": "timestamp",
        "groupby": [
            "turbine_id",
            "signal_id"
        ],
        "aggregation": "mean"
    },
    "pandas.DataFrame.unstack#1": {
        "level": "signal_id",
        "reset_index": true
    },
    "featuretools.EntitySet.entity_from_dataframe#1": {
        "entityset_id": "entityset",
        "entity_id": "readings",
        "index": "index",
        "variable_types": null,
        "make_index": true,
        "time_index": "timestamp",
        "secondary_time_index": null,
        "already_sorted": false
    },
    "featuretools.EntitySet.entity_from_dataframe#2": {
        "entityset_id": "entityset",
        "entity_id": "turbines",
        "index": "turbine_id",
        "variable_types": null,
        "make_index": false,
        "time_index": null,
        "secondary_time_index": null,
        "already_sorted": false
    },
    "featuretools.EntitySet.entity_from_dataframe#3": {
        "entityset_id": "entityset",
        "entity_id": "signals",
        "index": "signal_id",
        "variable_types": null,
        "make_index": false,
        "time_index": null,
        "secondary_time_index": null,
        "already_sorted": false
    },
    "featuretools.EntitySet.add_relationship#1": {
        "parent": "turbines",
        "parent_column": "turbine_id",
        "child": "readings",
        "child_column": "turbine_id"
    },
    "featuretools.dfs#1": {
        "target_entity": "turbines",
        "index": "turbine_id",
        "time_index": "timestamp",
        "agg_primitives": null,
        "trans_primitives": null,
        "copy": false,
        "encode": false,
        "max_depth": 1,
        "remove_low_information": true
    },
    "mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
        "copy": true,
        "features": "auto",
        "max_labels": 0
    },
    "sklearn.impute.SimpleImputer#1": {
        "missing_values": NaN,
        "fill_value": null,
        "verbose": false,
        "copy": true,
        "strategy": "mean"
    },
    "sklearn.preprocessing.StandardScaler#1": {
        "with_mean": true,
        "with_std": true
    },
    "xgboost.XGBClassifier#1": {
        "n_jobs": -1,
        "n_estimators": 100,
        "max_depth": 3,
        "learning_rate": 0.1,
        "gamma": 0,
        "min_child_weight": 1
    }
}

as well as the obtained cross validation score by looking at the score attribute of the pipeline object:

pipeline.score
0.6592421640188922

Once we are satisfied with the obtained cross validation score, we can proceed to call the fit method, passing the same data elements again.

pipeline.fit(X_train, y_train, tables)

After this, we are ready to make predictions on new data:

predictions = pipeline.predict(X_test, tables)
predictions[0:5]
array([0., 0., 0., 0., 0.])
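
Since we kept a test split aside, we can sanity-check these predictions against the held-out labels, for instance with scikit-learn. This is a usage sketch; in practice, use the metric that matches the one the pipeline was created with:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)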
