The aim of this project is to provide great modularity when building data processes based on a sequence of script executions.
The original idea was to provide a central script that receives a configuration file as input, from which it imports and executes the different scripts.
The power of this tool lies in its ability to inject the result of a given script into a target script as a parameter, based solely on a simple JSON configuration file.
First, clone this repository:
git clone https://github.com/jossefaz/pypliner-data-processor.git
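Then change into the project directory (the default directory name created by git clone):
cd pypliner-data-processor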
Create a new configuration file (.json) following the example below.
[
    {
        "run_example" : {
            "Tool" : "tool_example",
            "Args" : {
                "param1" : "This is a test",
                "param2" : "This is another test"
            }
        },
        "run_example2" : {
            "Tool" : "tool_example",
            "Args" : {
                "param1" : "This is a test 2",
                "param2" : "This is another test 2"
            }
        },
        "order" : ["run_example", "run_example2"]
    }
]
Now let's explain the different parameters:
The configuration file is basically a list of objects (dictionaries), where each of them represents an execution pipeline.
So it begins as a list ([]).
Inside it, we define a global object for each pipeline.
The keys of this object represent the names of the processes (feel free to give a name that makes explicit what the process aims to achieve).
So far we have:
[
    {
        "run_my_first_script" : {
        }
    }
]
Now we need some mandatory keys to indicate which script this process will execute and what its arguments are.
Let's add the Tool and Args keys:
"run_my_first_script" : {
"Tool" : "tool_example"
"Args" : {
"param1" : "This is a test",
"param2" : "This is another test"
}
}
The Tool key must be the name of an existing script (that you will build) under the directory Tools -> executables -> <ENVIRONMENT>.
The ENVIRONMENT is a folder corresponding to the runtime environment (dev, prod or test), which is defined by the --env runtime variable (see the Run section below).
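For illustration, assuming the tools are plain Python files (the file name here is just an example), the layout looks something like this:
Tools/
    executables/
        dev/
            tool_example.py
        prod/
        test/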
There is a tool_example script that you can find in the Tools -> executables -> dev directory.
This script is very basic:
def main(param1, param2):
    print("param1 is : ", param1)
    print("param2 is : ", param2)
    return param1 + " from main"
As you can see here, the names of the parameters must be the same as those defined in the configuration file (if this is not the case, an ArgumentMissingException will be raised; this is a custom exception, see the Exceptions section below).
Each script must have a main function. This function can be parameter-less, but if you do add parameters to your main function, their names must match those in the configuration file (see the sketch below).
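For example, a minimal parameter-less tool could look like this (no_args_example is a hypothetical file name, not part of the repository):
# Tools/executables/dev/no_args_example.py (hypothetical example)
def main():
    # No parameters: the "Args" object for this process can simply be left empty
    print("This tool takes no arguments")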
The order list defines the order in which pypliner will execute the scripts. The order in which the processes appear in the configuration does not matter; only the order list determines the execution order, as the example below shows.
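For instance, in this hypothetical configuration, run_b is declared after run_a but is executed first because of the order list:
[
    {
        "run_a" : {
            "Tool" : "tool_example",
            "Args" : { "param1" : "A", "param2" : "B" }
        },
        "run_b" : {
            "Tool" : "tool_example",
            "Args" : { "param1" : "C", "param2" : "D" }
        },
        "order" : ["run_b", "run_a"]
    }
]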
In the world of data processing, it is often necessary to link the results of different processes together: for example, the result of process 1 will often be used as a basis for process 2 to run.
Pypliner allows you to inject the result of one script into another simply by writing the source process's name in the configuration:
[
    {
        "run_example1" : {
            "Tool" : "tool_example",
            "Args" : {
                "param1" : "This is a test",
                "param2" : "This is another test"
            }
        },
        "run_example2" : {
            "Tool" : "tool_example",
            "Args" : {
                "param1" : "run_example1", <<==== Here we inject the result of the run_example1 process defined above as a parameter of this second script
                "param2" : "This is another test 2"
            }
        },
        "order" : ["run_example1", "run_example2"] <<==== To inject between processes, you must respect the order logic too: you cannot inject the result of a process that has not been executed yet
    }
]
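Assuming both processes use the tool_example script shown above, run_example1's main returns "This is a test from main", and that value is injected as param1 of run_example2, so the second run would print something like:
param1 is :  This is a test from main
param2 is :  This is another test 2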
- --logpath, -lp : path for writing logs (the default value is ./logs, i.e. the running directory)
- --logconfig, -lc : path of the logger configuration file (this project comes with a default logger configuration file that you can use as a base for your own logger configuration; the default value is "Config/prod/logger.json")
- --config, -cfg : path of the configuration file
- --env, -e : defines the runtime environment ('DEV', 'PROD' and 'TEST' are the possible values; the default value is 'DEV')
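For example, assuming the entry point is pypliner.py (check the repository for the actual script name to invoke), a run could look like this:
python pypliner.py --config ./my_pipeline.json --env DEV --logpath ./logs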