PdpCLI
PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by pdpipe from a configuration file. You can also extend pipeline stages and data readers / writers by using your own Python scripts.
Features
- Process pandas DataFrames from the CLI without writing Python scripts
- Support multiple configuration file formats: YAML, JSON, Jsonnet
- Read / write data files in the following formats: CSV, TSV, JSON, JSONL, pickled DataFrame
- Import / export data with multiple protocols: S3 / Database (MySQL, Postgres, SQLite, ...) / HTTP(S)
- Extensible pipeline and data readers / writers
Installation
Installing the library is simple using pip.
$ pip install "pdpcli[all]"
Tutorial
Basic Usage
- Write a pipeline config file config.yml like the one below. The type fields under pipeline correspond to the snake-cased class names of the PdpipelineStages. Other fields such as stages and columns are the parameters of the __init__ methods of the corresponding classes. Internally, this configuration file is converted to Python objects by colt.

    pipeline:
      type: pipeline
      stages:
        drop_columns:
          type: col_drop
          columns:
            - name
            - job

        encode:
          type: one_hot_encode
          columns: sex

        tokenize:
          type: tokenize_text
          columns: content

        vectorize:
          type: tfidf_vectorize_token_lists
          column: content
          max_features: 10
- Build a pipeline by training on train.csv. The following command generates a pickled pipeline file pipeline.pkl after training. If you specify a URL as the file path, the file is automatically downloaded and cached.

    $ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
- Apply the fitted pipeline to test.csv and write the processed output to processed_test.jsonl with the following command. PdpCLI automatically detects the output file format from the file name. In this example, the processed DataFrame is exported in the JSON Lines format.

    $ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
- You can also run the pipeline directly from a config file without fitting the pipeline beforehand.

    $ pdp apply config.yml test.csv --output-file processed_test.jsonl
- It is possible to override or add parameters by adding command line arguments:

    $ pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name
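The dotted override in the last step can be pictured as an update to the nested configuration. The sketch below is an illustration only, not PdpCLI's actual implementation: the helper name apply_override is hypothetical, and it assumes the configuration is a plain nested dict and that values stay as strings.

```python
import copy


def apply_override(config, override):
    """Return a copy of config with one 'dotted.key=value' override applied.

    Hypothetical helper for illustration; not part of PdpCLI.
    """
    dotted_key, _, value = override.partition("=")
    keys = dotted_key.split(".")
    result = copy.deepcopy(config)  # leave the caller's config untouched
    node = result
    # Walk the intermediate mappings, creating any that are missing.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value  # values are kept as strings in this sketch
    return result


config = {"pipeline": {"stages": {"drop_columns": {"type": "col_drop"}}}}
updated = apply_override(config, "pipeline.stages.drop_columns.column=name")
print(updated["pipeline"]["stages"]["drop_columns"])
```

An existing key is overwritten and a missing key is added, which matches the "override or add" wording above.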
Data Reader / Writer
PdpCLI automatically detects a suitable data reader / writer based on the given file name. If you need to use a different data reader / writer, add a reader or writer config to config.yml. The following config is an example that uses the SQL data reader. The SQL reader fetches records from the specified database and converts them into a pandas DataFrame.
    reader:
      type: sql
      dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.postgres.server/your_database
Config files are interpreted by OmegaConf, so ${env:...} is interpolated from environment variables.
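To make the interpolation concrete, here is a minimal sketch of how a ${env:NAME} placeholder could be replaced by the value of an environment variable. OmegaConf does this work in PdpCLI; the function interpolate_env and the shortened DSN are hypothetical and only imitate the visible behaviour.

```python
import os
import re

# Matches placeholders such as ${env:POSTGRES_USER}.
ENV_PATTERN = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(text, environ=None):
    """Replace every ${env:NAME} in text with the value of NAME.

    Hypothetical helper; raises KeyError if a variable is not set.
    """
    environ = os.environ if environ is None else environ
    return ENV_PATTERN.sub(lambda match: environ[match.group(1)], text)


# Shortened, made-up DSN used only for this illustration.
dsn = "postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@host/db"
resolved = interpolate_env(dsn, {"POSTGRES_USER": "user", "POSTGRES_PASSWORD": "password"})
print(resolved)  # postgres://user:password@host/db
```

Keeping credentials in environment variables means the config file itself can be committed without secrets.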
Prepare your SQL file query.sql to fetch data from the database:

    select * from your_table limit 1000
You can execute the pipeline with SQL data reader via:
$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql
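When no reader or writer is configured, the format is chosen from the file name, as described at the start of this section. The sketch below shows one way such a lookup could work. The extension table and the name detect_format are assumptions made for illustration (in particular the .pkl extension for pickled DataFrames) and do not reflect PdpCLI's internal code.

```python
from pathlib import Path

# Assumed mapping from file extension to format name (illustrative only).
FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".pkl": "pickle",
}


def detect_format(file_name):
    """Guess a data format from the extension of a file name or URL."""
    suffix = Path(file_name).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"cannot detect a format for {file_name!r}")
    return FORMATS[suffix]


print(detect_format("processed_test.jsonl"))
print(detect_format("https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv"))
```

A file such as query.sql has no entry in this table, which is why a case like the SQL reader needs an explicit reader config.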
Plugins
By using plugins, you can extend PdpCLI. This plugin feature enables you to use your own pipeline stages, data readers / writers and commands.
Add a new stage
- Write your plugin script mypdp.py like the one below. Stage.register("<stage-name>") registers your pipeline stages, and you can specify these stages by writing type: <stage-name> in your config file.

    import pdpcli

    @pdpcli.Stage.register("print")
    class PrintStage(pdpcli.Stage):
        def _prec(self, df):
            return True

        def _transform(self, df, verbose):
            print(df.to_string(index=False))
            return df
- Update config.yml to use your plugin.

    pipeline:
      type: pipeline
      stages:
        drop_columns:
        ...

        print:
          type: print

        encode:
        ...
- Execute the command with --module mypdp, and you will see the processed DataFrame printed after drop_columns runs.

    $ pdp apply config.yml test.csv --module mypdp
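The register decorator used above follows a common registry pattern: a base class keeps a mapping from names to subclasses, and the type field in the config selects an entry. The following is a simplified, self-contained sketch of that pattern. SketchStage, by_name, and ShoutStage are hypothetical names; PdpCLI's real Stage class builds on colt and pdpipe and is more involved.

```python
class SketchStage:
    """Simplified stand-in for a registrable base class (not pdpcli.Stage)."""

    _registry = {}

    @classmethod
    def register(cls, name):
        # Return a decorator that records the subclass under the given name.
        def decorator(subclass):
            cls._registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def by_name(cls, name):
        # Look up the class that a config entry such as "type: shout" refers to.
        return cls._registry[name]


@SketchStage.register("shout")
class ShoutStage(SketchStage):
    def apply(self, text):
        return text.upper()


stage_cls = SketchStage.by_name("shout")
print(stage_cls().apply("hello"))  # HELLO
```

Because registration happens when the class body is executed, the module that defines the stage has to be imported before the config is resolved, which is the role of the --module option.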
Add a new command
In addition to stages, you can also add new commands.
- Add the following script to mypdp.py. This greet command prints out a greeting message with your name.

    @pdpcli.Subcommand.register(
        name="greet",
        description="say hello",
        help="say hello",
    )
    class GreetCommand(pdpcli.Subcommand):
        requires_plugins = False

        def set_arguments(self):
            self.parser.add_argument("--name", default="world")

        def run(self, args):
            print(f"Hello, {args.name}!")
- To register this command, create a .pdpcli_plugins file that lists module names, one per line. Because of the module import order, the --module option cannot be used for command registration.

    $ echo "mypdp" > .pdpcli_plugins
- Run the following command to get a message like the one below. With the .pdpcli_plugins file in place, you do not need to add the --module option to the command line on each execution.

    $ pdp greet --name altescy
    Hello, altescy!
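A plugin file with one module name per line can be processed with a few lines of standard-library Python. The sketch below is an assumption about how such a file might be read and imported, not PdpCLI's own loader; the function names parse_plugin_names, import_plugins, and load_plugins are hypothetical.

```python
import importlib
from pathlib import Path


def parse_plugin_names(text):
    """Return the non-empty, stripped lines of a plugin file's content."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def import_plugins(names):
    """Import each named module so that its register decorators run."""
    return [importlib.import_module(name) for name in names]


def load_plugins(plugin_file=".pdpcli_plugins"):
    """Import all modules listed in plugin_file; return [] if it is missing."""
    path = Path(plugin_file)
    if not path.exists():
        return []
    return import_plugins(parse_plugin_names(path.read_text()))


print(parse_plugin_names("mypdp\n\n  other_plugin  \n"))
```

Importing the listed modules before the command line is parsed is what would allow a command such as greet to be known to the argument parser in time.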