Dataflow Flex Template in Python

This repository contains a Dataflow Flex Template written in Python that can be used as a starting point for building Dataflow jobs to run in STOIX with the Dataflow runner.

The code is based on the same example data as the Google Cloud Python quickstart: “King Lear”, a tragedy by William Shakespeare.

The Dataflow job reads the file content, counts the occurrences of each word, and inserts the counts into a BigQuery table. The schedule date is appended to the table name, producing a sharded output table.
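The shape of the pipeline looks roughly like the sketch below. This is a minimal illustration, not the exact code in this repository: the gs:// path is the sample data used later in this README, while the <PROJECT ID>/<DATASET> placeholders and the _YYYYMMDD table suffix derived from stoix_scheduled are assumptions based on BigQuery's sharded-table convention.

import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Table suffix derived from the stoix_scheduled option,
    # e.g. 2021-01-01T00:00:00Z -> 20210101 (assumed format).
    table_suffix = "20210101"
    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://dataflow-samples/shakespeare/kinglear.txt")
            | "Split" >> beam.FlatMap(lambda line: re.findall(r"[\w']+", line))
            | "Count" >> beam.combiners.Count.PerElement()
            | "ToRow" >> beam.Map(lambda kv: {"word": kv[0], "count": kv[1]})
            | "Write" >> beam.io.WriteToBigQuery(
                table=f"<PROJECT ID>:<DATASET>.kinglear_{table_suffix}",
                schema="word:STRING,count:INTEGER",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )


if __name__ == "__main__":
    run()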

Source data: gs://dataflow-samples/shakespeare/kinglear.txt

Template maintained by STOIX.

Configuration

The job is configured with the following pipeline options:

  • stoix_scheduled – Scheduled datetime in RFC 3339 format
  • input_file – Text file to read
  • output_dataset – BigQuery dataset for the output table
  • output_table_prefix – BigQuery output table name prefix
  • project – Google Cloud project ID

When using the Dataflow runner, stoix_scheduled is set automatically; the other pipeline options can be added as described in the Dataflow runner README.
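One common way to expose options like these is to declare them on a PipelineOptions subclass, as sketched below. The class name CountWordsOptions is hypothetical and not necessarily how this repository declares them; project is already handled by Beam's built-in Google Cloud options and does not need its own flag.

from apache_beam.options.pipeline_options import PipelineOptions


class CountWordsOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Custom flags matching the pipeline options listed above.
        parser.add_argument("--stoix_scheduled", help="Scheduled datetime in RFC 3339 format")
        parser.add_argument("--input_file", help="Text file to read")
        parser.add_argument("--output_dataset", help="BigQuery dataset for the output table")
        parser.add_argument("--output_table_prefix", help="BigQuery output table name prefix")

The values can then be read inside the pipeline with options.view_as(CountWordsOptions).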

Test our code

We use Tox to format, test, and lint the code locally. Install it with pip install tox and then run tox from the project folder, as shown below.
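That is:

$ pip install tox
$ tox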

Run pipeline

To work with the code locally, you can use a Python virtual environment. Make sure to use Python 3.7.10, as that is the version supported by Google Dataflow.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .

Run on local machine

See the Google Cloud Python quickstart for a further description of the arguments.

python -m main \
    --region europe-north1 \
    --runner DirectRunner \
    --stoix_scheduled 2021-01-01T00:00:00Z \
    --input_file gs://dataflow-samples/shakespeare/kinglear.txt \
    --output_table_prefix kinglear \
    --output_dataset <DATASET> \
    --project <PROJECT ID> \
    --temp_location gs://<BUCKET>/tmp/

Build Docker image for STOIX

To run the pipeline, we need to package the Flex Template in a Docker image and push it to a Docker image repository. In this example we use Docker Hub.

Set the tag to the name and version of your pipeline, e.g. stoix/count-words:1.0.0.

$ docker build --tag stoix/count-words:1.0.0 .

Then we push the image to our Docker image repository.

$ docker push stoix/count-words:1.0.0

Run Dataflow on STOIX

Now we can run the Dataflow Flex Template job using the Dataflow runner. Add a new job with the image stoix/dataflow-runner and the following environment variables:

  • GCP_PROJECT_ID: <PROJECT ID>
  • GCP_REGION: europe-north1
  • GCP_SERVICE_ACCOUNT: Base64-encoded service account JSON
  • JOB_IMAGE: stoix/count-words:1.0.0
  • JOB_NAME_PREFIX: count-words
  • JOB_PARAM_INPUT_FILE: gs://dataflow-samples/shakespeare/kinglear.txt
  • JOB_PARAM_OUTPUT_DATASET: dataflow
  • JOB_PARAM_OUTPUT_TABLE_PREFIX: kinglear
  • JOB_SDK_LANGUAGE: python

Note: When running this in production, set GCP_SERVICE_ACCOUNT as a secret instead of an environment variable.

License

MIT
