Toxicity comments crawler
Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.
The toxicity level of each comment is calculated using the Perspective API.
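As a rough illustration of how a comment could be scored, the sketch below builds a Perspective API `comments:analyze` request for the `TOXICITY` attribute and reads the summary score from the response. The helper names (`build_request_body`, `extract_toxicity`, `score_comment`) are hypothetical and are not taken from this project; the crawler's actual client code may differ.

```python
import json
import urllib.request

# Public endpoint of the Perspective API comment analyzer.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request_body(text: str) -> dict:
    """Build an analyze request that asks only for the TOXICITY attribute."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: dict) -> float:
    """Return the summary TOXICITY score (between 0.0 and 1.0) from a response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def score_comment(text: str, api_key: str) -> float:
    """Send one comment to the API and return its score (hypothetical helper)."""
    request = urllib.request.Request(
        f"{ANALYZE_URL}?key={api_key}",
        data=json.dumps(build_request_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return extract_toxicity(json.load(reply))
```

Calling `score_comment("some comment", api_key)` performs a network request, so it needs a valid `PERSPECTIVE_API_KEY`; the two pure helpers can be exercised offline.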
Architecture
Pending
Usage
To run the crawler, you need to provide the following environment variables:
Variable | Description | Default | Required |
---|---|---|---|
AWS_ROLE_ARN | AWS Role ARN | None | Optional |
AWS_WEB_IDENTITY_TOKEN_FILE | AWS Web Identity Token File | None | Optional |
AWS_ACCESS_KEY_ID | AWS Access Key ID | None | Optional |
AWS_SECRET_ACCESS_KEY | AWS Secret Access Key | None | Optional |
AWS_SESSION_TOKEN | AWS Session Token | None | Optional |
AWS_REGION | AWS Region | us-east-1 | Required |
AWS_S3_BUCKET | AWS S3 Bucket | None | Required |
LOG_LEVEL | Log level | INFO | Optional |
PERSPECTIVE_API_KEY | Perspective API Key | None | Required |
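The table above can be read as a small configuration contract. The following is a minimal sketch of how those variables might be loaded and validated, not the crawler's real code: `CrawlerConfig` and `load_config` are hypothetical names, and the sketch assumes that `AWS_REGION` falls back to its documented default of `us-east-1` when unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variables the table marks as required and that have no default value.
REQUIRED_WITHOUT_DEFAULT = ("AWS_S3_BUCKET", "PERSPECTIVE_API_KEY")


@dataclass(frozen=True)
class CrawlerConfig:
    """Hypothetical container for the settings listed in the table."""
    aws_region: str
    aws_s3_bucket: str
    perspective_api_key: str
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> CrawlerConfig:
    """Read settings from the environment, failing fast on missing values."""
    missing = [name for name in REQUIRED_WITHOUT_DEFAULT if not env.get(name)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    return CrawlerConfig(
        # Assumption: the documented default applies when AWS_REGION is unset.
        aws_region=env.get("AWS_REGION", "us-east-1"),
        aws_s3_bucket=env["AWS_S3_BUCKET"],
        perspective_api_key=env["PERSPECTIVE_API_KEY"],
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to test without touching the real environment.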
If AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are provided, the crawler uses them to assume a role and does not use AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, or AWS_SESSION_TOKEN.
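The precedence between the role variables and the static key variables can be sketched as a small decision function. This is an illustration under stated assumptions: `select_credential_source` and its return labels are hypothetical, and the `default_chain` fallback (used when neither set of variables is present) is an assumption, since this document does not say what the crawler does in that case.

```python
from typing import Mapping


def select_credential_source(env: Mapping[str, str]) -> str:
    """Pick which credentials to use, following the documented precedence."""
    # Role assumption via web identity wins whenever both variables are set.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    # Otherwise use static keys; AWS_SESSION_TOKEN is optional alongside them.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static_keys"
    # Assumption: with nothing set, defer to the SDK's default lookup.
    return "default_chain"
```

For example, an environment that sets all five credential variables resolves to `web_identity`, so the static keys are ignored.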
License
The project is licensed under the Apache 2.0 License.