Toxicity comments crawler

Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

The toxicity level of each comment is scored using the Perspective API.
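
As an illustration of the scoring step, here is a minimal sketch of a Perspective API call using only the Python standard library. It is not taken from the crawler's source: the function names (`build_request_body`, `extract_toxicity`, `score_comment`) are hypothetical, and the sketch assumes that only the `TOXICITY` attribute is requested.

```python
import json
import urllib.request

# Public Perspective API endpoint for comment analysis.
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request_body(text: str) -> dict:
    """Build the JSON body asking for the TOXICITY attribute only (assumption)."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: dict) -> float:
    """Read the summary TOXICITY score (a probability in [0, 1]) from a response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def score_comment(text: str, api_key: str) -> float:
    """Send one comment to the Perspective API and return its toxicity score."""
    request = urllib.request.Request(
        f"{PERSPECTIVE_URL}?key={api_key}",
        data=json.dumps(build_request_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return extract_toxicity(json.load(reply))
```

With a valid key, `score_comment("some comment", os.environ["PERSPECTIVE_API_KEY"])` would return a float between 0 and 1, where higher values mean the comment is more likely to be perceived as toxic.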

Architecture

Pending

Usage

To run the crawler, you need to provide the following environment variables:

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| AWS_ROLE_ARN | AWS Role ARN | None | Optional |
| AWS_WEB_IDENTITY_TOKEN_FILE | AWS Web Identity Token File | None | Optional |
| AWS_ACCESS_KEY_ID | AWS Access Key ID | None | Optional |
| AWS_SECRET_ACCESS_KEY | AWS Secret Access Key | None | Optional |
| AWS_SESSION_TOKEN | AWS Session Token | None | Optional |
| AWS_REGION | AWS Region | us-east-1 | Required |
| AWS_S3_BUCKET | AWS S3 Bucket | None | Required |
| LOG_LEVEL | Log level | INFO | Optional |
| PERSPECTIVE_API_KEY | Perspective API Key | None | Required |
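
The variables listed above could be validated at startup along these lines. This is a hedged sketch, not the crawler's actual code: `load_config`, `REQUIRED`, and `DEFAULTS` are hypothetical names, and the sketch assumes that a required variable with a documented default (AWS_REGION) falls back to that default when unset.

```python
import os

# Required variables that have no default value.
REQUIRED = ("AWS_S3_BUCKET", "PERSPECTIVE_API_KEY")
# Variables that fall back to a documented default when unset (assumption).
DEFAULTS = {"AWS_REGION": "us-east-1", "LOG_LEVEL": "INFO"}


def load_config(env=None) -> dict:
    """Collect the crawler settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name) or default
    return config
```

Failing fast with the full list of missing names avoids discovering a missing key only after the first comment has been scraped.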

If both AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are provided, the crawler uses them to assume a role and ignores AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
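
The precedence rule can be sketched as a small pure function. The function name and the returned labels are hypothetical, and the fallback when neither set of variables is present is an assumption, since the README does not describe that case.

```python
def select_credential_source(env) -> str:
    """Decide which AWS credentials to use, given a mapping of env variables."""
    # Role assumption wins whenever both web identity variables are set.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    # Otherwise use static keys (AWS_SESSION_TOKEN is optional alongside them).
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static_keys"
    # Assumption: with nothing set, defer to the AWS SDK's default lookup.
    return "default_chain"
```

For example, an environment that sets all five credential variables resolves to `"web_identity"`, so the static keys are never consulted.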

License

The project is licensed under the Apache 2.0 License.

GitHub - DougTrajano/toxicity-crawler: Crawler job that scrapes toxic comments from social media.