pyspark_sam

This repo hosts my code for the article “Analyze Big Sequence Alignments with PySpark in AWS EMR”.

Prerequisites

Spark
AWS CLI
AWS Account

Run

Follow the instructions in the article. Once you have uploaded the files to your S3 bucket, replace every [your_S3_bucket] placeholder with your bucket name and run:

aws emr create-cluster --name "Spark_step_pip" \
    --release-label emr-6.5.0 \
    --applications Name=Spark \
    --log-uri s3://[your_S3_bucket]/logs/ \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://[your_S3_bucket]/emr_bootstrap.sh \
    --use-default-roles --auto-terminate \
    --steps "Type=Spark,Name=SparkProgram,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--py-files,s3://[your_S3_bucket]/helper_function.py,s3://[your_S3_bucket]/spark_3mer.py,s3://[your_S3_bucket]/test.sam,[your_S3_bucket],sankey.json]"
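
The [your_S3_bucket] placeholder appears four times inside the --steps value alone, so it is easy to miss one when editing by hand. The following is a minimal sketch, not part of this repo, that builds the same --steps string from a single bucket name. The function name build_spark_step and its parameters are hypothetical, and reading the last three arguments as input SAM file, bucket name, and output file name is an assumption based on the command above.

```python
# Hypothetical helper (not part of this repo): builds the --steps value used in
# the create-cluster command so the bucket name only has to be typed once.


def build_spark_step(bucket: str,
                     sam_file: str = "test.sam",
                     output_name: str = "sankey.json") -> str:
    """Return the --steps string for the Spark job, in the README's layout."""
    base = f"s3://{bucket}"
    args = [
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--py-files", f"{base}/helper_function.py",
        f"{base}/spark_3mer.py",   # main PySpark script
        f"{base}/{sam_file}",      # assumed: input alignment file
        bucket,                    # assumed: bucket that receives the result
        output_name,               # assumed: name of the result file
    ]
    return (
        "Type=Spark,Name=SparkProgram,ActionOnFailure=CONTINUE,"
        f"Args=[{','.join(args)}]"
    )


if __name__ == "__main__":
    # "my-example-bucket" is a made-up name; substitute your own.
    print(build_spark_step("my-example-bucket"))
```

The printed string can be pasted, in double quotes, after --steps in the command above. The --log-uri and --bootstrap-actions options still need the bucket name filled in separately.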

When the job finishes, download sankey.json and run this command to visualize it:

python sankey.py sankey.json

Authors

Sixing Huang – Concept and Coding

License

This project is licensed under the MIT License – see the LICENSE file for details.
