Transfer data from GCS to AWS S3 using Apache Airflow




The task is to extract data from Google Cloud Storage (GCS) and load it into an AWS S3 bucket programmatically and on a schedule. There are a few ways to do this, for example a script that calls the Google Cloud and AWS APIs directly to read from GCS and write to S3. Apache Airflow is a good fit for many kinds of data pipelines, and because Airflow DAGs are written in Python, such a pipeline could be built on AWS boto3 and the Google Cloud client libraries. Luckily for us, Airflow already ships a built-in operator that does exactly this job, called "GoogleCloudStorageToS3Operator". Under the hood, it uses the two APIs mentioned above and wraps them in a handy abstraction.

If we would like to use a fully managed Apache Airflow service, we can choose Cloud Composer on Google Cloud Platform. However, one non-trivial issue remains before we reach our goal: setting up AWS access from Composer. This page walks through all the steps needed to make an Airflow DAG able to access AWS S3.
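To make the idea concrete, here is a minimal sketch of what such a DAG could look like. It assumes Airflow 2.4 or later with the Amazon provider package installed, where the operator is available as GCSToS3Operator. The bucket names, the DAG id and the connection id "aws_s3_transfer" are hypothetical placeholders, and argument names can differ between provider versions, so treat it as an illustration rather than a drop-in file.

```python
from datetime import datetime

# Hypothetical names, used only for illustration.
SOURCE_GCS_BUCKET = "my-source-bucket"
TARGET_S3_URL = "s3://my-target-bucket/"
AWS_CONN_ID = "aws_s3_transfer"  # must match the connection stored for Airflow

# Keyword arguments for the transfer task.
TRANSFER_ARGS = {
    "task_id": "gcs_to_s3",
    # Older provider releases name this argument "bucket".
    "gcs_bucket": SOURCE_GCS_BUCKET,
    "dest_s3_key": TARGET_S3_URL,
    "dest_aws_conn_id": AWS_CONN_ID,
    # With replace=False, objects already present in S3 are not copied again.
    "replace": False,
}


def build_dag():
    """Create a daily DAG with a single GCS-to-S3 transfer task."""
    # Imported here so the settings above can be inspected without Airflow installed.
    from airflow import DAG
    from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator

    with DAG(
        dag_id="gcs_to_s3_transfer",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        GCSToS3Operator(**TRANSFER_ARGS)
    return dag
```

In a real DAG file we would call `dag = build_dag()` at module level so that the scheduler can discover the DAG. The only AWS-specific part is the connection id, which is why the rest of this page focuses on providing that connection.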

Creating the AWS IAM resources for access


Let’s assume that we already have our Google and AWS accounts and the source and target buckets in both of the storage services (GCS and S3). Furthermore, we have a preconfigured Vault server for storing sensitive data, e.g. AWS credentials. The advantage of using Vault for storing secrets is that it can be set up as a backend for Airflow variables and connections. See:

For accessing AWS resources externally, we will need some IAM identities that we can use to get the relevant permissions.

The required resources:

  • IAM User for accessing AWS resources externally

  • IAM User Policy with the proper permissions

  • IAM Access key of the User we just created

  • Vault secret that contains the ID and the secret of the Access key

    • It should be stored under the path that is configured for Airflow connections in Vault, so that the DAG can find it as a connection through the Vault backend
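To illustrate what the policy and the Vault secret might contain, here is a small Python sketch that uses only the standard library. The exact set of S3 actions is an assumption (listing the bucket and writing objects), as are the bucket name and the example credentials. The "conn_uri" key and the "aws://" URI scheme follow the conventions of the Airflow Vault backend and the Amazon provider as far as we know, so check them against the versions you use.

```python
import json
from urllib.parse import quote


def build_s3_write_policy(bucket_name: str) -> dict:
    """Return an IAM policy document that allows listing one bucket and writing objects to it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permission: list the existing keys.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Object-level permission: upload objects.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


def build_aws_conn_uri(access_key_id: str, secret_access_key: str) -> str:
    """Build an Airflow connection URI for AWS from an access key pair."""
    # Secret keys may contain characters such as "/" and "+",
    # so both parts are percent-encoded before they go into the URI.
    login = quote(access_key_id, safe="")
    password = quote(secret_access_key, safe="")
    return f"aws://{login}:{password}@"


def build_vault_secret(access_key_id: str, secret_access_key: str) -> dict:
    """Return the payload to store in Vault under the Airflow connections path."""
    return {"conn_uri": build_aws_conn_uri(access_key_id, secret_access_key)}


if __name__ == "__main__":
    # Hypothetical bucket name and credentials, for illustration only.
    print(json.dumps(build_s3_write_policy("my-target-bucket"), indent=2))
    print(build_vault_secret("AKIAEXAMPLE", "abc/def+ghi"))
```

The name of the secret inside the connections path becomes the connection id that the DAG refers to, so the two have to match.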

Terraform code that is needed to create the relevant AWS resources:
