Our goal is to highlight the ability to consume streaming data from AWS Kinesis, build a Datalake in S3 and run SQL queries from Athena.
The objective is to not only show our architecture but provide actual cloudformation to create an entire datalake in matter of minutes.
If you understand the above architecture and want to jump right into actual code, here is a Link to Cloudformation template. Simply choose any stack name and create stack from the CloudFormation console or use AWS CLI. This will create all necessary resources including a test Lambda function to publish events to Kinesis. Run the test lambda from console using this sample JSON and observe the events in S3 buckets (in 60 seconds) and query from Athena.
- Running Test Lambda will publish events into Kinesis.
- Firehose consumes Kinesis events, backs up the source event into a raw event bucket(30 day retention), converts it to parquet format based on Glue schema and determines the partition based on current time and writes to final S3 bucket, our datalake.
- Glue Table holds all the partition information for S3 data.
- Glue Crawler Job runs every 30 minutes, looks for any new documents in S3 bucket and create/updates/deletes partition metadata.
- Run sql queries in Athena which uses Glue Table partition metadata to efficiently query S3.
Now lets go step by step and understand each resource in CF. Click on the heading will lead to actual resource in CloudFormation.
Resources related to storing data in S3:
- KMS Key to encrypt data written to buckets
- Allow Firehose to use this key
- S3 bucket, our actual data lake.
- Data stored in bucket is encrypted with KMS key
- S3 bucket to store raw JSON event
- Documents will be deleted after 30 days.
- Kinesis stream with 1 shard (1 partition in Kafka language)
- Data in Kinesis encrypted with KMS.
- Test Lambda written in nodejs to publish events into Kinesis.
- Any code less than 4096 characters can be embedded right into CloudFormation
All necessary policies for lambda function have been set here.
- Key needed to encrypt data written kinesis data.
- key policy includes access to use for Lambda Role.
- Role used by firehose needs several access polices.
- Some of them include access to Glue table, upload to S3 buckets, to be able to read streams, decrypt with KMS keys, etc.
- Documents are partitioned when writing to S3. We can then give partition range in where conditions in Athena queries to avoid full bucket scan and save a lot of money and time.
Once CloudFormation is executed, you should see all above resources created.
Run the Lambda from console with using this sample JSON event and you should see it S3 bucket in a minute.