AWS infrastructure for clickstream data analysis
Problem statement
Analysing user behaviour is very important for making the right decisions in application development and maximising business value.
AWS provides many services useful for data streaming, transformation and analysis. In this blog post I’m describing a Proof of Concept project where I tried to simulate clickstream activity, ingest it into AWS, then store, transform, aggregate and analyse it.
The proposed solution
The following high-level diagram shows the AWS infrastructure that was built.

Next I will go through every component and explain its role in the system.
Data ingestion
We expect that clickstream events arriving in our system will be small pieces of data generated continuously at high speed and volume. We might need to analyse them in near real time, so Amazon Kinesis fits this case perfectly.
Amazon Kinesis Data Streams
Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.

We can send events directly to a Kinesis data stream using the Kinesis Producer Library (KPL) or the AWS API/SDK, but we can also make it a bit easier for our customers by exposing Kinesis through Amazon API Gateway, so that they can send regular HTTP requests. Moreover, we can take advantage of request validation or protection with AWS Web Application Firewall (WAF). An example of such an API is provided by AWS.

Amazon API Gateway will automatically scale to handle the amount of traffic your API receives.

I have developed a simple Python script that generates a JSON payload with some user information and sends it to the API Gateway endpoint. Each event has a user id, a randomly generated IP address, and an event_name describing the action (Search, AddToCart, ViewContent, Purchase); a sample payload and a sketch of the generator are shown below.
{
  "timestamp": "2021-07-12 12:01:58.732726",
  "user_id": "35",
  "pixel_id": "wjgao4w1oi",
  "click_id": "a5cf179b9c9d483abf6d424d44a293be",
  "insertion_timestamp": "2021-07-12 12:01:58.732754",
  "event_name": "Search",
  "user_ip": "111.33.64.227",
  "additional_data": {
    "time_on_data": 79,
    "percent_viewed": 39.7,
    "product_id": 557246,
    "price": 417.97
  }
}
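A minimal sketch of such a generator script, assuming the requests library and a placeholder API Gateway endpoint URL (both are illustrative, not the exact code used in the project):

import json
import random
import uuid
from datetime import datetime

import requests  # assumed HTTP client; any client able to POST JSON works

# Hypothetical API Gateway endpoint exposing the Kinesis stream
API_ENDPOINT = "https://example.execute-api.eu-west-1.amazonaws.com/prod/clickstream"

EVENT_NAMES = ["Search", "AddToCart", "ViewContent", "Purchase"]


def generate_event() -> dict:
    """Build one clickstream event matching the sample payload above."""
    now = str(datetime.utcnow())
    return {
        "timestamp": now,
        "user_id": str(random.randint(1, 100)),
        "pixel_id": uuid.uuid4().hex[:10],
        "click_id": uuid.uuid4().hex,
        "insertion_timestamp": now,
        "event_name": random.choice(EVENT_NAMES),
        "user_ip": ".".join(str(random.randint(1, 254)) for _ in range(4)),
        "additional_data": {
            "time_on_data": random.randint(1, 120),
            "percent_viewed": round(random.uniform(0, 100), 1),
            "product_id": random.randint(100000, 999999),
            "price": round(random.uniform(1, 500), 2),
        },
    }


def send_event(event: dict) -> None:
    """POST the event to the API Gateway endpoint as plain JSON."""
    response = requests.post(API_ENDPOINT, json=event, timeout=5)
    response.raise_for_status()


if __name__ == "__main__":
    for _ in range(10):
        send_event(generate_event())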
The “user_id” field was also used as the partition key for the Kinesis data stream, to segregate and route records to different shards.
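For comparison, writing the same event directly to the stream with the AWS SDK (the option mentioned earlier) could look roughly like the snippet below; the stream name is a placeholder, and credentials/region are assumed to come from the environment. Records sharing the same PartitionKey always land on the same shard.

import json

import boto3  # AWS SDK for Python

kinesis = boto3.client("kinesis")

# Hypothetical stream name used only for this sketch
STREAM_NAME = "clickstream-events"


def put_event(event: dict) -> dict:
    """Write one event to Kinesis, partitioned by user_id."""
    return kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )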
A single shard can ingest up to 1 MB of data per second (including partition keys) or 1,000 records per second for writes. The maximum size of a record’s data payload before base64 encoding is 1 MB.
A Kinesis data stream doesn’t scale automatically, but we can increase the number of shards using CloudWatch + Lambda (the UpdateShardCount API). Scaled to 5,000 shards, a stream can ingest up to 5 GB per second or 5 million records per second.
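As a rough illustration of that scaling path, a Lambda handler triggered by a CloudWatch alarm might call UpdateShardCount as sketched below; the stream name, the simple doubling policy and the 5,000-shard ceiling are assumptions made for the example.

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name; the cap mirrors the 5,000-shard figure above
STREAM_NAME = "clickstream-events"
MAX_SHARDS = 5000


def lambda_handler(event, context):
    """Double the number of open shards when a CloudWatch alarm fires."""
    summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    target = min(current * 2, MAX_SHARDS)

    if target > current:
        kinesis.update_shard_count(
            StreamName=STREAM_NAME,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )
    return {"previous_shards": current, "target_shards": target}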
