Efficiently Load Millions of Rows Daily from S3 into AWS Redshift


A few months ago I built a pipeline for a logistics analytics team that collects package events—delivery scans, route status updates, warehouse entries, etc. The events come from 11 distributed warehouses across India, aggregating to ~40M records/day.
This system is now running in production, costing roughly ₹21K/month (about $250–260) and scaling well.
This is how it works.
High-Level Requirements
| Warehouse | Avg Records/Day |
| --- | --- |
| Chennai | 4M |
| Pune | 6M |
| Delhi | 5M |
| Bangalore | 9M |
| Misc others (7) | 16M |
Total ≈ 40–45 million/day
Query Use Cases
Number of packages delivered per state
Delivery lag between warehouse entry → successful delivery
Per-user delivery throughput
Route delays based on timestamp ranges
SLA
Data must be available in Redshift every morning before 7:30 AM.
Final Architecture
Warehouse Systems → Kafka → S3 Raw Zone (/dt=YYYY-MM-DD)
        ↓
Glue Streaming Job
        ↓
S3 Processed Zone (Parquet, partitioned)
        ↓
Glue Batch ETL (Spark)
        ↓
Redshift Staging
        ↓
Redshift MERGE into Fact Tables
Why Glue + Spark?
Because:
✔ Data volume increases month-on-month
✔ Schema evolves weekly
✔ Redshift COPY is not cost-efficient or flexible for streaming
✔ We needed parallel transformations
✔ We needed schema enforcement, dedup, and watermarking
Yes, COPY is good, but only if your files are clean, partitioned, and validated.
Our raw data was not.
Actual S3 Layout
s3://logistics/events/raw/dt=2025-12-01/hour=08/batch-000530.json
s3://logistics/events/raw/dt=2025-12-01/hour=08/batch-000531.json
...
s3://logistics/events/processed/dt=2025-12-01/hour=08/file-00212.snappy.parquet
Partitioning by date + hour was critical because queries run mostly daily/hourly aggregations.
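To make that concrete, here is a minimal sketch of a daily aggregation that benefits from partition pruning (the aggregation itself is illustrative, not part of the production job). Because processed data is laid out under dt=YYYY-MM-DD directories, a filter on dt means Spark lists and scans only that day's Parquet files:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-aggregation-demo").getOrCreate()

# The dt filter is pushed down to the partition directories, so only
# dt=2025-12-01 files are listed and scanned
events = (
    spark.read.parquet("s3://logistics/events/processed/")
         .where(F.col("dt") == "2025-12-01")
)

events.groupBy("warehouse", "status").count().show()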
Glue Streaming Job (this actually runs)
The Glue streaming job picks up new files every 5 minutes (micro-batches), reading the Kafka topic output that lands in S3.
Key Logic:
Convert JSON to structured DataFrame
Validate schema
Drop duplicates using event_id
Write Parquet by partition
Job snippet:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import col, to_timestamp, hour
from pyspark.sql.types import StructType, StructField, StringType

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Streaming file sources need an explicit schema
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("package_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", StringType()),
    StructField("warehouse", StringType()),
    StructField("payload", StringType()),
])

df = spark.readStream \
    .schema(event_schema) \
    .json("s3://logistics/events/raw/")

cleanDF = df.select(
    col("event_id"),
    col("package_id"),
    col("status"),
    to_timestamp(col("event_ts")).alias("event_ts"),
    col("warehouse"),
    col("payload")
).dropDuplicates(["event_id"])

finalDF = cleanDF \
    .withColumn("dt", col("event_ts").cast("date")) \
    .withColumn("hh", hour(col("event_ts")))

# Glue streaming jobs can use Spark Structured Streaming's native Parquet sink;
# the checkpoint path below is a placeholder
finalDF.writeStream \
    .format("parquet") \
    .partitionBy("dt", "hh") \
    .option("path", "s3://logistics/events/processed/") \
    .option("checkpointLocation", "s3://logistics/events/_checkpoints/streaming/") \
    .trigger(processingTime="5 minutes") \
    .start()
This single decision (partitioning + Parquet) reduced Redshift load time by 65%.
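One caveat: dropDuplicates on a stream without a watermark keeps every event_id it has ever seen in state. The watermarking mentioned earlier bounds that state; here is a minimal sketch, assuming we tolerate up to 48 hours of lateness (the bound is an assumption, chosen to cover the roughly two-day-late events discussed further below):
from pyspark.sql.functions import col, to_timestamp

# Bound dedup state: events arriving more than 48 hours behind the latest
# observed event_ts are dropped from state and handled by the backfill job instead
dedupDF = (
    df.withColumn("event_ts", to_timestamp(col("event_ts")))
      .withWatermark("event_ts", "48 hours")
      .dropDuplicates(["event_id", "event_ts"])
)
Including event_ts in the dedup key is what lets the watermark expire old state; a retry carrying a different timestamp would slip through, which is one more reason the downstream MERGE stays keyed on event_id.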
Batch Job for Redshift Load
Every day at 6:40 AM, Glue triggers a Spark job:
Tasks:
Read yesterday’s partition
Create surrogate keys
Compute delivery SLA metrics
Insert into Redshift staging
Run stored procedure for merge
Spark Code Excerpt:
# Read yesterday's partition (date shown hardcoded for clarity)
processedDF = spark.read.parquet(
    "s3://logistics/events/processed/dt=2025-12-01/"
)

# The connector stages data in S3 and COPYs it into Redshift;
# tempdir is required (placeholder path)
processedDF.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", redshift_jdbc) \
    .option("dbtable", "staging_delivery_events") \
    .option("tempdir", "s3://logistics/redshift-temp/") \
    .option("aws_iam_role", iam_role) \
    .mode("overwrite") \
    .save()
This loads ~45M rows into staging in ~6 minutes. A plain COPY from the raw files couldn't achieve this because of the dedupe and schema logic.
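For completeness, steps 2 and 3 (surrogate keys and SLA metrics) sit between the read and the write above. A minimal sketch; the key expression, status literals, and column names are assumptions rather than the production definitions:
from pyspark.sql import functions as F

# Step 2: build a deterministic surrogate key from the natural key, so a rerun
# of the same partition regenerates identical keys (expression is an assumption)
enrichedDF = processedDF.withColumn(
    "delivery_sk", F.sha2(F.concat_ws("|", "package_id", "event_id"), 256)
)

# Step 3: delivery SLA, i.e. lag between warehouse entry and successful
# delivery per package (the status values are assumptions)
slaDF = (
    enrichedDF.groupBy("package_id")
    .agg(
        F.min(F.when(F.col("status") == "WAREHOUSE_ENTRY", F.col("event_ts"))).alias("entered_at"),
        F.max(F.when(F.col("status") == "DELIVERED", F.col("event_ts"))).alias("delivered_at"),
    )
    .withColumn(
        "delivery_lag_hours",
        (F.col("delivered_at").cast("long") - F.col("entered_at").cast("long")) / 3600,
    )
)
Hashing the natural key keeps reruns idempotent, which matters once the backfill job (described below) reprocesses old partitions.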
Redshift Merge Logic
We did NOT do naive deletes.
We used Redshift MERGE, which is highly efficient on RA3 clusters.
MERGE INTO fact_delivery_events
USING staging_delivery_events AS source
ON fact_delivery_events.event_id = source.event_id
WHEN MATCHED THEN UPDATE
SET status = source.status,
    event_ts = source.event_ts,
    warehouse = source.warehouse
WHEN NOT MATCHED THEN INSERT
(
    event_id, package_id, status, warehouse, event_ts
)
VALUES
(
    source.event_id,
    source.package_id,
    source.status,
    source.warehouse,
    source.event_ts
);
Runtime: ~8 minutes. Previous naive batch INSERT runtime: ~25 minutes.
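The batch job fires this as its final step by calling a stored procedure that wraps the MERGE. A minimal sketch using the Redshift Data API; the procedure name, cluster identifier, database, and secret ARN are all assumptions:
import boto3

redshift_data = boto3.client("redshift-data", region_name="ap-south-1")

# Kick off the merge procedure (identifiers below are placeholders)
resp = redshift_data.execute_statement(
    ClusterIdentifier="logistics-ra3",
    Database="analytics",
    SecretArn="arn:aws:secretsmanager:...",  # cluster credentials
    Sql="CALL sp_merge_delivery_events();",
)

# Check execution status; in the job this is polled until FINISHED or FAILED
status = redshift_data.describe_statement(Id=resp["Id"])["Status"]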
Real Problems We Faced
Problem #1
Late-arriving events (sometimes 2 days old)
Fix: Glue triggers an incremental backfill job → partition-based query → efficient rerun
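A minimal sketch of that trigger, assuming the backfill job accepts the late partition's date as a job argument (the job name and argument name are assumptions):
import boto3

glue = boto3.client("glue", region_name="ap-south-1")

# Re-run the batch transform for a single late partition (names are placeholders)
glue.start_job_run(
    JobName="logistics-backfill-etl",
    Arguments={"--backfill_dt": "2025-11-29"},  # the late partition to reprocess
)
Inside the job, the same read-transform-merge path runs, but only against s3://logistics/events/processed/dt=<backfill_dt>/, so a two-day-old event simply becomes another upsert in the MERGE.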
Problem #2
Same event_id arriving twice due to upstream retries
Fix: Spark .dropDuplicates(["event_id"])
Problem #3
Redshift VACUUM used to run for 4+ hours
Fix:
ALTER TABLE fact_delivery_events
SET TABLE PROPERTIES (table_type = 'TABLE', autovacuum = true);
Final Performance
| Stage | Time |
| --- | --- |
| Streaming ingestion + partitioning | Continuous |
| Glue Batch Transform (~45M rows) | ~7–8 min |
| Redshift Write to Staging | ~5–6 min |
| Merge to Fact Table | ~8 min |
| Total SLA Achieved | ~20–25 min |
And it scales cleanly:
New warehouses? → Just add new partitions.
Higher daily volume? → Spark parallelizes automatically.
Monthly AWS Cost (real bill numbers - 2025)
| Service | Cost |
| --- | --- |
| Glue Streaming | ₹11,200 (~$140) |
| Glue Batch | ₹3,800 (~$45) |
| Redshift RA3 x2 | ₹4,800 (~$60, reserved instance) |
| S3 Storage | ₹1,200 (~$14) |
| TOTAL | ~₹21,000 ($250–260) |
That is all for today. Thanks for reading. If you want to learn how I set up the project structure and handled scheduling and deployment, please follow me and check out my next blog.