IKEA Data Warehousing

πŸš€ IKEA Data Warehouse | End-to-End Data Engineering & BI Project

Saquib Hazari

πŸ“… Sep 25, 2025

GitHub


πŸ› οΈ Skills & Technologies Used

PostgreSQL · SQL · ETL · Data Warehousing · Data Modeling · CSV Processing · Git Version Control

🏠 Ikea Data Warehouse

This repository showcases a comprehensive data warehousing and analytics solution, covering the end-to-end data pipelineβ€”from data ingestion and transformation to business intelligence and insights generation.

Designed as a portfolio project, it demonstrates industry best practices in data engineering, ETL processes, and analytical modeling, enabling efficient data-driven decision-making. This project serves as a blueprint for building a scalable and structured data warehouse architecture that supports business intelligence and reporting.


βš™οΈ Data Architect

Basic Data Architect used for this project to build Bronze, Silver, and Gold layers.

Data Architect

  1. Bronze Layer: The raw data storage layer, where data is ingested directly from source systems (e.g., CSV files) into the PostgreSQL database without modifications. This layer acts as a staging area, preserving the original data for historical reference and traceability.
  2. Silver Layer: The data transformation layer, where raw data undergoes cleansing, standardization, and normalization to ensure consistency, accuracy, and integrity, making it suitable for deeper analysis.
  3. Gold Layer: The business-ready layer, where data is modeled into a structured star schema, optimized for reporting, analytics, and decision-making.
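In practice, each Medallion layer maps to its own database schema, so objects are referenced as `bronze.*`, `silver.*`, and `gold.*` throughout the project. A minimal sketch:

```sql
-- One schema per Medallion layer (names follow the project's convention)
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;
```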

πŸ“– Project Overview

This project involves:

  • Data Architecture: Designing a Modern Data Warehouse using the Medallion Architecture (Bronze, Silver, and Gold layers).
  • ETL Pipelines: Extracting, transforming, and loading data from source systems into the warehouse.
  • Data Modeling: Developing fact and dimension tables optimized for analytical queries.
  • Analytics & Reporting: Creating SQL-based reports and dashboards for actionable insights.
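To illustrate the data modeling step, a simplified Gold-layer fact table might look like the sketch below. Column names and types here are assumptions for illustration, not the project's exact DDL:

```sql
-- Hypothetical star-schema sketch: one fact table keyed to two dimensions
CREATE TABLE gold.fact_sales (
    order_id     TEXT,
    customer_key INT  REFERENCES gold.dim_customers (customer_key),
    product_key  INT  REFERENCES gold.dim_products (product_key),
    order_date   DATE,
    ship_date    DATE,
    order_status TEXT,
    net_sales    NUMERIC(12, 2)
);
```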

This repository is an excellent resource for professionals and students looking to showcase expertise in:

  • SQL Development
  • Data Architecture
  • Data Engineering
  • ETL Pipeline Development
  • Data Modeling
  • Data Analytics

πŸš€ Project Requirements

🧱 Building the Data Warehouse (Data Engineering)

Objective
Develop a modern data warehouse using PostgreSQL to consolidate sales data, enabling analytical reporting and informed decision-making.

Specifications

  • Data Sources: Import data from two source systems (ERP and CRM) provided as CSV files.
  • Data Quality: Cleanse and resolve data quality issues prior to analysis.
  • Integration: Combine both sources into a single, user-friendly data model designed for analytical queries.
  • Scope: Focus on the latest dataset only; historization of data is not required.
  • Documentation: Provide clear documentation of the data model to support both business stakeholders and analytics teams.
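A typical cleansing step from the Data Quality requirement above can be sketched as follows (table and column names, including `updated_at`, are assumptions):

```sql
-- Deduplicate on customer_id, keeping the most recent record,
-- and normalize empty strings to NULL on the way into Silver
INSERT INTO silver.customers (customer_id, customer_name, gender)
SELECT DISTINCT ON (customer_id)
       customer_id,
       TRIM(customer_name),
       NULLIF(TRIM(gender), '')
FROM bronze.customers
ORDER BY customer_id, updated_at DESC;
```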

βš™οΈ ETL (Extract Transform and Load)

The ETL process is the backbone of this data warehouse project, ensuring seamless data ingestion, transformation, and integration from multiple sources into a structured analytical model. This project follows a layered Medallion Architecture (Bronze, Silver, Gold) to progressively refine raw data into a business-ready format for reporting and analytics.
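A Bronze-layer load in this setup is typically a plain bulk ingest with no transformation. A sketch, where the file path and target table are assumptions:

```sql
-- Full-refresh pattern: truncate, then bulk-load the raw CSV unchanged
TRUNCATE bronze.customers;
COPY bronze.customers
FROM '/path/to/CSV/Ikea_sale/customers.csv'
WITH (FORMAT csv, HEADER true);
```

Note that `COPY ... FROM` reads from the database server's filesystem; from a client, `psql`'s `\copy` is the usual equivalent.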


🧰 Project Planning


πŸ“ˆ BI: Analytics & Reporting (Data Analysis)

Objective
Develop SQL-based analytics to deliver detailed insights into:

  • Total Orders
  • Total Profits
  • Regional Sales
  • Sales Trends

Ikea Dashboard

These insights empower stakeholders with key business metrics, enabling strategic decision-making.
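For example, the regional sales metric above could be computed along these lines (join keys and columns such as `state` and `net_sales` are assumptions):

```sql
-- Total sales and order counts per customer state
SELECT c.state,
       SUM(f.net_sales)           AS total_sales,
       COUNT(DISTINCT f.order_id) AS total_orders
FROM gold.fact_sales f
JOIN gold.dim_customers c
  ON c.customer_key = f.customer_key
GROUP BY c.state
ORDER BY total_sales DESC;
```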

➑️ Key Findings

πŸ“Š Insights Delivered

  • Monthly Orders: 5,000+
  • Total Sales: $2.93M/year
  • Avg. Monthly Order Value: $4,906
  • Top Customer States: Maharashtra, Telangana, Tamil Nadu
  • High Demand Products: Beds, Lamps, Office Furniture, Living Room Sets

πŸ”Ž Exploratory Data Analysis (EDA)

This section includes powerful SQL queries to extract actionable insights from the Gold Layer tables: dim_customers, dim_products, and fact_sales.


πŸ‘€ Customer Insights

1. Schema Check


-- List all base tables in the gold schema
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'gold' AND table_type = 'BASE TABLE';

-- Inspect the columns of a given gold table (replace your_table_name)
SELECT
  column_name,
  data_type,
  is_nullable
FROM information_schema.columns
WHERE table_schema = 'gold'
  AND table_name = 'your_table_name';

-- Row counts across the three Gold-layer tables
SELECT 'dim_customers' AS table_name, COUNT(*) FROM gold.dim_customers
UNION ALL
SELECT 'dim_products', COUNT(*) FROM gold.dim_products
UNION ALL
SELECT 'fact_sales', COUNT(*) FROM gold.fact_sales;

2. gold.dim_customers Table Check

-- Gender distribution
SELECT gender, COUNT(*) FROM gold.dim_customers GROUP BY gender;

-- Customer age distribution, derived from date_of_birth
SELECT DATE_PART('year', AGE(date_of_birth)) AS age, COUNT(*)
FROM gold.dim_customers
GROUP BY age
ORDER BY age;

-- Most popular preferred stores
SELECT preferred_store, COUNT(*)
FROM gold.dim_customers
GROUP BY preferred_store
ORDER BY COUNT(*) DESC;

3. gold.dim_products Table Check

-- Product count per category
SELECT category, COUNT(*)
FROM gold.dim_products
GROUP BY category
ORDER BY COUNT(*) DESC;

-- How many products are flagged as defective
SELECT defect_flag, COUNT(*)
FROM gold.dim_products
GROUP BY defect_flag;

-- Top 10 suppliers by product count
SELECT supplier_name, COUNT(*)
FROM gold.dim_products
GROUP BY supplier_name
ORDER BY COUNT(*) DESC
LIMIT 10;

4. gold.fact_sales Table Check

-- Order volume per month
SELECT DATE_TRUNC('month', order_date) AS order_month, COUNT(*)
FROM gold.fact_sales
GROUP BY order_month
ORDER BY order_month;

-- Average and worst-case shipping delays, in days
SELECT
  AVG(ship_date - order_date) AS avg_shipping_days,
  MAX(ship_date - order_date) AS max_shipping_days
FROM gold.fact_sales;

-- Net sales broken down by order status
SELECT order_status, SUM(net_sales) AS total_sales
FROM gold.fact_sales
GROUP BY order_status;

5. Health Check

-- List every foreign-key constraint defined on tables in the gold schema
SELECT
    conname             AS constraint_name,
    conrelid::regclass  AS table_from,
    a.attname           AS column_from,
    confrelid::regclass AS table_to,
    af.attname          AS column_to
FROM pg_constraint
JOIN pg_class       ON conrelid = pg_class.oid
JOIN pg_attribute a  ON a.attrelid = conrelid   AND a.attnum = ANY (conkey)
JOIN pg_attribute af ON af.attrelid = confrelid AND af.attnum = ANY (confkey)
WHERE contype = 'f'
  AND pg_class.relnamespace::regnamespace::text = 'gold';
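The catalog query above verifies that the foreign keys exist; a complementary check (a sketch, assuming `customer_key` is the join column) confirms that no fact rows are orphaned:

```sql
-- Fact rows whose customer key has no matching dimension row
SELECT f.customer_key, COUNT(*) AS orphan_rows
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c
  ON c.customer_key = f.customer_key
WHERE c.customer_key IS NULL
GROUP BY f.customer_key;
```

With the foreign keys from `keys_constraint.sql` enforced, this query should return zero rows.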

Repository Structure

β”œβ”€β”€ assets/                        # Static assets like diagrams or images
β”‚   └── images/
β”‚
β”œβ”€β”€ CSV/                           # All CSV datasets
β”‚   β”œβ”€β”€ Gold_csv/                  # Final curated data (Gold Layer)
β”‚   β”‚   β”œβ”€β”€ dim_customers.csv
β”‚   β”‚   β”œβ”€β”€ dim_products.csv
β”‚   β”‚   └── fact_sales.csv
β”‚   └── Ikea_sale/                # Raw synthetic ERP + CRM datasets
β”‚       β”œβ”€β”€ customer_preferences.csv
β”‚       β”œβ”€β”€ customers.csv
β”‚       β”œβ”€β”€ order_items.csv
β”‚       β”œβ”€β”€ orders.csv
β”‚       β”œβ”€β”€ products.csv
β”‚       └── sales.csv
β”‚
β”œβ”€β”€ License                        # Project license (e.g., MIT)
β”œβ”€β”€ Procedures/                    # ETL procedures for data loading
β”‚   β”œβ”€β”€ bronze_layer_batch.sql
β”‚   β”œβ”€β”€ main.sql
β”‚   └── silver_layer_batch.sql
β”‚
β”œβ”€β”€ project/
β”‚   └── Business analysis/         # Business logic & insight generation
β”‚       β”œβ”€β”€ advance_analysis.sql
β”‚       β”œβ”€β”€ business_analysis.sql
β”‚       └── EDA.sql
β”‚
β”œβ”€β”€ python/                        # Python automation/data generation
β”‚   └── main.py
β”‚
β”œβ”€β”€ readme.md                      # Project documentation
β”œβ”€β”€ requirements.txt               # Python package dependencies
β”‚
β”œβ”€β”€ scripts/                       # DDL scripts to define schema layers
β”‚   β”œβ”€β”€ ddl_bronze_layer.sql
β”‚   β”œβ”€β”€ ddl_gold_layer.sql
β”‚   β”œβ”€β”€ ddl_silver_layer.sql
β”‚   β”œβ”€β”€ keys_constraint.sql
β”‚   └── main.sql
β”‚
β”œβ”€β”€ tests/                         # Placeholder for test SQLs or checks
β”‚
β”œβ”€β”€ todo.todo                      # Notes or to-do list
β”‚
└── venv/                          # Virtual environment files (excluded from version control)

πŸͺͺ License

This project is licensed under the MIT License. You are free to use, modify, and share this project with proper attribution.

About Me

Hi there! I’m Saquib Hazari. With a strong foundation in analytical thinking and data-driven decision-making, I’ve worked on a range of data science projects, including predictive modeling, ETL pipelines, and data warehousing.

Currently, I focus on:

  • Building and optimizing ETL pipelines.
  • Working with large-scale databases and performing advanced data analysis.
  • Using tools like SQL, Python, and Tableau to extract actionable insights from data.

I enjoy exploring new technologies in machine learning, AI, and deep learning. My approach is hands-on, and I love learning by doing β€” whether it’s building machine learning models or fine-tuning complex data pipelines.

Current Projects:

  • Developing a predictive model to understand customer behavior.
  • Implementing an advanced data pipeline to optimize data workflows.
  • Working on various marketing analytics projects to drive business decisions.

Feel free to connect or check out my previous projects, including this data warehouse project!

Connect with Me

LinkedIn Twitter Discord Website Gmail

Technologies: SQL, PostgreSQL, Python

Compatibility: macOS, Windows