Roller Coaster Analysis
roller-coaster-analysis
Saquib Hazari
📅 Sep 25, 2025
The following tools and libraries were utilized in this analysis:
Pandas: For data manipulation and preprocessing. Pandas is essential for handling missing values, encoding categorical variables, and preparing the dataset for machine learning models.
NumPy: For numerical operations and handling arrays. NumPy complements Pandas in efficiently performing mathematical and logical operations on large datasets.
Scikit-Learn: For machine learning algorithms and model evaluation. Scikit-Learn provides a wide range of tools, including train/test splitting, linear and logistic regression, random forest classifiers, and evaluation metrics such as mean absolute error, the confusion matrix, and the classification report.
Matplotlib: For data visualization. Matplotlib is used to create plots and graphs that help in understanding the distribution of the data and the results of the analysis.
Jupyter Notebook: For interactive data analysis and visualization. Jupyter Notebook allows for combining code, visualizations, and narrative text in a single document, facilitating better understanding and communication of the analysis process.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('/Users/saquibhazari/Desktop/Python/roller_coaster/CSV/Assets/coaster_db.csv')
# Visualizing the dataset: years with the most coasters introduced
df['year_introduced'].value_counts().head(10) \
.plot(kind='bar', ylabel='counts', grid=False)  # value_counts() already sorts descending
plt.title('Top 10 Years by Number of Coasters Introduced')
plt.xticks(rotation=45)
plt.show()

# Comparing the speed of the coasters
from matplotlib.ticker import FuncFormatter
df['speed_mph'].plot(kind='hist', bins=20, grid=False, edgecolor='white', xlabel='Speed (mph)')
plt.title('Speed of the roller coasters')
ax = plt.gca()
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f"{int(x)} mph"))  # use the imported FuncFormatter
plt.tight_layout()
plt.show()

# Making a KDE plot for the speed
df['speed_mph'].plot(kind='kde', grid=False)
plt.title('Speed of roller coasters')
plt.xlabel('Speed (mph)')
ax = plt.gca()
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f"{int(x)} mph"))
plt.tight_layout()
plt.show()

# Scatter plot between speed and height using pandas and matplotlib
df.plot(x='speed_mph', y='height_ft', kind='scatter', title='Speed vs Height of coasters')
plt.xlabel('speed_mph')
plt.ylabel('height_ft')
plt.tight_layout()
plt.show()

# Scatter plot between speed and height using seaborn, colored by year introduced
sns.scatterplot(x='speed_mph', y='height_ft', data=df, hue='year_introduced')
plt.title('Speed vs Height of coasters.')
plt.xlabel('speed_mph')
plt.ylabel('height_ft')
plt.tight_layout()
plt.show()

# Correlation heatmap over the numeric columns only (text columns can't be correlated)
df_corr = df.select_dtypes('number').corr()
sns.heatmap(data=df_corr, annot=True, cbar=False)
plt.title('Correlation heatmap of numeric features')
plt.show()

# Pair plot of the key numeric features
g = sns.pairplot(data=df, vars=['year_introduced', 'speed_mph', 'height_ft', 'gforce_clean', 'inversions_clean'], hue='type_main')
g.fig.suptitle('Pair Plot', y=1.02)  # title the whole grid, not just the last axes
plt.show()

# Finally, finding the fastest locations by average coaster speed
df.query("location != 'other'").groupby('location')['speed_mph'] \
.agg(['mean', 'count']).query("count > 10").sort_values('mean')['mean'] \
.plot(kind='barh', figsize=(10, 5), title='Average coaster speed by location', xlabel='Speed (mph)', ylabel='Location', grid=False)
ax = plt.gca()
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f"{int(x)} mph"))
plt.tight_layout()
plt.show()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Predicting speed from year, inversions, g-force, and coaster type
features = ['year_introduced', 'inversions_clean', 'gforce_clean', 'type_main']
model_df = df[features + ['speed_mph']].dropna()  # linear regression cannot handle NaNs
X = pd.get_dummies(model_df[features], columns=['type_main'], drop_first=True)
y = model_df['speed_mph']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print('coefficients:', model.coef_)
print('intercept:', model.intercept_)
predictions = model.predict(X_test)
print('MAE:', mean_absolute_error(y_test, predictions))

# Logistic regression: classifying whether a coaster is still operating
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

y = df['status'].apply(lambda s: 1 if s == 'Operating' else 0)
# Keep the numeric columns plus dummy-encoded type_main; free-text columns are dropped
X = pd.concat([df.select_dtypes('number'),
               pd.get_dummies(df['type_main'], prefix='type_main', drop_first=True)], axis=1)
X = X.fillna(X.median())  # logistic regression cannot handle NaNs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lg_model = LogisticRegression(max_iter=1000)  # raise the iteration cap to help convergence
lg_model.fit(X_train, y_train)
print('coefficients:', lg_model.coef_)
pred_lg = lg_model.predict(X_test)
print(classification_report(y_test, pred_lg))
print(confusion_matrix(y_test, pred_lg))

# Using a Random Forest model
from sklearn.ensemble import RandomForestClassifier
rn_model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducibility
rn_model.fit(X_train, y_train)
rn_pred = rn_model.predict(X_test)
print(classification_report(y_test, rn_pred))
print(confusion_matrix(y_test, rn_pred))

This project involved extensive practice in modeling, feature engineering, and data preprocessing. The following key techniques and methodologies were applied:
Data Cleaning and Preprocessing: The dataset underwent thorough cleaning, including handling missing values, encoding categorical variables, and standardizing numerical features. This step ensured the data was in a suitable format for analysis and modeling.
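These preprocessing steps can be sketched on a toy frame (the column names mirror the coaster data, but the values and the median/standardization choices here are illustrative, not the exact pipeline used above):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the coaster data (values are made up)
toy = pd.DataFrame({
    'speed_mph': [50.0, None, 70.0, 65.0],
    'type_main': ['Steel', 'Wood', 'Steel', 'Steel'],
})

# Handle missing values: fill numeric gaps with the column median
toy['speed_mph'] = toy['speed_mph'].fillna(toy['speed_mph'].median())

# Encode the categorical column as dummy variables
toy = pd.get_dummies(toy, columns=['type_main'], drop_first=True)

# Standardize the numeric feature to zero mean, unit variance
toy[['speed_mph']] = StandardScaler().fit_transform(toy[['speed_mph']])
print(toy)
```

Median imputation is one reasonable choice here; dropping incomplete rows, as done in the regression section, is another.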
Exploratory Data Analysis (EDA): EDA was conducted to understand the distribution of the data, identify patterns, and visualize relationships between variables. Various plots and graphs provided insights into the characteristics of the roller coasters.
Feature Engineering: New features were created, and existing ones were transformed to improve the performance of the models. This included encoding categorical variables and generating dummy variables for machine learning algorithms.
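The dummy-variable step deserves a small illustration, since `drop_first=True` matters for linear models. A minimal sketch with made-up `type_main` levels:

```python
import pandas as pd

# Illustrative frame; these 'type_main' levels are made up
frame = pd.DataFrame({'type_main': ['Steel', 'Wood', 'Steel', 'Hybrid']})

# drop_first=True drops one level (the alphabetically first) so the
# remaining dummy columns are not perfectly collinear with the intercept
dummies = pd.get_dummies(frame, columns=['type_main'], drop_first=True)
print(dummies.columns.tolist())
```

With three levels, two dummy columns remain; the dropped level becomes the baseline that the model's intercept absorbs.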
Modeling: Several machine learning models were built and evaluated: a linear regression to predict coaster speed, and logistic regression and random forest classifiers to predict whether a coaster is still operating.
Evaluation and Visualization: The models were evaluated using various metrics such as accuracy, confusion matrix, and classification report. Visualizations were created to illustrate the results and performance of the models.
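How these classification metrics relate can be seen on a tiny set of hand-made labels (purely illustrative, not model output from this project):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels: 1 = operating, 0 = not operating
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print('accuracy:', accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred))
```

Here one operating coaster is misclassified, so accuracy is 5/6 and the confusion matrix shows a single false negative; the classification report breaks the same counts down into per-class precision and recall.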
Overall, this project successfully demonstrated how to clean and preprocess data, handle missing values, and conduct effective modeling with EDA and visualization. The insights gained from this analysis and the models built provide a solid foundation for further exploration and application of machine learning techniques.