
Tourism

Date

April 2023

The dataset is a collection of information generated from Kaggle. It contains tourism results for a number of countries.

Introduction: 

The goal of this project is to analyze Spotify’s music dataset to uncover meaningful patterns and groupings of songs using clustering techniques. With over 232,000 songs and multiple audio and metadata features, the problem lies in understanding how these features interact and group songs into meaningful clusters that reflect shared characteristics. 

Clustering algorithms help group songs by analyzing their audio features like tempo, energy, and danceability, identifying patterns and similarities that group similar tracks together. These clusters may sometimes align with predefined genres, but often reveal unexpected groupings that transcend traditional genre boundaries. For instance, songs from different genres could cluster together if they share similar energetic or rhythmic qualities. These insights can significantly improve music recommendation systems by offering more personalized and diverse suggestions, helping users discover music they might not typically encounter while enhancing their listening experience through tailored playlists that match their mood or preferences.

​

For this project, I used a dataset called Spotify Features. The dataset contains detailed information about songs available on Spotify, including various audio and metadata features. The purpose of this data is to analyze musical attributes and categorize songs based on their characteristics.

The dataset has 18 features, which include:

  1. Genre: The category or type of music, such as Pop or Rock.

  2. Artist Name: The name of the artist or band.

  3. Track Name: The name of the song.

  4. Track ID: A unique identifier for each track.

  5. Popularity: A measure of how popular a song is, ranging from 0 to 100.

  6. Acousticness: A score showing how acoustic (natural) a track is.

  7. Danceability: A score representing how suitable a song is for dancing.

  8. Duration_ms: The length of the track in milliseconds.

  9. Energy: A measure of the intensity and activity of the song.

  10. Instrumentalness: Indicates whether the song is mostly instrumental or contains vocals.

  11. Key: The musical key of the track.

  12. Liveness: A measure of the probability that the song was performed live.

  13. Loudness: The overall volume of the track in decibels.

  14. Mode: Indicates whether the track is in a major or minor key.

  15. Speechiness: A measure of how much spoken word is present in the song.

  16. Tempo: The speed of the song, measured in beats per minute (BPM).

  17. Time Signature: The number of beats in a measure, such as 4/4 or 3/4.

  18. Valence: A score indicating the mood of the song, from sad or angry to happy or uplifting.

​

This dataset offers a rich variety of features, allowing us to explore the characteristics of songs and use clustering methods to group similar tracks together for deeper analysis.

There appear to be no missing values in the dataset, which expedites the clustering process. If there were missing values, I would simply remove them, provided they had no material effect on the desired outcome. The only issue I see is the size of the data itself and drawing a sample from it: the dataset is huge, and a well-chosen sample is enough to support the desired conclusions.
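As a quick sketch of that check and of drawing a sample, something like the following can be used; the file name and sample size here are assumptions, not the exact values from my notebook.

import pandas as pd

# Load the Spotify Features dataset (assumed file name)
df_spotify = pd.read_csv('SpotifyFeatures.csv')

# Confirm there are no missing values in any column
print(df_spotify.isnull().sum())

# Draw a random sample so clustering stays fast on ~232,000 rows
sample = df_spotify.sample(n=10000, random_state=42)  # sample size is an illustrative choice
print(sample.shape)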

​

What is clustering?

Clustering is a way to group similar items together, and it's part of machine learning called unsupervised learning, where the computer doesn’t have answers beforehand. The goal is to organize data so items in the same group, or cluster, are more alike than those in other groups. Two common types of clustering are K-Means and Agglomerative Clustering. K-Means starts by picking a number of groups, then assigns data points to the nearest group center, called a centroid. The centroids move as the computer adjusts the groups to make them more accurate, repeating until the clusters stabilize. Agglomerative Clustering works differently by starting with each data point as its own cluster and gradually merging the closest clusters until only one large group or a set number of groups remain. K-Means is fast and works well with evenly sized clusters, while Agglomerative Clustering takes longer but handles more complex group shapes. Both methods help uncover patterns in data, making it easier to understand and analyze.
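To make the difference concrete, here is a minimal sketch of both algorithms run on a few numerical audio features; the feature subset, the choice of 4 clusters, and the sample DataFrame from the check above are illustrative assumptions.

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

features = ['tempo', 'energy', 'danceability']  # illustrative subset of audio features
X_small = StandardScaler().fit_transform(sample[features])

# K-Means: repeatedly moves 4 centroids until the cluster assignments stabilize
kmeans_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_small)

# Agglomerative: starts with every song as its own cluster and merges the closest pairs
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_small)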

​

 

Data Understanding/Visualization

Histogram PR4.png
Heat_Map PR4.png
Bar_Graph PR4.png
Scatter_Plot PR4.png
Scatter_Plot_Popularity PR4.png

The chart shows how popular the sampled songs are. Most songs have a popularity score around 40–50, meaning they're moderately popular. There are fewer songs with very low or very high popularity. The spike near 0 shows that many songs are not well-known, possibly from new or niche artists.

The heatmap generated provides a clear visualization of the relationships between numerical features. Features like loudness, danceability, and valence might be useful predictors for popularity based on their correlations. Redundant features like energy and loudness, which are highly correlated, may need careful handling to avoid multicollinearity issues in the model.

The bar graph captures the count for each genre in the dataset, with each bar labeled by the percentage it holds. From observation, one can see that Children's Music has a higher count than Rap.

Some genres like Electronic or Dance might cluster towards higher energy and danceability values, whereas others might spread across a wider range.

Popular songs seem to have varying acousticness levels, indicating no immediate dominant trend. However, clustering within specific genres might reveal further insights.

Optimal_K PR4.png
Clustering PR4.png
SIl.png

For this dataset, starting with k-means is practical: its computational speed, simplicity, and compatibility with numerical data make it an effective first choice for analyzing and clustering the dataset.
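A sketch of that first k-means pass on the scaled audio features; the feature list is an assumption, and the 4 clusters follow the elbow analysis discussed further down.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

audio_features = ['danceability', 'energy', 'valence', 'acousticness', 'tempo']  # assumed subset
scaler = StandardScaler()
X = scaler.fit_transform(sample[audio_features])

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
sample['cluster'] = kmeans.fit_predict(X)
print(sample['cluster'].value_counts())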

​

Storytelling

  1. Cluster 0 (High Energy and Fast Songs):

    • Songs with high energy and fast tempo, good for parties or upbeat playlists.

    • Less acoustic, more electronic sounds.

  2. Cluster 1 (Soft and Relaxing Songs):

    • Low energy, mostly acoustic, and slower tempo.

    • Perfect for relaxation, studying, or emotional moments.

  3. Cluster 2 (Happy and Danceable Songs):

    • High energy and positive mood, with good danceability.

    • Great for workouts or celebrations.

  4. Cluster 3 (Balanced and Chill Songs):

    • Moderately energetic with a mix of acoustic and electronic sounds.

    • Fits versatile moods, ideal for calm yet engaging playlists.

​

Conclusion

Songs can be grouped based on their audio features, making it easier to create playlists for specific moods or activities.

  • For example:

    • Use Cluster 2 for high-energy moments.

    • Use Cluster 1 for relaxing or emotional times.

These clusters help understand the types of songs people might enjoy for different situations, making playlist creation more personalized and fun.

​

Impact Section

The Spotify clustering project has the potential to improve user experiences by creating personalized playlists, enhancing user engagement, and supporting lesser-known artists by matching their music to specific audience preferences. However, it also raises concerns such as reinforcing biases by promoting popular genres over niche ones, over-relying on data that might overlook the cultural or emotional value of music, and creating privacy risks with the collection of user data. Additionally, clustering may limit music exploration by keeping users within specific preferences. Overall, while this project offers valuable personalization, it’s essential to address ethical considerations like fairness, diversity, and privacy to maximize its positive impact.

The "elbow" point, where the rate of decrease in WCSS slows down significantly, appears to be at K = 4 or K = 5. This is likely the optimal number of clusters as it balances compactness and simplicity.

Summary:
Cluster 0: High-energy, fast-paced tracks (likely electronic or dance music).
Cluster 1: Calm, acoustic, and melancholic tracks.
Cluster 2: Happy, danceable, and upbeat tracks (modern production).
Cluster 3: Acoustic-focused tracks with moderate energy and positivity.

The cluster centroids represent the average values for features like danceability, energy, valence, acousticness, and tempo, offering insight into the characteristics of each group. One cluster has high energy and tempo, likely representing upbeat and lively songs. Another cluster shows low energy and high acousticness, suggesting calmer or acoustic tracks. A third cluster balances energy and danceability with moderate valence, hinting at a mix of rhythmic and emotionally neutral songs. The final cluster combines high danceability and energy with lower acousticness, reflecting dance or party tracks. These centroids help categorize songs into distinct groups, revealing their defining audio characteristics.
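To read the centroids in their original units rather than the scaled space, they can be inverse-transformed; a small sketch assuming the scaler, model, and feature list from the k-means code above.

import pandas as pd

# Map the centroids back to the original feature scale for interpretation
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=audio_features,
)
print(centroids.round(2))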

Introduction

This project predicts house prices using a dataset from Kaggle that includes features like the number of bedrooms, bathrooms, living space size, population, and household price. It helps potential buyers understand fair market prices for properties with specific attributes, making it easier to decide whether a property is priced appropriately. Agents can use the model to price homes accurately, avoiding underpricing or overpricing. Overall, it addresses the problem of estimating housing prices, which can assist various stakeholders in making informed financial decisions regarding real estate.

​

With features covering both property-specific and locational attributes, this dataset offers a rich foundation for building a machine learning model capable of understanding the complex relationships that drive property prices. The end goal is to utilize these insights to produce reliable and interpretable price predictions, ultimately enhancing decision-making in real estate investments.

​Regression and How it works

Regression is used to predict a target variable given one or more input features. Linear regression models a straight-line (linear) relationship between the target and the input variables; essentially, it finds the best-fit line through the data points by minimizing prediction errors.

​

Linear Regression: (Single feature)

y = β₀ + β₁x + ε

​

  • y is the dependent variable (e.g., house price),

  • x is the independent variable (e.g., square footage),

  • β₀ is the intercept, representing the baseline value of y when x = 0,

  • β₁ is the slope, indicating the change in y for a one-unit increase in x,

  • ε is the error term, capturing the variability in y not explained by x.

​

Here, y is the target and x is the feature; β₀ is the intercept and β₁ is the slope. Linear regression is used for its simplicity and interpretability, but it assumes a linear relationship between the input and output variables.
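As a toy illustration of these terms (the numbers below are made up for illustration, not taken from the housing data):

import numpy as np
from sklearn.linear_model import LinearRegression

# Square footage vs. price, purely illustrative values
x = np.array([[1000], [1500], [2000], [2500]])
y = np.array([200000, 260000, 330000, 390000])

model = LinearRegression().fit(x, y)
print(model.intercept_)  # estimated β₀: baseline price when square footage is 0
print(model.coef_[0])    # estimated β₁: price change per additional square foot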

 

Data Understanding and Pre-processing

I'll be focusing on relevant columns and ignoring irrelevant ones like 'street', 'statezip', 'country', and 'date'. Calculating correlations shows whether key features, like square footage and price, are related, which can guide our predictions. Visualizations, including heatmaps, histograms, and scatter plots, help reveal patterns, such as feature distributions and linear relationships, which lay the groundwork for effective modeling. This provides a clear picture of the data's structure and potential insights before proceeding to model building.

​

The code below identifies missing values.

df.columns  # found 18 columns

​

import pandas as pd

# Load the housing dataset
df = pd.read_csv('datasets/USA Housing Dataset.csv')

# Drop irrelevant columns
df_cleaned = df.drop(columns=['street', 'statezip', 'country', 'date'])

# Check for missing values in each remaining column
missing_values = df_cleaned.isnull().sum()
missing_values

​

The 'street', 'statezip', 'country', and 'date' columns are irrelevant and don't require further exploration. After checking, there are no missing values in the dataset, so the data is clean.

​

​

Experiment 1: Evaluating Performance

Next, I'll encode the city column using one-hot encoding and then split the data into features (X) and target (y) for model training. After that, we can scale the numerical features and proceed to build a linear regression model.
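A sketch of that pipeline, assuming the cleaned DataFrame from above, an 80/20 split, and that 'city' and 'price' are the column names in the file:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# One-hot encode the categorical 'city' column
df_encoded = pd.get_dummies(df_cleaned, columns=['city'], drop_first=True)

X = df_encoded.drop(columns=['price'])
y = df_encoded['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features, fitting the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LinearRegression().fit(X_train_scaled, y_train)
preds = lr.predict(X_test_scaled)
print(mean_squared_error(y_test, preds), r2_score(y_test, preds))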
Results:

(54774079827.50566, 0.47752327662060257)

Mean Squared Error (MSE): 54,828,701,952.41
R-squared (R²): 0.477 (about 47.7%)

​

The R² score indicates that the model explains approximately 47.7% of the variance in housing prices, which leaves room for improvement. The MSE suggests that the average squared difference between predicted and actual prices is quite large, which might imply that additional features or different models could improve performance.
​

​

Experiment 2: 

For Experiment 2, I made changes that might improve the model's performance. First, I added interaction terms or polynomial features to capture non-linear relationships between features, especially between square footage and price. Additionally, a more complex model like Random Forest Regression or Gradient Boosting could better capture interactions and non-linear patterns that a simple linear regression model may miss.

Alternatively, we could refine feature selection by testing feature importance or using dimensionality reduction techniques. After implementing these adjustments, I re-evaluated model performance by comparing the new Mean Squared Error (MSE) and R-squared (R²) values with those from the linear regression model in Experiment 1.

Ideally, we would expect these changes to reduce MSE and increase R², indicating that the new approach captures more variance in housing prices and improves prediction accuracy.
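A sketch of the Random Forest experiment, reusing the split and scaled features from Experiment 1; the hyperparameters shown are assumptions, not tuned values.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)  # trees don't need scaling, but reusing it keeps the comparison consistent
rf_preds = rf.predict(X_test_scaled)

print('MSE:', mean_squared_error(y_test, rf_preds))
print('R²:', r2_score(y_test, rf_preds))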


Random Forest Model Evaluation:

Mean Squared Error (MSE): 57042027934.65769

R-squared (R²): 0.45588986717673896


After making predictions, the model is evaluated using Mean Squared Error (MSE) and R-squared (R²) to gauge its performance. This provides insight into how well the Random Forest model fits the housing price data. Unfortunately, an R² of 45.6% still falls short of our specifications; further improvement is required to find exactly what is needed to increase R².

​

Experiment 3: ​

Considering the previous models, I expected the Gradient Boosting model to perform better than the rest, but unfortunately the results tell a different story.
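A sketch of the Gradient Boosting run, reusing the Experiment 1 split; the hyperparameters are assumptions, and the commented line shows the unscaled alternative discussed below.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=42)
gb.fit(X_train_scaled, y_train)
# gb.fit(X_train, y_train)  # alternative: train on the raw, unscaled features instead
gb_preds = gb.predict(X_test_scaled)

print('MSE:', mean_squared_error(y_test, gb_preds))
print('R²:', r2_score(y_test, gb_preds))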

​

The evaluation results for the Gradient Boosting model:

  • Mean Squared Error (MSE): 108,151,522,201.65

  • R-squared (R²): -0.032 (indicating the model performed worse than a simple mean-based prediction)

These results suggest that the Gradient Boosting model is currently underperforming, possibly due to the standard scaling of features, which may not align with what Gradient Boosting works best with. Gradient Boosting often performs better with raw or minimally processed features, something I only realized too late for this dataset. Adjusting the preprocessing or fine-tuning the model parameters could improve results.

 

​Impact Section

This housing price prediction project can make real estate markets more transparent, helping buyers make informed decisions and potentially improving housing affordability. However, it raises ethical concerns, as models trained on historical data could perpetuate pricing biases across neighborhoods, affecting property values in underserved areas. Privacy is another factor, as using detailed property data requires adherence to strict privacy standards. Economically, this model could enhance market efficiency, though it might give larger firms an advantage, potentially impacting fair competition. Careful handling of these aspects is essential to ensure equitable outcomes and minimize unintended harm.

Conclusion​

From this project and the experiments, I learned the importance of data preprocessing and model selection in improving predictive performance. For instance, scaling numerical features helped standardize the data, which is crucial for models sensitive to feature magnitudes, like linear regression. However, we saw that standard scaling may not benefit all models equally, as Gradient Boosting performed worse after scaling, showing that some algorithms handle raw data better.

Additionally, encoding categorical variables using one-hot encoding was essential, as it allowed us to incorporate the 'city' feature effectively without introducing categorical bias. Experimenting with different models, such as Random Forest and Gradient Boosting, highlighted the limitations and strengths of each. Random Forest captured interactions better than linear regression, though it still left room for improvement.

Through these trials, I also realized that choosing relevant features and experimenting with model parameters are crucial steps. Fine-tuning, feature selection, or using ensemble methods could potentially enhance model accuracy. Ultimately, this project emphasized that iterative experimentation with preprocessing and model adjustments is key to refining predictive performance.

​

References

"The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman​​​

Module Tutorials(in class) 

Gradient Boost Regressor

​https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html

Kaggle.com

https://xgboost.readthedocs.io/en/latest/

https://scikit-learn.org/stable/

​

For the Jupyter notebook, please click download.

https://github.com/Andrad7P/Project-3.git

​

  • Histograms: Used to visualize the distribution of "Visitor Count," "Rating," and "Revenue." This helped in understanding the spread of values and any potential skewness in the data.

Bar Plots for Categorical Variables: I examined the distribution of categories in “Nation” and “Type,” discovering that certain types, like “Adventure” and “Urban,” had significantly more locations without accommodations.

The confusion matrix and classification report show that the model is predicting both classes (0 and 1) with almost equal but moderate accuracy, around 51%. The precision, recall, and F1-scores for both classes are close to 50%, meaning the model is only slightly better than random guessing.

Coeffic.png
Introduction: 

For this project, the goal is to solve a classification problem using a tourism dataset. Specifically, we aim to predict whether a travel destination offers accommodation based on features such as visitor count, country, type of tourism, rating, and revenue. By solving this classification problem, we seek to answer questions like: Which factors most strongly influence the availability of accommodations? and Can we accurately predict accommodation availability based on these features? Understanding these relationships can be valuable for tourism planners, businesses, and travelers in optimizing or improving services offered at various destinations.

The tourism dataset is a synthetically generated collection of data designed to simulate various travel-related metrics across different global locations. The dataset contains key columns such as "Location," "Country," "Category" (type of tourism), "Visitor Count," "Rating," "Revenue," and "Accommodation Availability." It can be utilized for a range of data analysis tasks, including examining tourism trends across nations, exploring visitor ratings, and analyzing relationships between variables like visitor count and revenue.

Although the data is randomly generated and not representative of real-world tourism statistics, it serves as an excellent resource for testing machine learning models and data preprocessing techniques. With a file size of approximately 310.43 KB, the dataset provides a sufficient number of records to facilitate robust analysis and model training without overwhelming computational resources.

Preprocessing: 

Handling Missing Data: The dataset currently shows no sign of missing values, as all columns appear to be fully populated. However, if missing data were present, I would address it by using mean or median imputation for numerical columns, while for categorical columns I would opt for mode imputation to preserve the integrity of the data.

 

Encoding Categorical Variables: The "Nation" and "Type" columns are categorical and will need to be converted into numerical representations. I would use either one-hot encoding or label encoding, depending on the specific model requirements. The "Accommodation Available" column is binary, making it straightforward to convert to 1s and 0s for analysis.

​

Scaling Numerical Variables: For models sensitive to the scale of features, such as linear regression or support vector machines, I would standardize numerical columns like "Visitor Count," "Rating," and "Revenue." This process involves scaling the values so that they have a mean of 0 and a standard deviation of 1.

​

Outlier Detection: The "Visitor Count" and "Revenue" columns could potentially contain outliers, which should be investigated before model training. Depending on how these outliers affect model performance, they can either be removed or treated to prevent skewing the results.
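A rough sketch of how these preprocessing steps could fit together; the file name, the exact column names, and the 'Yes' label for availability are assumptions about the Kaggle file's layout.

import pandas as pd
from sklearn.preprocessing import StandardScaler

tourism = pd.read_csv('tourism_dataset.csv')  # assumed file name

# Convert the binary target to 1s and 0s (assumes the column holds 'Yes'/'No')
tourism['Accommodation_Available'] = (tourism['Accommodation_Available'] == 'Yes').astype(int)

# One-hot encode the categorical columns
tourism = pd.get_dummies(tourism, columns=['Country', 'Category'], drop_first=True)

# Standardize the numerical columns to mean 0 and standard deviation 1
num_cols = ['Visitor_Count', 'Rating', 'Revenue']  # assumed column names
tourism[num_cols] = StandardScaler().fit_transform(tourism[num_cols])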

​

Data Understanding/Visualization 

To further understand the dataset, I performed several exploratory data analysis steps, summarized in the figures below.

Through this exploration, it became clear that both visitor count and revenue likely play key roles in predicting accommodation availability. These insights informed my decision to prioritize models that can handle both numerical and categorical data effectively.

Distribute_rating.png
Matrix.png

Correlation Matrix: Visualized correlations between numerical features like "Visitor Count" and "Revenue." Strong positive correlations were found, indicating that locations with higher visitor counts tend to generate more revenue.

Distribution_Category.png

​Modeling 

To address the classification problem, I selected a Logistic Regression model to test:

  • Logistic Regression: This model was chosen for its simplicity and effectiveness in binary classification problems. Logistic regression models the probability of accommodation availability as a function of the input features. Pros: Simple, interpretable. Cons: May struggle with non-linear patterns in the data.
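A minimal sketch of that Logistic Regression workflow, assuming the preprocessed tourism DataFrame from the preprocessing sketch above and an 80/20 split:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Features and binary target (dropping the free-text 'Location' column is an assumption)
X = tourism.drop(columns=['Accommodation_Available', 'Location'])
y = tourism['Accommodation_Available']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))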


Evaluation 

I evaluated the model to see how it performed and the results can be seen above. 

  • The model's performance is relatively weak, with accuracy, precision, recall, and F1-scores all hovering around 50-51%. This suggests that the model is only slightly better than random guessing.

  • The confusion matrix also shows a large number of false positives and false negatives, indicating that the model struggles to accurately differentiate between the two classes.

Storytelling

I built a model to predict accommodation availability at tourist locations, but its accuracy is only 51%, meaning it predicts correctly about half the time. The model struggles with both false positives (predicting availability when there isn’t any) and false negatives (missing locations where accommodations are available). This could lead to issues like over-promising to customers or missing out on potential business opportunities, highlighting that the model needs improvement before it can reliably assist in decision-making.

Impact Section

The model’s 51% accuracy could negatively impact business decisions by leading to false promises of accommodation availability, damaging customer trust, and causing missed revenue opportunities. It may also result in operational inefficiencies and misguided strategic planning, further damaging the integrity of the company as a whole. To avoid these risks, the model needs improvement before it can be relied upon for key business decisions.

​​

​

Resources: 

https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html

https://www.kaggle.com/datasets/umeradnaan/tourism-dataset

https://www.w3schools.com/python/python_ml_linear_regression.asp

https://www.geeksforgeeks.org/linear-regression-python-implementation/

https://towardsdatascience.com/coding-linear-regression-from-scratch-c42ec079902

https://www.kdnuggets.com/linear-regression-from-scratch-with-numpy

Confusion_matrix.png

Distribution of Locations by Country: Shows how many tourism locations exist per country, revealing geographic focus or bias in the data.
Distribution of Locations by Category: Displays the number of locations in each category (e.g., Beach, Cultural), helping identify which types of locations are most common.
Accommodation Availability by Category: Compares the availability of accommodations across different types of locations, indicating whether certain categories (e.g., Urban or Nature) are more likely to offer places to stay.

Please click download for the Tourism dataset Jupyter notebook.

Introduction to the Problem

In today's world, understanding trends and patterns in data is crucial for making informed decisions. For instance, tracking how prices change over time can reveal important insights into market behavior. The dataset provided consists of various attributes related to white wines, including their characteristics such as name, country of origin, winery, rating, and price. This dataset is crucial for analyzing white wine quality based on customer ratings, identifying the price distribution, and potentially predicting the wine's rating or price based on specific features such as the year of production, region, and winery. The goal is to preprocess this dataset for analysis and build predictive models that can assist in understanding patterns within the wine industry or help users make informed wine purchase decisions.

 

​

Preprocessing the Data

Preprocessing is essential before performing any analysis or building models. This involves cleaning the data, handling missing values, encoding categorical variables, and scaling the data as required. 

​

The first step was to gather and organize the data so we could find the information needed to make the graphs. This allowed us to look at patterns, like how wine prices have changed over the years.

Next, we had to clean the data. This meant making sure that the columns, like prices and years, were in a format that could be easily used for calculations. For example, I focused on making sure the prices were correct and that the years were listed properly. We also removed any errors or unnecessary data to make sure everything was accurate.

Finally, for exploring the data, we grouped it by year to find the average price for each year. This made it easy to see how wine prices have changed over time. The graphs to the side show these trends, helping us spot any patterns.

Price Trends Over Time:

​

The question I started with was whether price trends fluctuated over time. I went into Jupyter and created a line chart to figure out if this was the case, using the average wine price for each year. This showed whether the wine market was going up or down during certain periods. [Click the information on graph 1 for further details]
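The line chart came from grouping by year and averaging the price, roughly as in this sketch; the 'Year' and 'Price' column names are assumptions about the White.csv layout.

import pandas as pd
import matplotlib.pyplot as plt

wine = pd.read_csv('White.csv')  # white wine file from the Kaggle dataset

# Average price per vintage year
avg_price = wine.groupby('Year')['Price'].mean()

avg_price.plot(marker='o')
plt.xlabel('Year')
plt.ylabel('Average price')
plt.title('Average white wine price over time')
plt.show()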

Average_price_over.png

The graph shows prices over the years 1995–2020 (along the bottom), with average price on the side, starting near 250 and decreasing at a rapid pace. From 1995 until shortly before the 2000s, prices were lower across the board, indicating a lower-demand market for white wine. Prices have fluctuated considerably over time, and looking across the years, the cost has been steadily decreasing.

Scatterplot_vs.png

Scatter Plot of Price vs. Rating

The scatter plot was designed to see whether there was a connection between a wine’s price and its customer rating. Surprisingly, the results showed no strong link between higher prices and better ratings. Some mid-priced wines received great ratings, while more expensive wines didn't always score as well. It's better for the average consumer to purchase a reasonably priced wine with the same rating as an expensive bottle. This challenges the common belief that "you get what you pay for." So the scatter plot answered the question of whether price predicts quality, and the answer is: not always, considering that mid-priced wines earned the same ratings as expensive ones.

Country-Based Analysis

The bar and line charts comparing average wine prices and ratings from different countries showed that there are noticeable differences between regions. For instance, wines from places like France and Italy generally had higher prices probably because their wine industries are more well-known. But just because a wine was more expensive didn’t mean it was rated higher. Wines from countries like Spain had similar, and sometimes even better, ratings at lower prices. This shows that some countries can offer better quality wines for less money, which is useful for both buyers and businesses.

coutry_price_rating.png

The impact of wine prices and quality:

 

This project helps people learn more about wine prices and quality, making it easier for them to choose good wines and giving businesses ideas about pricing and marketing. However, there are some risks. The data might lead to general conclusions that don’t apply to all wines. It also doesn’t include important things like expert reviews or environmental factors, which could affect wine quality or price. Some people might misunderstand the results, thinking the graphs tell the whole story when, in reality, factors like personal taste and wine age are missing. Adding more information could make the project stronger and more complete.

References: 

https://dataplotplus.com/how-to-create-bar-plot-with-two-y-axis-bars-in-pandas/
https://cmdlinetips.com/2019/10/how-to-make-a-plot-with-two-different-y-axis-in-python-with-matplotlib/#google_vignette
https://www.statology.org/matplotlib-two-y-axes/

https://www.immerse.education/study-tips/tackling-homework-anxiety/

https://www.kaggle.com/datasets/budnyak/wine-rating-and-price?select=White.csv

​

​

​

Github: git@github.com:CoffeDemon101/WhiteWine.git

​

https://github.com/CoffeDemon101/WhiteWine.git

​

gh repo clone CoffeDemon101/WhiteWine
