Lecture Notes for Lab 10, Lecture notes of Environmental Science

Lecture notes for lab 10 for GEOL 4342

Typology: Lecture notes

2025/2026

Uploaded on 05/05/2026

Deep Learning for Big Data Analytics
University of Houston
Lab 10: Feature Engineering
Lab Instructor: Rashik Islam
10.1 What is a feature?
In the context of machine learning and data science, a "feature" is an individual measurable
property or characteristic of a phenomenon being observed. Essentially, features are the input
variables used by models to make predictions or classifications. They can be thought of as the
raw data that you feed into a model, which the model then uses to learn patterns that
correlate with particular outcomes.
10.2 Feature Engineering
Feature engineering is a critical process in the development of machine learning models that
involves creating new features or modifying existing ones to improve model performance.
Essentially, it's about transforming raw data into a format that makes it easier for machine
learning algorithms to process, helping the models to uncover patterns or insights that might
not be apparent in the initial dataset. This process can significantly influence the accuracy and
effectiveness of a model, as the right features can help the model to understand the
underlying structure of the data better.
10.2.1 Key Aspects of Feature Engineering:
1. Creation of New Features:
- Domain Knowledge: Utilizing specific knowledge about the domain to generate new
features that could be predictive of the outcome. For example, in a real-estate pricing
model, creating a new feature that represents the age of a property from the 'year
built' can provide valuable information to the model.
- Interaction Terms: Creating features that capture the interaction between two or more
variables. For instance, in predicting customer spend, an interaction feature between
'number of store visits' and 'average spend per visit' might be more predictive than
either feature alone.

2. Feature Transformation:
- Normalization/Standardization: Scaling features so they have a specific statistical property (e.g., zero mean and unit variance for standardization). This is particularly important for models sensitive to the scale of data, like SVM or k-NN.
- Log Transformation: Applying a logarithmic scale to features to reduce skewness, which can help linear models perform better by making the relationship between variables more linear.
3. Feature Selection:
- Removing Redundant or Irrelevant Features: Identifying and removing features that do not contribute much to the predictive power of the model, to reduce complexity and overfitting.
- Selecting Top Features: Using statistical tests, model-based importance, or wrapper methods to select a subset of the most informative features.
4. Handling Missing Values:
- Imputation: Filling missing values with statistics like the mean, median, or mode, or using algorithms that predict missing values.
- Indicator Variables: Creating features that indicate whether a value was missing, which can sometimes capture useful information about missingness.
5. Temporal Features: Extracting information from date/time features, such as the day of the week, month, year, or even part of the day, which might influence the target variable.
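Several of the aspects listed above can be sketched in a few lines of pandas. The snippet below is only an illustration under stated assumptions: the toy table, its column names (`year_built`, `store_visits`, `avg_spend_per_visit`, `income`, `timestamp`), and the reference year are all hypothetical and are not part of the lab data. Feature selection is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name and value is made up for illustration.
df = pd.DataFrame({
    "year_built": [1990, 2005, 2018, 1975],
    "store_visits": [3, 10, 1, 7],
    "avg_spend_per_visit": [20.0, 15.5, 80.0, np.nan],
    "income": [30_000.0, 52_000.0, 250_000.0, 41_000.0],
    "timestamp": pd.to_datetime(
        ["2024-01-05 08:30", "2024-06-15 14:00", "2024-07-04 21:15", "2024-12-25 02:45"]
    ),
})

# Domain knowledge: property age derived from the year built.
reference_year = 2024  # assumed reference year
df["property_age"] = reference_year - df["year_built"]

# Missing values: record missingness first, then impute with the median.
df["avg_spend_missing"] = df["avg_spend_per_visit"].isna().astype(int)
df["avg_spend_per_visit"] = df["avg_spend_per_visit"].fillna(
    df["avg_spend_per_visit"].median()
)

# Interaction term: visits x average spend per visit (computed after imputation).
df["total_spend"] = df["store_visits"] * df["avg_spend_per_visit"]

# Standardization: zero mean and unit variance (population standard deviation).
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Log transformation: log1p compresses the long right tail of income.
df["income_log"] = np.log1p(df["income"])

# Temporal features extracted from the timestamp.
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour

print(df.head())
```

The missing-value indicator is created before imputation on purpose: once the gap is filled, the information about which rows were missing would otherwise be lost.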

Now, if you look closely at the figure, you will see that a few features are highly correlated with each other when predicting PM2.5. This is called multicollinearity.

12.2.3 Multicollinearity
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning they have a strong linear relationship with each other. This condition can make it difficult to distinguish the individual effects of each predictor on the dependent variable, leading to unstable estimates of the regression coefficients. These instabilities can result in large variations in the estimated coefficients for minor changes in the data or model, which complicates interpretation and reduces the statistical power of tests on the individual predictors.

Example 1: House Prices
  • Scenario: Imagine you're trying to predict the price of houses based on various features like the size of the house (in square feet), the number of bedrooms, and the number of bathrooms.
  • Multicollinearity Issue: The size of the house is likely to be highly correlated with both the number of bedrooms and bathrooms. Larger houses tend to have more bedrooms and bathrooms. In this case, if you include all three variables (size, bedrooms, bathrooms) in your model, you might face multicollinearity because it's hard to isolate the effect of each variable on the house price. The model might struggle to determine if the price is high because the house is large, because it has many bedrooms, or because it has many bathrooms, as these factors are intertwined.

Example 2: Education and Income
  • Scenario: You're studying factors that influence an individual's income. Two of the variables you consider are the number of years of education and the level of education (e.g., high school, bachelor's, master's, doctorate).
  • Multicollinearity Issue: There is a natural correlation between the number of years of education and the highest level of education achieved. Typically, achieving a higher level of education requires more years of schooling. Including both these variables in a regression model to predict income might introduce multicollinearity, making it difficult to assess the independent impact of each educational factor on income.

Example 3: Air Quality and Health Impacts
  • Scenario: Researchers are investigating the impact of air pollution on public health, specifically the incidence of respiratory diseases. They decide to include various air pollutants as independent variables in their model, such as nitrogen dioxide (NO2), particulate matter (PM2.5), and sulfur dioxide (SO2).
  • Multicollinearity Issue: Air pollutants often have a high degree of correlation among themselves because they can be emitted from the same sources (e.g., vehicle exhaust, industrial emissions). For instance, high levels of NO2 often coincide with high levels of PM2.5 and SO2 in urban areas due to dense traffic and industrial activities. Including these correlated pollutants in the same model could lead to multicollinearity, making it difficult to determine which specific pollutant has the most significant effect on respiratory health.

3. Commonly Used Methods for Detecting Multicollinearity in Python
  • Correlation Matrices: Use pandas and seaborn to visualize correlations between predictors.
  • Variance Inflation Factor (VIF) Calculation: Calculate the VIF for each predictor using the statsmodels library.

Variance Inflation Factor (VIF):
VIF quantifies how much the variance of an estimated regression coefficient is increased because of multicollinearity, that is, because of the predictor's linear dependence on the other predictors.

How is VIF Calculated?
For each predictor, VIF is obtained by regressing that predictor (as the dependent variable) on all the other predictors (as independent variables) and using the R-squared of that regression. Mathematically, it is defined as:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the coefficient of determination (R-squared) of a regression of predictor i on all the other predictors. A high $R_i^2$ indicates that predictor i can be well predicted by the other predictors, suggesting the presence of multicollinearity.

Interpretation of VIF Values
  • VIF = 1: No correlation between the ith predictor and the remaining predictor variables.
  • 1 < VIF < 5: Moderate correlation that generally does not require action, although the appropriate threshold depends on the context and is set by the researcher.
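To make the VIF definition concrete, the following minimal sketch computes VIF directly from the R-squared of each auxiliary regression, using only NumPy. The data are synthetic and hypothetical (loosely modeled on the house-price example: bedrooms and bathrooms are generated from size, while `house_age` is independent); the helper name `vif` is an illustrative choice, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, hypothetical house data: bedrooms and bathrooms both track size.
size = rng.normal(2000, 500, n)
bedrooms = size / 500 + rng.normal(0, 0.3, n)
bathrooms = size / 800 + rng.normal(0, 0.3, n)
house_age = rng.normal(30, 10, n)  # generated independently of the others

X = np.column_stack([size, bedrooms, bathrooms, house_age])
names = ["size", "bedrooms", "bathrooms", "house_age"]


def vif(X, i):
    """VIF of column i: regress it on the remaining columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()  # R-squared of the auxiliary regression
    return 1.0 / (1.0 - r2)


vifs = {name: vif(X, i) for i, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name:>10}: VIF = {value:.2f}")
```

In this setup the correlated predictors (size, bedrooms) should show clearly inflated values, while the independent one stays close to 1. The statsmodels function used in the lab is expected to follow the same definition, provided a constant column is included in the design matrix.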

import pandas as pd

# df is assumed to be the air-quality DataFrame loaded earlier in the lab.

# Summary Statistics
summary_statistics = df.describe()
summary_statistics

# Option 2: Drop rows with missing values (alternative approach)
df.dropna(inplace=True)

# Parse the date column and use it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Let's calculate the correlation matrix for the dataset to understand
# the correlation among the variables.
# df_daily is assumed to be the daily data prepared earlier in the lab.
correlation_matrix = df_daily.corr()

# Display the correlation matrix
correlation_matrix

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Prepare the dataset for VIF calculation
# Adding a constant to the model for the intercept
X = add_constant(df_daily.drop(['PM2.5'], axis=1))

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data

import matplotlib.pyplot as plt
import seaborn as sns

# Remove the constant row for plotting
vif_data_plot = vif_data.drop(index=0)

# Plot
plt.figure(figsize=(12, 8))
sns.barplot(x='VIF', y='Feature', data=vif_data_plot, orient='h', palette='coolwarm')
plt.title('Variance Inflation Factor (VIF) for Each Predictor')
plt.xlabel('Variance Inflation Factor (VIF)')
plt.ylabel('Feature')
plt.axvline(x=5, color='r', linestyle='--', label='VIF Threshold = 5')
plt.legend()
plt.show()
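Besides the VIF plot, a correlation matrix can be scanned programmatically for strongly correlated predictor pairs. The sketch below is an illustration on synthetic data, not the lab dataset: the column names (`NO2`, `SO2`, `wind_speed`), the shared `traffic` driver, and the 0.8 cutoff are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Hypothetical stand-in for a daily air-quality table: two pollutants share a
# common driver (traffic), while wind speed is generated independently.
traffic = rng.normal(size=n)
toy = pd.DataFrame({
    "NO2": traffic + rng.normal(scale=0.3, size=n),
    "SO2": traffic + rng.normal(scale=0.3, size=n),
    "wind_speed": rng.normal(size=n),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each pair
# of variables appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.8  # assumed cutoff for "highly correlated"
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Pairs flagged this way are candidates for closer inspection with VIF, since pairwise correlation alone cannot detect a predictor that is explained by a combination of several others.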

Exercise 10.1:

1. Use the same data to create a regression model that predicts O3 in the summer months (May, June, July, and August).
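One possible starting point for the month filter is sketched below on synthetic data. It assumes the table has a DatetimeIndex (as set earlier with `set_index('date')`); the `temperature` predictor, the generated O3 values, and the plain least-squares fit are hypothetical placeholders, not the lab's data or a prescribed solution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical daily table with a DatetimeIndex; the values are synthetic.
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
temperature = rng.normal(20, 8, len(dates))
toy = pd.DataFrame(
    {
        "temperature": temperature,
        "O3": 10 + 1.5 * temperature + rng.normal(0, 2, len(dates)),
    },
    index=dates,
)

# Keep only the summer months named in the exercise (May through August).
summer = toy[toy.index.month.isin([5, 6, 7, 8])]

# Ordinary least squares with an intercept, using NumPy only.
A = np.column_stack([np.ones(len(summer)), summer["temperature"].to_numpy()])
coef, *_ = np.linalg.lstsq(A, summer["O3"].to_numpy(), rcond=None)
intercept, slope = coef
print(f"rows: {len(summer)}, intercept: {intercept:.2f}, slope: {slope:.2f}")
```

With the real data, the same filter would be applied before choosing predictors, and the VIF check from this lab can then be repeated on the summer subset.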