Lecture Notes for Lab 10, Lecture notes of Environmental Science

University of Houston (UH), Environmental Science

Lecture notes for lab 10 for GEOL 4342

Typology: Lecture notes

2025/2026

Uploaded on 05/05/2026

mustufa-khan-1 🇺🇸

Deep Learning for Big Data Analytics

University of Houston

Lab 10: Feature Engineering

Lab Instructor: Rashik Islam

10.1 What is a feature?

In the context of machine learning and data science, a "feature" is an individual measurable property or characteristic of a phenomenon being observed. Essentially, features are the input variables used by models to make predictions or classifications. They can be thought of as the raw data that you feed into a model, which the model then uses to learn patterns that correlate with particular outcomes.

10.2 Feature Engineering

Feature engineering is a critical process in the development of machine learning models that involves creating new features or modifying existing ones to improve model performance. Essentially, it's about transforming raw data into a format that makes it easier for machine learning algorithms to process, helping the models to uncover patterns or insights that might not be apparent in the initial dataset. This process can significantly influence the accuracy and effectiveness of a model, as the right features can help the model to understand the underlying structure of the data better.

10.2.1 Key Aspects of Feature Engineering:

1. Creation of New Features:

• Domain Knowledge: Utilizing specific knowledge about the domain to generate new features that could be predictive of the outcome. For example, in a real-estate pricing model, creating a new feature that represents the age of a property from the 'year built' can provide valuable information to the model.

• Interaction Terms: Creating features that capture the interaction between two or more variables. For instance, in predicting customer spend, an interaction feature between 'number of store visits' and 'average spend per visit' might be more predictive than either feature alone.

2. Feature Transformation:

• Normalization/Standardization: Scaling features so they have a specific statistical property (e.g., zero mean and unit variance for standardization). This is particularly important for models sensitive to the scale of data, like SVM or k-NN.

• Log Transformation: Applying a logarithmic scale to features to reduce skewness, which can help linear models perform better by making the relationship between variables more linear.

3. Feature Selection:

• Removing Redundant or Irrelevant Features: Identifying and removing features that do not contribute much to the predictive power of the model, to reduce complexity and overfitting.

• Selecting Top Features: Using statistical tests, model-based importance, or wrapper methods to select a subset of the most informative features.

4. Handling Missing Values:

• Imputation: Filling missing values with statistics like mean, median, or mode, or using algorithms that predict missing values.

• Indicator Variables: Creating features that indicate whether a value was missing, which can sometimes capture useful information about missingness.

5. Temporal Features: Extracting information from date/time features, such as the day of the week, month, year, or even part of the day, which might influence the target variable.
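The aspects listed above can be sketched on a small table. This is a minimal illustration, not the lab's own code: the toy data and every column name (`sale_date`, `year_built`, `store_visits`, `avg_spend`, `income`) are hypothetical, and median imputation is just one of the options mentioned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from the lab dataset.
raw = pd.DataFrame({
    "sale_date": pd.to_datetime(["2024-01-15", "2024-06-03", "2024-11-23", "2025-03-09"]),
    "year_built": [1990, 2005, 2018, 1975],
    "store_visits": [4, 10, 2, 7],
    "avg_spend": [25.0, 12.5, np.nan, 40.0],
    "income": [30_000.0, 52_000.0, 410_000.0, 75_000.0],
})


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Missing values: record missingness first, then impute with the median.
    out["avg_spend_missing"] = out["avg_spend"].isna().astype(int)
    out["avg_spend"] = out["avg_spend"].fillna(out["avg_spend"].median())

    # New features: domain knowledge (property age) and an interaction term.
    out["property_age"] = out["sale_date"].dt.year - out["year_built"]
    out["total_spend"] = out["store_visits"] * out["avg_spend"]

    # Transformations: log1p to reduce right skew, and standardization
    # (zero mean, unit variance).
    out["log_income"] = np.log1p(out["income"])
    out["income_z"] = (out["income"] - out["income"].mean()) / out["income"].std(ddof=0)

    # Temporal features extracted from the date column.
    out["month"] = out["sale_date"].dt.month
    out["day_of_week"] = out["sale_date"].dt.dayofweek  # Monday = 0
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    return out


features = engineer_features(raw)
print(features.head())
```

Imputing before building the interaction term keeps `total_spend` defined for every row, while the indicator column preserves the fact that a value was originally missing.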

Now, if you look closely at the figure, you will see that a few features are highly correlated with each other when predicting PM2.5. This is called multicollinearity.

12.2.3 Multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning they have a strong linear relationship with each other. This condition can make it difficult to distinguish the individual effects of each predictor on the dependent variable, leading to unstable estimates of the regression coefficients. These instabilities can result in large variations in the estimated coefficients for minor changes in the data or model, which complicates the interpretation and reduces the statistical power of the independent variables.

Example 1: House Prices

Scenario: Imagine you're trying to predict the price of houses based on various features like the size of the house (in square feet), the number of bedrooms, and the number of bathrooms.

Multicollinearity Issue: The size of the house is likely to be highly correlated with both the number of bedrooms and bathrooms. Larger houses tend to have more bedrooms and bathrooms. In this case, if you include all three variables (size, bedrooms, bathrooms) in your model, you might face multicollinearity because it's hard to isolate the effect of each variable on the house price. The model might struggle to determine if the price is high because the house is large, because it has many bedrooms, or because it has many bathrooms, as these factors are intertwined.

Example 2: Education and Income

Scenario: You're studying factors that influence an individual's income. Two of the variables you consider are the number of years of education and the level of education (e.g., high school, bachelor's, master's, doctorate).

Multicollinearity Issue: There is a natural correlation between the number of years of education and the highest level of education achieved. Typically, achieving a higher level of education requires more years of schooling. Including both these variables in a regression model to predict income might introduce multicollinearity, making it difficult to assess the independent impact of each educational factor on income.

Example 3: Air Quality and Health Impacts

Scenario: Researchers are investigating the impact of air pollution on public health, specifically the incidence of respiratory diseases. They decide to include various air pollutants as independent variables in their model, such as nitrogen dioxide (NO2), particulate matter (PM2.5), and sulfur dioxide (SO2).

Multicollinearity Issue: Air pollutants often have a high degree of correlation among themselves because they can be emitted from the same sources (e.g., vehicle exhaust, industrial emissions). For instance, high levels of NO2 often coincide with high levels of PM2.5 and SO2 in urban areas due to dense traffic and industrial activities. Including these correlated pollutants in the same model could lead to multicollinearity, making it difficult to determine which specific pollutant has the most significant effect on respiratory health.

3. Commonly Used Methods for Detecting Multicollinearity in Python
• Correlation Matrices: Use pandas and seaborn to visualize correlations between predictors.

• Variance Inflation Factor (VIF) Calculation: Calculate the VIF for each predictor using the statsmodels library.

Variance Inflation Factor (VIF): VIF quantifies how much the variance of an estimated regression coefficient is inflated by linear dependence on the other predictors.

How is VIF Calculated? For each predictor i, fit a regression in which predictor i is the dependent variable and all other predictors are the independent variables, then take the R-squared of that regression. Mathematically, VIF is defined as:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the coefficient of determination (R-squared) of a regression of predictor i on all the other predictors. A high $R_i^2$ indicates that predictor i can be well predicted by the other predictors, suggesting the presence of multicollinearity.

Interpretation of VIF Values

• VIF = 1: No correlation between the ith predictor and the remaining predictor variables.

• 1 < VIF < 5: Moderate correlation that generally does not require action, although this depends on the context and on the specific threshold set by the researcher.
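To make the definition concrete, the sketch below computes VIF directly from the R-squared of each auxiliary regression, using only NumPy. It is an illustration under stated assumptions: the helper name `vif_from_r2` and the synthetic house-style data (`size`, `bedrooms`, and an unrelated predictor) are hypothetical and are not part of the lab dataset.

```python
import numpy as np


def vif_from_r2(X: np.ndarray) -> np.ndarray:
    """Return 1 / (1 - R_i^2) for each column i of X.

    R_i^2 comes from an ordinary least squares fit of column i on all
    other columns plus an intercept.
    """
    n, p = X.shape
    vifs = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic data: bedrooms is built from size, so the two are strongly related.
rng = np.random.default_rng(0)
n = 500
size = rng.normal(2000, 500, n)                # square feet
bedrooms = size / 500 + rng.normal(0, 0.3, n)  # tied to size, plus noise
unrelated = rng.normal(0, 1, n)                # independent of the others
X = np.column_stack([size, bedrooms, unrelated])

vifs = vif_from_r2(X)
for name, value in zip(["size", "bedrooms", "unrelated"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

In this setup the two related predictors should show clearly inflated values, while the independent predictor should stay close to 1.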

    import pandas as pd

    # Summary Statistics
    summary_statistics = df.describe()
    summary_statistics

    # Option 2: Drop rows with missing values (alternative approach)
    df.dropna(inplace=True)
    df['date'] = pd.to_datetime(df['date'])
    df.set_index('date', inplace=True)

    # Let's calculate the correlation matrix for the dataset to understand
    # the correlation among the variables.
    # Note: df_daily must already exist at this point; the step that
    # creates it is not shown here.
    correlation_matrix = df_daily.corr()

    # Display the correlation matrix
    correlation_matrix

    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Prepare the dataset for VIF calculation
    # Adding a constant to the model for the intercept
    X = add_constant(df_daily.drop(['PM2.5'], axis=1))

    # Calculate VIF for each feature
    vif_data = pd.DataFrame()
    vif_data['Feature'] = X.columns
    vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif_data

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Remove the constant row for plotting
    vif_data_plot = vif_data.drop(index=0)

    # Plot
    plt.figure(figsize=(12, 8))
    sns.barplot(x='VIF', y='Feature', data=vif_data_plot, orient='h', palette='coolwarm')
    plt.title('Variance Inflation Factor (VIF) for Each Predictor')
    plt.xlabel('Variance Inflation Factor (VIF)')
    plt.ylabel('Feature')
    plt.axvline(x=5, color='r', linestyle='--', label='VIF Threshold = 5')
    plt.legend()
    plt.show()
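A correlation matrix can be hard to scan when there are many predictors. As a possible complement to the code above, the sketch below lists only the predictor pairs whose absolute correlation exceeds a cutoff. The helper `correlated_pairs`, the 0.8 cutoff, and the synthetic `demo` frame (with hypothetical `NO2`, `PM2.5`, and `wind_speed` columns) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List each pair of columns whose absolute Pearson correlation >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "correlation"]
    strong = pairs[pairs["correlation"].abs() >= threshold]
    return strong.sort_values("correlation", key=np.abs, ascending=False).reset_index(drop=True)


# Synthetic example: two pollutants driven by a shared source, plus an unrelated variable.
rng = np.random.default_rng(1)
shared_source = rng.normal(size=300)
demo = pd.DataFrame({
    "NO2": shared_source + rng.normal(scale=0.2, size=300),
    "PM2.5": shared_source + rng.normal(scale=0.2, size=300),
    "wind_speed": rng.normal(size=300),
})

strong = correlated_pairs(demo, threshold=0.8)
print(strong)
```

Pairs flagged this way are natural candidates to inspect with VIF before deciding whether to drop or combine a predictor.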

Exercise 10.1:

1. Use the same data to create a regression model that predicts O3 in the summer months (May, June, July, and August).
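One way to approach the month filtering is sketched below on synthetic data, since the lab dataset is not reproduced here. The frame `df_daily` is rebuilt from random numbers, the predictor names `temperature` and `NO2` are hypothetical stand-ins, and the ordinary least squares fit with NumPy is only one of several reasonable model choices.

```python
import numpy as np
import pandas as pd

# Synthetic daily data with a DatetimeIndex; values and column names are made up.
rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
temp = 20 + 10 * np.sin(2 * np.pi * (doy - 100) / 365) + rng.normal(0, 2, len(dates))
no2 = rng.normal(15, 4, len(dates))
o3 = 10 + 1.5 * temp - 0.4 * no2 + rng.normal(0, 3, len(dates))
df_daily = pd.DataFrame({"temperature": temp, "NO2": no2, "O3": o3}, index=dates)

# Keep only May through August using the month of the DatetimeIndex.
summer = df_daily[df_daily.index.month.isin([5, 6, 7, 8])]

# Ordinary least squares with an intercept, solved with NumPy.
X = np.column_stack([np.ones(len(summer)), summer[["temperature", "NO2"]].to_numpy()])
y = summer["O3"].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared on the summer subset.
pred = X @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("coefficients (intercept, temperature, NO2):", coef.round(3))
print("R^2:", round(r2, 3))
```

With the real data, the same filter applies as long as the index is a DatetimeIndex, and the VIF check from the lab can be run on the summer subset before choosing predictors.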
