




Lecture notes for lab 10 for GEOL 4342
Typology: Lecture notes





University of Houston
Lab Instructor: Rashik Islam

10.1 What is a feature?

In the context of machine learning and data science, a "feature" is an individual measurable property or characteristic of a phenomenon being observed. Features are the input variables that models use to make predictions or classifications. They can be thought of as the data you feed into a model, from which the model learns patterns that correlate with particular outcomes.

10.2 Feature Engineering

Feature engineering is a critical process in the development of machine learning models that involves creating new features or modifying existing ones to improve model performance. It transforms raw data into a format that is easier for machine learning algorithms to process, helping models uncover patterns or insights that might not be apparent in the initial dataset. This process can significantly influence the accuracy and effectiveness of a model, because the right features help the model capture the underlying structure of the data.

10.2.1 Key Aspects of Feature Engineering:
1. Creation of New Features:
   - Domain Knowledge: Utilizing specific knowledge about the domain to generate new features that could be predictive of the outcome. For example, in a real-estate pricing model, creating a new feature that represents the age of a property from the 'year built' can provide valuable information to the model.
   - Interaction Terms: Creating features that capture the interaction between two or more variables. For instance, in predicting customer spend, an interaction feature between 'number of store visits' and 'average spend per visit' might be more predictive than either feature alone.
2. Feature Transformation:
   - Normalization/Standardization: Scaling features so they have a specific statistical property (e.g., zero mean and unit variance for standardization). This is particularly important for models sensitive to the scale of data, like SVM or k-NN.
   - Log Transformation: Applying a logarithmic scale to features to reduce skewness, which can help linear models perform better by making the relationship between variables more linear.
3. Feature Selection:
   - Removing Redundant or Irrelevant Features: Identifying and removing features that contribute little to the predictive power of the model, to reduce complexity and overfitting.
   - Selecting Top Features: Using statistical tests, model-based importance, or wrapper methods to select a subset of the most informative features.
4. Handling Missing Values:
   - Imputation: Filling missing values with statistics like the mean, median, or mode, or using algorithms that predict missing values.
   - Indicator Variables: Creating features that indicate whether a value was missing, which can sometimes capture useful information about missingness.
5. Temporal Features: Extracting information from date/time features, such as the day of the week, month, year, or even part of the day, which might influence the target variable.
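
Several of the aspects listed above can be sketched in a few lines of pandas. The example below is only an illustration under stated assumptions: the toy DataFrame, all of its column names and values, and the reference year are made up here and are not part of the lab data. Feature selection (item 3) is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every column name and value is made up for illustration.
df = pd.DataFrame({
    "year_built": [1990, 2005, 2015, 1978],
    "store_visits": [4, 10, 2, 7],
    "avg_spend": [25.0, 12.5, 80.0, np.nan],
    "income": [30_000.0, 45_000.0, 1_200_000.0, 52_000.0],
    "timestamp": pd.to_datetime(
        ["2024-01-05 08:30", "2024-06-15 14:00",
         "2024-07-04 21:15", "2024-12-25 11:45"]
    ),
})

# Item 1, domain knowledge: property age from 'year built'.
REFERENCE_YEAR = 2024  # assumed reference year
df["property_age"] = REFERENCE_YEAR - df["year_built"]

# Item 4, missing values: record missingness first, then impute with the median.
df["avg_spend_missing"] = df["avg_spend"].isna().astype(int)
df["avg_spend"] = df["avg_spend"].fillna(df["avg_spend"].median())

# Item 1, interaction term: visits multiplied by average spend per visit.
df["total_spend"] = df["store_visits"] * df["avg_spend"]

# Item 2, log transformation: log1p reduces the right skew of income.
df["log_income"] = np.log1p(df["income"])

# Item 2, standardization: zero mean and unit variance.
df["log_income_z"] = (
    (df["log_income"] - df["log_income"].mean()) / df["log_income"].std(ddof=0)
)

# Item 5, temporal features extracted from a datetime column.
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["hour"] = df["timestamp"].dt.hour

print(df.drop(columns=["timestamp"]).round(2))
```

Note that the missingness indicator is created before imputation; once the gap is filled, the information about which rows were missing would otherwise be lost.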
Now, if you look closely at the figure, you will see that a few features are highly correlated with each other when predicting PM2.5. This is called multicollinearity.

12.2.3 Multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, meaning they have a strong linear relationship with each other. This makes it difficult to distinguish the individual effect of each predictor on the dependent variable and leads to unstable estimates of the regression coefficients. The estimated coefficients can change substantially in response to minor changes in the data or model, which complicates interpretation and reduces the statistical power of tests on the individual predictors.

Example 1: House Prices
pollutants as independent variables in their model, such as nitrogen dioxide (NO2), particulate matter (PM2.5), and sulfur dioxide (SO2).
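
The instability of coefficient estimates under multicollinearity can be seen in a small simulation. This is a minimal sketch on synthetic data, not the lab's dataset: the helper names `fit_ols` and `slope_spread`, the sample sizes, and the correlation levels are all assumptions chosen for illustration. It repeatedly fits y = x1 + x2 + noise and measures how much the estimated coefficient of x1 varies from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SAMPLES = 200  # observations per simulated dataset (assumed)


def fit_ols(X, y):
    """Return least-squares coefficients for y ~ X, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def slope_spread(rho, n_repeats=200):
    """Std. dev. of the estimated x1 coefficient when corr(x1, x2) = rho."""
    slopes = []
    for _ in range(n_repeats):
        x1 = rng.normal(size=N_SAMPLES)
        # x2 has unit variance and population correlation rho with x1.
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=N_SAMPLES)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=N_SAMPLES)
        slopes.append(fit_ols(np.column_stack([x1, x2]), y)[1])
    return float(np.std(slopes))


spread_correlated = slope_spread(rho=0.99)   # nearly collinear predictors
spread_independent = slope_spread(rho=0.0)   # uncorrelated predictors

print(f"spread of x1 coefficient, rho=0.99: {spread_correlated:.3f}")
print(f"spread of x1 coefficient, rho=0.00: {spread_independent:.3f}")
```

The true coefficient of x1 is 1.0 in both settings; only the correlation between the predictors changes, yet the estimates scatter far more widely in the nearly collinear case.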
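import pandas as pd

# df is assumed to be already loaded as a pandas DataFrame with a 'date' column.
# Summary statistics for each numeric column
summary_statistics = df.describe()
summary_statistics

# Drop rows with missing values, then use the parsed date as the index
df.dropna(inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Compute the correlation among the variables.
# df_daily is assumed to be the daily-aggregated version of df; that step is not shown here.
correlation_matrix = df_daily.corr()
correlation_matrix

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Predictors only (drop the target PM2.5), plus a constant column for the intercept
X = add_constant(df_daily.drop(['PM2.5'], axis=1))

# VIF for every column of X, including the constant
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data

import matplotlib.pyplot as plt
import seaborn as sns

# Drop the constant term (row 0) before plotting
vif_data_plot = vif_data.drop(index=0)

# Horizontal bar chart of VIF per predictor, with a reference line at VIF = 5
plt.figure(figsize=(12, 8))
sns.barplot(x='VIF', y='Feature', data=vif_data_plot, orient='h', palette='coolwarm')
plt.title('Variance Inflation Factor (VIF) for Each Predictor')
plt.xlabel('Variance Inflation Factor (VIF)')
plt.ylabel('Feature')
plt.axvline(x=5, color='r', linestyle='--', label='VIF Threshold = 5')
plt.legend()
plt.show()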
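
To make the VIF less of a black box, it can also be computed by hand. For predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors (with an intercept). The sketch below uses only NumPy and synthetic data; the function name `vif` and the three made-up columns are assumptions for illustration, and the results should agree with statsmodels on the same predictors only up to numerical precision.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    out = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        ss_res = residuals @ residuals
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r_squared)
    return out


# Synthetic predictors: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

The two nearly identical columns receive very large VIF values, well above the threshold of 5 drawn in the plot, while the independent column stays close to 1.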
Exercise 10.1:
1. Use the same data and create a regression model to predict O3 in the summer months (May, June, July, and August).
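
As a possible starting point for the exercise, rows can be restricted to the summer months through the month attribute of a DatetimeIndex. This sketch uses a made-up DataFrame named `demo` with random values rather than the lab data, and it stops at the filtering step; the regression itself is left to you.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lab data: one year of daily values, date-indexed.
idx = pd.date_range("2023-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"O3": rng.normal(40, 10, len(idx)), "NO2": rng.normal(20, 5, len(idx))},
    index=idx,
)

# Keep only May (5), June (6), July (7), and August (8).
SUMMER_MONTHS = [5, 6, 7, 8]
summer = demo[demo.index.month.isin(SUMMER_MONTHS)]

print(summer.index.min().date(), summer.index.max().date(), len(summer))
```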