Assignment 1 Business Intelligence, Assignments of Artificial Intelligence

Assignment 1 1641 pass Business Intelligence

Typology: Assignments

2023/2024

Uploaded on 10/18/2023

ASSIGNMENT 1 FRONT SHEET
Qualification
BTEC Level 5 HND Diploma in Computing
Unit number and title
Unit 14: Business Intelligence
Submission date
13/10/2023
Date Received 1st submission
13/10/2023
Re-submission Date
19/10/2023
Date Received 2nd submission
19/10/2023
Student Name
Dang Le Tuan Kiet
Student ID
GCS210900
Class
GCS1004B
Assessor name
Nguyen Xuan Sam
Student declaration
I certify that the assignment submission is entirely my own work and I fully understand the consequences of plagiarism. I understand that
making a false declaration is a form of malpractice.
Student’s signature
Kiet
Grading grid
P2
M2
D2

Summative Feedback:

Resubmission Feedback:

Grade: Assessor Signature: Date: IV Signature:

Contents

1. Introduction
  • 1.1 Real estate business process
  • 1.2 Motivations
  • 1.3 Objectives
  • 1.4 Summary
2. Related works
  • 2.1 Dataset
  • 2.1.1 Data collection
  • 2.1.2 Dataset description
  • 2.2 Linear regression
  • 2.3 Correlation
3. Simulating scenario and results
  • 3.1 Package installation
  • 3.2 Correlation
  • 3.3 Scenarios and analysis
  • 3.3.1 Scenario 1
  • 3.3.2 Scenario 2
  • 3.3.3 Scenario 3
  • 3.3.4 Scenario 4
4. Conclusions
References

Table of Figures

  • Figure 1. The factors impact on house price.
  • Figure 2. HCMC and Ha Noi apartments prices.
  • Figure 3. 5 rows of data from dataset.
  • Figure 4. 21 features of dataset.
  • Figure 5. Linear regression model.
  • Figure 6. Two-Variable Relationships.
  • Figure 7. Correlation between Two Variables.
  • Figure 8. Pearson product moment correlation.
  • Figure 9. Anaconda Website Download.
  • Figure 10. Anaconda Interface.
  • Figure 11. Jupyter Notebook.
  • Figure 12. Jupyter Notebook's homepage.
  • Figure 13. Create new folder.
  • Figure 14. Import packages.
  • Figure 15. Heatmap.
  • Figure 16. Price vs sqft_living.
  • Figure 17. Price vs Grade.
  • Figure 18. Price vs Bathrooms.
  • Figure 19. Price vs age.

1. Introduction

As shown in Figure 1, the housing market's dynamics are shaped by several key factors, including overall economic conditions, interest rates, real income levels, and changes in population size. Alongside these demand-side factors, the supply of housing also plays a critical role. When demand surges while supply is limited, both house prices and rents rise, which in turn heightens the risk of homelessness.

Many data science projects now use machine learning for price prediction. Given existing features, a trained model can forecast the outcome for new data points, and regression models are among the most widely used tools for this kind of predictive analysis. Their primary objective is to anticipate future outcomes, which has applications in economics, business, banking, healthcare, e-commerce, entertainment, sports, and many other domains. Consequently, regression is widely used to build models that forecast prices from specific features.

Figure 1. The factors impact on house price.

  • The impact of the house's grade on its price.
  • The impact of the number of bathrooms on the house price.
  • The impact of the house’s age on its price.

1.4 Summary

The preceding section outlined the motivations that underpin my research, analysis, and statistical exploration of the data. I also have the objective of reviewing relevant literature. Additionally, this report encompasses four distinct components, with one of them focusing on the data, including its source, characteristics, and the methodologies employed for data refinement.

2. Related works

Many machine learning techniques and models are used to predict housing prices. My project uses housing data from King County, Washington, in the United States.

2.1 Dataset

2.1.1 Data collection

I sourced the dataset from Kaggle (Burhan Y. Kiyakoglu, 2019). It covers house sales in King County, United States, from May 2014 to May 2015 and comprises 21 columns and 21,613 rows. The price is the dependent variable, while all other columns except 'id' and 'date' are independent variables. Figure 3 shows the first five rows of the data.

Figure 3. 5 rows of data from dataset.
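The split between the dependent variable and the independent variables described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code used in the project: the function names `preview` and `feature_columns` are hypothetical, and the miniature CSV text is made up so that the example runs without the real file.

```python
import csv
import io

TARGET = "price"            # dependent variable named in the text
EXCLUDED = {"id", "date"}   # columns the text leaves out of the predictors


def preview(source, n=5):
    """Return the column names and the first n rows of a CSV file object."""
    reader = csv.DictReader(source)
    rows = [row for _, row in zip(range(n), reader)]
    return reader.fieldnames, rows


def feature_columns(header):
    """Every column except the target and the excluded identifiers."""
    return [name for name in header if name != TARGET and name not in EXCLUDED]


# Made-up miniature file for illustration only; not rows from the dataset.
toy_csv = io.StringIO(
    "id,date,price,bedrooms,sqft_living\n"
    "1,20140502T000000,300000,3,1500\n"
    "2,20140503T000000,450000,4,2200\n"
)
header, rows = preview(toy_csv)
print(header)                   # all column names
print(feature_columns(header))  # independent variables only
print(len(rows))                # number of rows read (at most n)
```

With the downloaded file, the same functions could be called on an opened file object instead of `toy_csv`; the file name depends on where the Kaggle download was saved.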

2.1.2 Dataset description

Explanation of 21 features:

  • ID: A unique identifier for each house.
  • Date: The date when the house was purchased.
  • Price: The cost of the house.
  • Bedrooms: The number of bedrooms in the house.
  • Bathrooms: The count of bathrooms within the house.
  • Sqft_living: The interior living area of the house in square feet.
  • Sqft_lot: The land area of the property in square feet.
  • Floors: The number of levels or floors in the house.
  • Waterfront: A binary indicator for houses situated near water bodies.
  • View: A measure of the scenic view associated with the property.
  • Condition: An assessment of the overall condition of the house.
  • Grade: An evaluation of the overall quality and grade of the house.
  • Sqft_above: The interior area of the house above the ground.
  • Sqft_basement: The living space located in the basement.
  • Yr_built: The year when construction of the house was completed.
  • Yr_renovated: The year in which the house underwent renovation.
  • Zipcode: The postal code of the house's location.
  • Lat: The latitude coordinate of the property.
  • Long: The longitude coordinate of the property.

Figure 4. 21 features of dataset.

Figure 6. Two-Variable Relationships.

Figure 7. Correlation between Two Variables.

Figure 7 gives examples that visually demonstrate the correlation between two variables. When the correlation coefficient equals either +1.0 or -1.0, the relationship is perfectly linear: all the data points, represented by (x, y) pairs, lie exactly on a straight line. The more the data points deviate from a straight line, the closer the correlation coefficient gets to 0.0, signifying a weaker linear relationship between the two variables. In essence, the deviation from a straight-line pattern indicates the strength of the correlation, with 0.0 representing no linear correlation and +1.0 or -1.0 indicating a perfect linear correlation (Groebner et al., 2018).

Figure 8. Pearson product moment correlation.

The formula used to calculate the correlation between two variables is known as the Pearson product-moment correlation coefficient. The sample correlation coefficient r can range from a perfect negative correlation (r = -1.0) to a perfect positive correlation (r = +1.0). The correlation is perfect when every point on the scatter plot lies on a straight line. When there is no linear relationship between the x and y variables, their correlation is 0.0. Thus, the further the correlation is from 0.0, the stronger the linear relationship between the two variables. The sign of the correlation coefficient indicates the direction of the relationship (Groebner et al., 2018).
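The behaviour described above can be checked with a small sketch. It uses the standard definition of the sample coefficient, r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²). The function name `pearson_r` is an illustrative choice, and the three short data series are made up to show a perfect positive line, a perfect negative line, and a pattern with no linear trend.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))   # co-variation of x and y
    sxx = sum(a * a for a in dx)               # variation of x
    syy = sum(b * b for b in dy)               # variation of y
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # points on a rising straight line
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # points on a falling straight line
print(pearson_r(xs, [2, 1, 4, 1, 2]))    # symmetric pattern, no linear trend
```

In a pandas workflow the same coefficient is usually obtained for all column pairs at once with `DataFrame.corr()`, which uses the Pearson method by default.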

3. Simulating scenario and results

3.1 Package installation

I’m using Anaconda for my project.

Step 1: To use Python for data analysis with Jupyter, first set up and install Anaconda. I downloaded the installer from https://www.anaconda.com/download.

Figure 9. Anaconda Website Download.

Step 2: Launch Jupyter Notebook from the Anaconda interface shown in Figure 10. After launch, the Jupyter Notebook home page appears, as shown in Figure 12.

Figure 12. Jupyter Notebook's homepage.

To create a new notebook, click the "New" button and choose "Python 3" from the options.

Figure 13. Create new folder.

Step 3: Import all the packages necessary for this work. Figure 14 shows the 5 packages needed for the project:

  • NumPy: NumPy, short for "Numerical Python," is a well-known open-source Python library. It offers built-in mathematical tools for efficient data processing and is particularly adept at handling large matrices and multidimensional data. NumPy can be used for tasks like random number generation, general data storage in multidimensional arrays, and various linear algebra operations.
  • Pandas: Pandas is an open-source library under a BSD license that is widely used for data processing, cleaning, and analysis. It simplifies data manipulation and allows for straightforward data modeling without the need to switch to another language like R. Pandas is excellent for working with structured data and dataframes.
  • Matplotlib: Matplotlib is a versatile open-source toolkit designed for visualizing and exploring numerical data. It is used to create a wide range of data visualizations, including line charts, pie charts, scatterplots, and histograms. Matplotlib is often used in conjunction with other libraries like NumPy and Pandas for data visualization.
  • Statsmodels: Statsmodels is a Python package focused on statistical modeling. It provides resources for performing statistical research and developing statistical models. This package is particularly

Figure 14. Import packages.
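Before importing, it can help to confirm that the packages are available in the active Anaconda environment. The sketch below is an assumption about how one might do that, not a step taken from the report; it covers only the four packages named in the list above, and `check_packages` is a hypothetical helper name.

```python
import importlib.util

# The four packages named in the list above, by import name.
REQUIRED = ["numpy", "pandas", "matplotlib", "statsmodels"]


def check_packages(names):
    """Map each import name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

Once every entry reports as installed, the packages are commonly imported under the aliases `np`, `pd`, `plt` (for `matplotlib.pyplot`), and `sm` (for `statsmodels.api`).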

I've collected a set of high correlation pairs and some features that I consider important when looking for a house. Here are the pairs I've identified:

  • Price and sqft_living.
  • Price and grade.
  • Price and bathrooms.
  • Price and age.

These pairs likely represent features that have a significant influence on the price of a house, and I can use them to analyze and make predictions related to real estate data.

3.3 Scenarios and analysis

3.3.1 Scenario 1

Scenario 1: The impact of a house's living space on its price. The figure below illustrates the substantial correlation between the square footage of living space (sqft_living) and house price. The R-squared value of 0.493 means that sqft_living alone accounts for approximately 49.3% of the variation in actual house prices. For a linear regression with a single predictor, this corresponds to a correlation coefficient of about 0.70.
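To make the meaning of R-squared concrete, here is a minimal sketch of a one-predictor least-squares fit. The report does not show its fitting code, so this is a dependency-free illustration rather than the method actually used; `fit_line` and `r_squared` are hypothetical names, and the five (living area, price) pairs are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up (living area, price) pairs for illustration; not taken from the dataset.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 320000, 380000, 520000, 580000]
slope, intercept = fit_line(sqft, price)
print(slope, intercept)                          # fitted price per square foot and baseline
print(r_squared(sqft, price, slope, intercept))  # fraction of price variation explained
```

An R-squared of 1.0 would mean every point lies on the fitted line, while a value near 0.5, as in this scenario, means roughly half of the price variation is left to other factors.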

Figure 16. Price vs sqft_living.

The analysis reveals that within the range of 0 to 6000 square feet of living space (sqft_living), house prices tend to remain relatively stable at an average level. However, as sqft_living approaches nearly 8000 square feet and goes beyond, there is a noticeable and significant upward surge in house prices, and this increase is rarely followed by a decrease. The data illustrates a striking peak in house prices, reaching its highest point near $8,000,000 when sqft_living exceeds 12,000 square feet. This observation underscores the substantial influence of sqft_living on the pricing of houses, with larger living spaces generally leading to higher property values.

Figure 18. Price vs Bathrooms.

This figure illustrates the positive linear relationship between the number of bathrooms and property prices: a house's price tends to increase with the number of bathrooms. Buyers looking for a reasonably priced home might therefore focus on houses with three to five bathrooms.

3.3.4 Scenario 4

Scenario 4: The impact of a house's age on its price.

Figure 19. Price vs age.

When considering only the median price, the trend of higher prices for newer houses holds within the age range of 11 to 50 years. Interestingly, houses older than 50 years increase in value, which could be attributed to their antique or historical appeal. As the correlation table shows, the negative correlation between house age and price is not strong.
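The median-by-age comparison can be sketched as follows. The report does not state how age was derived, so this example assumes age = year of sale minus yr_built; the helper names, the 10-year band width, and the toy rows are all illustrative and not taken from the dataset.

```python
from statistics import median


def house_age(sale_year, yr_built):
    """Age in years at the time of sale (assumed definition)."""
    return sale_year - yr_built


def median_price_by_band(records, band_width=10):
    """Group (age, price) pairs into age bands and take each band's median price."""
    bands = {}
    for age, price in records:
        start = (age // band_width) * band_width   # lower edge of the age band
        bands.setdefault(start, []).append(price)
    return {start: median(prices) for start, prices in sorted(bands.items())}


# Made-up (sale_year, yr_built, price) rows for illustration only.
toy_rows = [
    (2014, 2009, 500000),
    (2015, 2001, 450000),
    (2014, 1999, 430000),
    (2014, 1960, 400000),
    (2015, 1958, 420000),
    (2014, 1920, 600000),
]
records = [(house_age(sale, built), price) for sale, built, price in toy_rows]
print(median_price_by_band(records))   # band start (years) mapped to median price
```

Using the median rather than the mean keeps a few very expensive houses from dominating an age band, which matches the paragraph's focus on median price.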