



An assignment focused on applying K-means clustering to analyze global development patterns using the World Development Indicators (WDI) dataset. It covers data preprocessing challenges, model optimization strategies, and result interpretation. The assignment requires handling real-world data preprocessing, evaluating and optimizing clustering algorithms, assessing cluster quality, applying dimensionality reduction techniques, and critically analyzing design decisions. The document also includes details on the dataset, K-means clustering and initialization, convergence criteria, determining the optimal K, dimensionality reduction with PCA, and cluster interpretation. It is designed to enhance students' understanding of unsupervised learning and practical data analysis skills.




In this assignment, you will apply K-means clustering to analyze global development patterns using the World Development Indicators (WDI) dataset. The assignment will focus on the practical challenges of applying clustering to real-world data, model optimization strategies, and interpretation of results. In addition to submitting the code for reproducing your analysis, you need to submit a report that documents your model design decisions and analyses the results.
Upon completion of this assignment, you should be able to:
The World Development Indicators database from the World Bank contains over 1,400 time series indicators for 220 countries and territories from 1960 to present. The dataset contains key development indicators such as:
Load the WDI dataset and prepare it for clustering. You will see that the data on some indicators are missing for some countries or not available for each of the years covered by the dataset as a whole. You must implement an approach to handle missing values: For each country, only include years where data is available for a sufficiently high proportion of the features. Also remove features that have missing values for too many data points (remember that each data point represents all the features for one country/year). You will need to test different thresholds for these two strategies to find a good balance between coverage and completeness of the remaining data. For the remaining missing values, apply and justify an appropriate data imputation strategy (i.e. estimate the missing values), e.g. based on the mean of related available values. Also consider normalizing the feature values, e.g. between 0 and 1 or -1 and 1, as that may lead to better clustering performance. You may use a library such as Pandas for data loading and preprocessing. Conduct exploratory data analysis by visualizing the data. To do this it is recommended to use t-SNE, a dimensionality reduction method that is particularly suitable for visualizing high-dimensional data. You may use an existing implementation of t-SNE, for example the one in scikit-learn. Document the impact of each preprocessing decision on the final dataset size and characteristics, providing clear justification for your choices.
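The filtering, imputation, and scaling steps described above could be sketched roughly as follows. This is only an illustration, not the required approach: the function name `preprocess`, the threshold values, the order of the two filters, and the assumed layout (one row per country/year, one numeric column per indicator) are all assumptions, and feature-mean imputation is just one of several defensible strategies.

```python
import numpy as np
import pandas as pd

def preprocess(df, row_thresh=0.7, col_thresh=0.7):
    """Filter sparse rows and columns, impute, and min-max scale.

    Assumed layout: one row per country/year, one numeric column per indicator.
    The thresholds are placeholders to be tuned, not recommended values.
    """
    # Keep country/year rows with data for at least row_thresh of the features.
    df = df.loc[df.notna().mean(axis=1) >= row_thresh]
    # Keep features observed in at least col_thresh of the remaining rows.
    df = df.loc[:, df.notna().mean(axis=0) >= col_thresh]
    # Impute the remaining gaps with the feature mean (one simple option).
    df = df.fillna(df.mean())
    # Scale each feature to [0, 1]; constant features map to 0.
    span = (df.max() - df.min()).replace(0, 1)
    return (df - df.min()) / span

# Toy example with hypothetical indicator names.
toy = pd.DataFrame({
    "gdp": [1.0, 2.0, np.nan, 4.0],
    "life_exp": [10.0, np.nan, np.nan, 40.0],
    "literacy": [np.nan, np.nan, np.nan, 1.0],
})
clean = preprocess(toy, row_thresh=0.5, col_thresh=0.7)
print(toy.shape, "->", clean.shape)
```

Printing the shape before and after each step, as in the last line, is one way to document how each threshold affects the size of the final dataset.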
Implement K-means clustering in Python using NumPy data structures; you may not use an implementation from an existing package such as scikit-learn for this. Implement and compare multiple initialization strategies to understand their effects on the resulting clustering. Start with random initialization, running the algorithm multiple times with different random seeds to assess the variability in results. Then implement K-means++ initialization, which selects initial centers that are far apart, and analyze how this affects convergence speed and final cluster quality. For each initialization method, track and visualize the convergence behavior across multiple runs, recording metrics such as the number of iterations to convergence, the final loss function value, and the stability of cluster assignments across runs. Use this information to make a data-driven recommendation for the best initialization strategy for this dataset. Optional: Design and implement a custom initialization strategy based on the specific characteristics of the data. For instance, you might select initial centers based on countries representing different development stages or geographic regions.
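As a starting point, the two initialization strategies and the main loop might look something like the sketch below. The function names, the default `tol`, and the stopping rule (total center movement below `tol`) are assumptions made for illustration; the brief does not prescribe a particular convergence criterion. The sketch also assumes `X` is a 2-D float array with at least `k` rows.

```python
import numpy as np

def init_random(X, k, rng):
    """Pick k distinct data points uniformly at random as initial centers."""
    return X[rng.choice(len(X), size=k, replace=False)]

def init_kmeans_pp(X, k, rng):
    """K-means++: draw each new center with probability proportional to the
    squared distance from the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=-1).min(axis=1)
        # Fall back to uniform sampling if every point coincides with a center.
        p = d2 / d2.sum() if d2.sum() > 0 else None
        centers.append(X[rng.choice(len(X), p=p)])
    return np.array(centers)

def kmeans(X, k, init=init_random, seed=0, max_iter=100, tol=1e-6):
    """Lloyd's algorithm. Returns (centers, labels, loss_history)."""
    rng = np.random.default_rng(seed)
    centers = init(X, k, rng)
    history = []
    for _ in range(max_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Loss: sum of squared distances to the assigned centers.
        history.append(d2[np.arange(len(X)), labels].sum())
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:  # assumed stopping rule
            break
    return centers, labels, history
```

Calling `kmeans(X, k, init=init_kmeans_pp, seed=s)` for several values of `s` and comparing `len(history)` and `history[-1]` with the random-initialization runs gives the iteration counts and final loss values the task asks you to record.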
Optional: Analyze differences in clustering quality using the Silhouette score and compare computational efficiency (time and iterations to convergence).
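For reference, the Silhouette score can be computed directly in NumPy. The sketch below is written under a few assumptions: Euclidean distance, at least two clusters, and a dataset small enough that the full n-by-n distance matrix fits in memory. The function name `silhouette` is hypothetical, and whether a library implementation may be used instead is not stated in the brief.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient (Euclidean). Needs at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    # Pairwise distances via the Gram matrix, avoiding an (n, n, d) array.
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    np.fill_diagonal(D, 0.0)
    clusters = np.unique(labels)
    # sums[i, c] = total distance from point i to all members of cluster c.
    sums = np.stack([D[:, labels == c].sum(axis=1) for c in clusters], axis=1)
    sizes = np.array([(labels == c).sum() for c in clusters])
    idx = np.searchsorted(clusters, labels)  # column of each point's own cluster
    rows = np.arange(n)
    own = sizes[idx]
    # a: mean distance to the other members of the point's own cluster.
    a = sums[rows, idx] / np.maximum(own - 1, 1)
    # b: smallest mean distance to the members of any other cluster.
    mean_other = sums / sizes
    mean_other[rows, idx] = np.inf
    b = mean_other.min(axis=1)
    denom = np.maximum(a, b)
    denom[denom == 0] = 1.0
    # Points in singleton clusters get a score of 0 by convention.
    s = np.where(own > 1, (b - a) / denom, 0.0)
    return float(s.mean())
```

Values close to 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping ones. For the efficiency comparison, wrapping each clustering run in `time.perf_counter()` calls is one simple way to measure wall-clock time.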
The ultimate goal of clustering is to discover meaningful patterns in the data. Provide a comprehensive interpretation of your final clusters. One way to do this is to identify representative countries that exemplify each cluster's characteristics. Develop cluster "personas" that capture the essential development patterns of each group, giving them descriptive names like "Emerging Industrial Economies" or "Resource-Rich Developing Nations." Compare your clusters against the country classifications given by the Human Development Index (also included in the provided data), discussing agreements and discrepancies. Optional: Analyze how countries transition between clusters over time. Conduct temporal analysis to understand development trajectories, identifying countries that have moved between clusters and what changes drove these transitions. Identify surprising or counterintuitive groupings and provide potential explanations for these unexpected associations.
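Two of the interpretation steps above lend themselves to short helpers: finding the data points closest to each cluster center, and cross-tabulating cluster labels against HDI categories. The sketch below is illustrative only. Both function names are hypothetical, and it assumes that `names` holds one label per row of `X` (for example a country/year string) and that the HDI classification is available as one category per row.

```python
import numpy as np
import pandas as pd

def representatives(X, labels, centers, names, top=3):
    """For each cluster, list the `top` row names closest to its center."""
    out = {}
    for j, center in enumerate(centers):
        members = np.flatnonzero(labels == j)
        dist = np.linalg.norm(X[members] - center, axis=1)
        out[j] = [names[i] for i in members[np.argsort(dist)[:top]]]
    return out

def compare_with_hdi(labels, hdi_category):
    """Contingency table: rows are clusters, columns are HDI categories."""
    return pd.crosstab(pd.Series(labels, name="cluster"),
                       pd.Series(hdi_category, name="hdi"))
```

A cluster whose row in the contingency table is concentrated in a single HDI category agrees with the HDI grouping, while a row spread across several categories points to a discrepancy worth discussing in the report.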
Implement at least two extensions to demonstrate advanced understanding and creativity. Below are some options, but you may also propose your own extensions. The optional functionality listed in the descriptions above will also be scored under this section. Options:
memory usage and runtime between your optimized and baseline implementations across different dataset sizes.
Submit a single compressed archive named with your student number, e.g. GWRBRA001.tar.xz. The code should be submitted as a Jupyter notebook with clear documentation. Version control with git is recommended but not required. The code should be able to run with only the provided data and standard libraries such as numpy, scikit-learn and matplotlib (check with a tutor if you are unsure about using other libraries). A README file should include setup instructions and a description of each submitted file. Do not submit any data. Also include a report of at most 6 pages (submitted as a PDF document) that describes your design decisions and reports and discusses your results, including visualizations. Marks will be given primarily for showing understanding of the concepts being applied and demonstrating the ability to develop appropriate models and to analyse the results. Please ensure that your tarball works and is not corrupt (you can check this by downloading your submission and extracting the contents of your tarball; make this a habit!). Corrupt or non-working tarballs will not be marked, with no exceptions. A 10% penalty will be incurred for a submission that is one day late (handed in within 24 hours after the deadline). Any submission more than one day late will be given a mark of 0%.
All code and analysis must be your own work. You may use standard libraries (in line with the specification of what is permitted for the different components of the assignment) and reference documentation, but must cite all sources. You may discuss your work with other students, but code sharing is prohibited. Use of AI tools must be declared and limited to debugging assistance only. AI assistance is not permitted for writing the report.
Category                   Marks
Data Preprocessing         6
Initialization Strategies  6
Convergence Criteria       3
Optimal K Selection        3
PCA                        5
Cluster Interpretation     3
Code Quality               3
Overall Report Quality     3