






Memo and computer science research
Typology: Thesis







Data Science Coursework: Sequential Versus Parallel Mobility Data Analysis
Student's Name
Institutional Affiliation
Module Director
Date
Signature
Abstract
The exponential growth of mobility datasets has introduced significant computational challenges in processing and analysing large-scale data efficiently. This study investigates the use of parallel computing techniques to improve the performance of analysing mobility data provided by the Bureau of Transportation Statistics (BTS). The analysis focuses on identifying travel patterns, understanding population mobility behaviour, and developing predictive models for travel frequency based on trip distance. Both sequential and parallel processing approaches are implemented, with parallel execution tested using 10 and 20 processors. The results demonstrate that parallel computing significantly reduces execution time compared to sequential methods while maintaining analytical accuracy. Additionally, the findings reveal that most travel activity is concentrated within short distances and that higher-frequency travel patterns are relatively rare. The study concludes that parallel computing is a critical component in modern data science workflows dealing with large datasets.
providers, ensuring high levels of accuracy and representativeness while maintaining user privacy. The dataset includes variables such as dates, population counts, and trip frequency across different distance ranges. These variables enable the analysis of both temporal and spatial mobility patterns (Herwin et al., 2022). The main challenge addressed in this study is the efficient processing of this large dataset while maintaining accuracy and scalability, particularly when performing repeated analyses.

Data Pre-processing
Data pre-processing was conducted to ensure the dataset was clean, consistent, and suitable for analysis. The process involved handling missing values by removing incomplete records and ensuring that all relevant data fields were properly formatted (Dean & Ghemawat, 2008). Date columns were converted into appropriate datetime formats to allow for accurate time-based analysis. Additionally, the dataset was aggregated on a weekly basis to simplify analysis and improve computational efficiency. Feature selection was also carried out to focus on the most relevant variables, including the number of trips, distance categories, and population mobility indicators (Bureau of Transportation Statistics, 2023). These steps ensured that the dataset was structured effectively for both analytical and modelling purposes.

Data Classification
Data classification was performed by grouping trips into distinct distance categories, such as short-distance, medium-distance, and long-distance travel. The classification enabled a clearer understanding of mobility patterns and facilitated more effective analysis (Dean & Ghemawat, 2008). The categorisation process was implemented using logical conditions applied to the dataset, allowing each trip to be assigned to an appropriate category based on its distance (Sevtsuk & Ratti, 2010). The approach improved the interpretability of the data and provided a foundation for subsequent analysis and modelling tasks, particularly in identifying trends across different types of travel behaviour.

Data Modelling
A predictive model was developed to estimate the frequency of travel based on trip distance. A linear regression approach was selected due to its simplicity and effectiveness in identifying relationships between variables (Dean & Ghemawat, 2008). The model used trip distance as the independent variable and the number of trips as the dependent variable. A scatter plot was generated to visualise the relationship between these variables, revealing a moderate correlation between distance and travel frequency. The analysis indicated that shorter distances were associated with higher frequencies of travel, suggesting that most mobility activities are localised (Herwin et al., 2022). Although the model provides useful insights, it is limited by its assumption of linearity, which may not fully capture complex travel patterns.

Model Evaluation
The performance of the predictive model was evaluated using Root Mean Square Error (RMSE) and R-squared (R²) metrics. RMSE provided a measure of the average prediction error, while R² indicated the proportion of variance in the dependent variable explained by the model. The results showed that the model achieved a moderate level of accuracy, with RMSE indicating acceptable error levels and R² demonstrating a reasonable fit to the data. However, the evaluation also highlighted limitations in the model’s predictive capability, suggesting that more advanced
Further analysis identified specific dates on which more than 10 million people made between 10 and 25 trips, as well as dates on which a similar number of people made between 50 and 100 trips (Pappalardo et al., 2022). A scatter plot comparison showed that lower trip frequencies were more common and consistent, while higher trip frequencies occurred less often. The pattern suggests that extreme travel behaviour is relatively rare and may be influenced by external factors such as holidays or major events. The predictive modelling analysis confirmed a relationship between trip distance and travel frequency, although the strength of this relationship was moderate (Herwin et al., 2022). Visualisation of travellers by distance category further reinforced the finding that short-distance travel dominates mobility patterns, while long-distance travel remains comparatively low.

Data Visualisation
Various visualisation techniques were employed to support the analysis and enhance the interpretability of the results. Line graphs were used to illustrate weekly trends in mobility, while scatter plots were used to compare trip frequencies across different categories. Bar charts were used to display the distribution of trips across distance ranges. These visualisations provided clear and intuitive representations of the data, enabling a better understanding of mobility patterns and supporting the conclusions drawn from the analysis (Pappalardo et al., 2022). The use of graphical representations also facilitated effective communication of findings to a broader audience.

Discussion and Interpretation
The findings of this study highlight several important aspects of mobility behaviour and data processing techniques. The predominance of short-distance travel indicates that most individuals engage in localised activities, which is consistent with everyday behaviour patterns (Pedregosa et al., 2011). The relatively low occurrence of high-frequency travel suggests that such behaviour is influenced by specific circumstances rather than being a common trend. The comparison between sequential and parallel processing demonstrates the significant advantages of parallel computing in handling large datasets. By distributing computational tasks across multiple processors, parallel computing reduces execution time and enhances scalability (Pedregosa et al., 2011). However, the presence of overhead costs and diminishing returns highlights the need for careful optimisation when implementing parallel systems. The limitations of the predictive model also emphasise the importance of selecting appropriate modelling techniques (Herwin et al., 2022). While linear regression provides a useful baseline, more sophisticated models may be required to capture complex relationships within the data.

Conclusion
The study successfully demonstrates the application of data science techniques to analyse large-scale mobility data. The integration of parallel computing significantly improves processing efficiency, making it a valuable approach for big data analysis. The findings reveal that travel behaviour is predominantly localised, with most trips occurring over short distances. Additionally, the predictive modelling approach provides useful insights into the relationship between distance and travel frequency, although further improvements could be achieved using more advanced techniques. The study highlights the importance of combining efficient
References
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Herwin, H., Senen, A., Nurhayati, R., & Dahalan, S. C. (2022, October 10). Improving Student Learning Outcomes through Mobile Assessment: A Trend Analysis. International Journal of Information and Education Technology, 12.
Pappalardo, L., Simini, F., Barlacchi, G., & Pellungrini, R. (2022). scikit-mobility: A Python Library for the Analysis, Generation, and Risk Assessment of Mobility Data. Journal of Statistical Software, 103(4), 1–38.
Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Sevtsuk, A., & Ratti, C. (2010, Mar 30). Does Urban Mobility Have a Daily Routine? Learning from the Aggregate Data of Mobile Networks. Journal of Urban Technology, 17(1), 41-
Appendix
The program begins by loading the dataset and performing data cleaning to remove inconsistencies and missing values. The dataset is then divided into smaller chunks to enable parallel processing. Each chunk is processed independently, after which the results are combined to produce the final output. A regression model is trained using the processed data, and its performance is evaluated using appropriate metrics. Visualisations are generated to support the analysis, and execution times are compared between sequential and parallel approaches to assess performance improvements.
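
The program itself is not reproduced here, so the following is only a minimal sketch of the workflow described above, not the study's actual code. It assumes a tiny, made-up list of (trip distance, number of trips) pairs in place of the BTS dataset, and every function name is hypothetical. Each chunk is reduced to the partial sums that ordinary least squares needs, so the combined result is identical whether the chunks are processed one after another or by a pool of workers. A thread pool from the standard library is used so the sketch runs anywhere; for CPU-bound work in CPython, a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the more likely choice.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows standing in for the mobility data:
# (trip distance, number of trips). None marks a missing value.
RAW_ROWS = [(1, 950), (3, 800), (5, None), (10, 610), (25, 400),
            (50, 260), (None, 120), (100, 150), (250, 60), (500, 20)]


def clean(rows):
    """Remove incomplete records."""
    return [r for r in rows if None not in r]


def split_into_chunks(rows, n_chunks):
    """Divide the rows into at most n_chunks contiguous chunks."""
    size = math.ceil(len(rows) / n_chunks)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sums(chunk):
    """Per-chunk sums for least squares; needs no data from other chunks."""
    return (len(chunk),
            sum(x for x, _ in chunk),
            sum(y for _, y in chunk),
            sum(x * x for x, _ in chunk),
            sum(x * y for x, y in chunk))


def combine(parts):
    """Add the per-chunk sums column by column."""
    return tuple(sum(column) for column in zip(*parts))


def run(rows, n_workers):
    """Process all chunks with n_workers workers and time the run."""
    start = time.perf_counter()
    chunks = split_into_chunks(rows, n_workers)
    if n_workers == 1:
        parts = [partial_sums(c) for c in chunks]  # sequential baseline
    else:
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            parts = list(pool.map(partial_sums, chunks))
    return combine(parts), time.perf_counter() - start


def fit_line(n, sx, sy, sxx, sxy):
    """Closed-form simple linear regression from the combined sums."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


def evaluate(rows, slope, intercept):
    """Return RMSE and R-squared of the fitted line on the given rows."""
    mean_y = sum(y for _, y in rows) / len(rows)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    ss_tot = sum((y - mean_y) ** 2 for _, y in rows)
    return math.sqrt(ss_res / len(rows)), 1 - ss_res / ss_tot


rows = clean(RAW_ROWS)
seq_totals, seq_time = run(rows, 1)
par_totals, par_time = run(rows, 4)
slope, intercept = fit_line(*par_totals)
rmse, r2 = evaluate(rows, slope, intercept)
print(f"sequential: {seq_time:.6f}s, parallel: {par_time:.6f}s")
print(f"slope={slope:.3f}, intercept={intercept:.1f}, RMSE={rmse:.1f}, R2={r2:.3f}")
```

On an input this small, the cost of starting workers outweighs any gain, which illustrates the overhead noted in the discussion. The timings printed by this sketch say nothing about the 10- and 20-processor results reported in the study.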