Data science research, Thesis of Computer Science

University of Dar Es Salaam Computer Science

Prof. Bakari Mohamed

Memo and computer science research

Typology: Thesis

2025/2026

Uploaded on 04/21/2026

horace-orwa 🇹🇿

Data Science Coursework: Sequential Versus Parallel Mobility Data Analysis

Students Name

Institutional Affiliation

Module Director

Date

Signature

Abstract

The exponential growth of mobility datasets has introduced significant computational challenges in processing and analysing large-scale data efficiently. This study investigates the use of parallel computing techniques to improve the performance of analysing mobility data provided by the Bureau of Transportation Statistics (BTS). The analysis focuses on identifying travel patterns, understanding population mobility behaviour, and developing predictive models for travel frequency based on trip distance. Both sequential and parallel processing approaches are implemented, with parallel execution tested using 10 and 20 processors. The results demonstrate that parallel computing significantly reduces execution time compared to sequential methods while maintaining analytical accuracy. Additionally, the findings reveal that most travel activity is concentrated within short distances and that higher-frequency travel patterns are relatively rare. The study concludes that parallel computing is a critical component in modern data science workflows dealing with large datasets.

providers, ensuring high levels of accuracy and representativeness while maintaining user privacy. The dataset includes variables such as dates, population counts, and trip frequency across different distance ranges. These variables enable the analysis of both temporal and spatial mobility patterns (Herwin et al., 2022). The main challenge addressed in this study is the efficient processing of this large dataset while maintaining accuracy and scalability, particularly when performing repeated analyses.

Data Pre-processing

Data pre-processing was conducted to ensure the dataset was clean, consistent, and suitable for analysis. The process involved handling missing values by removing incomplete records and ensuring that all relevant data fields were properly formatted (Dean & Ghemawat, 2008). Date columns were converted into appropriate datetime formats to allow for accurate time-based analysis. Additionally, the dataset was aggregated on a weekly basis to simplify analysis and improve computational efficiency. Feature selection was also carried out to focus on the most relevant variables, including the number of trips, distance categories, and population mobility indicators (Bureau of Transportation Statistics, 2023). These steps ensured that the dataset was structured effectively for both analytical and modelling purposes.

Data Classification

Data classification was performed by grouping trips into distinct distance categories, such as short-distance, medium-distance, and long-distance travel. The classification enabled a clearer understanding of mobility patterns and facilitated more effective analysis (Dean & Ghemawat, 2008). The categorisation process was implemented using logical conditions applied to the dataset, allowing each trip to be assigned to an appropriate category based on its distance (Sevtsuk & Ratti, 2010). The approach improved the interpretability of the data and provided a foundation for subsequent analysis and modelling tasks, particularly in identifying trends across different types of travel behaviour.

Data Modelling

A predictive model was developed to estimate the frequency of travel based on trip distance. A linear regression approach was selected due to its simplicity and effectiveness in identifying relationships between variables (Dean & Ghemawat, 2008). The model used trip distance as the independent variable and the number of trips as the dependent variable. A scatter plot was generated to visualise the relationship between these variables, revealing a moderate correlation between distance and travel frequency. The analysis indicated that shorter distances were associated with higher frequencies of travel, suggesting that most mobility activities are localised (Herwin et al., 2022). Although the model provides useful insights, it is limited by its assumption of linearity, which may not fully capture complex travel patterns.

Model Evaluation

The performance of the predictive model was evaluated using Root Mean Square Error (RMSE) and R-squared (R²) metrics. RMSE provided a measure of the average prediction error, while R² indicated the proportion of variance in the dependent variable explained by the model. The results showed that the model achieved a moderate level of accuracy, with RMSE indicating acceptable error levels and R² demonstrating a reasonable fit to the data. However, the evaluation also highlighted limitations in the model's predictive capability, suggesting that more advanced
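The modelling and evaluation steps described above (a linear fit of trip count against distance, scored with RMSE and R²) can be sketched in plain Python. This is a minimal illustration and not the study's own code: the function names `fit_line`, `rmse`, and `r_squared` are hypothetical, and the distance and trip values are synthetic toy numbers, not figures from the BTS dataset.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic toy values for illustration only (assumed, not from the BTS data).
distance_miles = [1, 3, 5, 10, 25, 50, 100]
trips_millions = [90, 80, 70, 55, 30, 12, 4]

slope, intercept = fit_line(distance_miles, trips_millions)
predicted = [slope * d + intercept for d in distance_miles]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"RMSE={rmse(trips_millions, predicted):.3f}")
print(f"R^2={r_squared(trips_millions, predicted):.3f}")
```

A negative slope on data shaped like this matches the reported pattern that shorter distances see more trips. A straight line fits a sharply decaying relationship only roughly, which is the linearity limitation the text notes.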

Further analysis identified specific dates on which more than 10 million people conducted between 10 and 25 trips, as well as dates where a similar number of people conducted between 50 and 100 trips (Pappalardo et al., 2022). A scatter plot comparison showed that lower trip frequencies were more common and consistent, while higher trip frequencies occurred less frequently. The pattern suggests that extreme travel behaviour is relatively rare and may be influenced by external factors such as holidays or major events.

The predictive modelling analysis confirmed a relationship between trip distance and travel frequency, although the strength of this relationship was moderate (Herwin et al., 2022). Visualisation of travellers by distance categories further reinforced the finding that short-distance travel dominates mobility patterns, while long-distance travel remains comparatively low.

Data Visualisation

Various visualisation techniques were employed to support the analysis and enhance the interpretability of the results. Line graphs were used to illustrate weekly trends in mobility, while scatter plots were used to compare trip frequencies across different categories. Bar charts were utilised to display the distribution of trips across distance ranges. These visualisations provided clear and intuitive representations of the data, enabling a better understanding of mobility patterns and supporting the conclusions drawn from the analysis (Pappalardo et al., 2022). The use of graphical representations also facilitated effective communication of findings to a broader audience.

Discussion and Interpretation

The findings of this study highlight several important aspects of mobility behaviour and data processing techniques. The predominance of short-distance travel indicates that most individuals engage in localised activities, which is consistent with everyday behaviour patterns (Pedregosa et al., 2011). The relatively low occurrence of high-frequency travel suggests that such behaviour is influenced by specific circumstances rather than being a common trend.

The comparison between sequential and parallel processing demonstrates the significant advantages of parallel computing in handling large datasets. By distributing computational tasks across multiple processors, parallel computing reduces execution time and enhances scalability (Pedregosa et al., 2011). However, the presence of overhead costs and diminishing returns highlights the need for careful optimisation when implementing parallel systems.

The limitations of the predictive model also emphasise the importance of selecting appropriate modelling techniques (Herwin et al., 2022). While linear regression provides a useful baseline, more sophisticated models may be required to capture complex relationships within the data.

Conclusion

The study successfully demonstrates the application of data science techniques to analyse large-scale mobility data. The integration of parallel computing significantly improves processing efficiency, making it a valuable approach for big data analysis. The findings reveal that travel behaviour is predominantly localised, with most trips occurring over short distances. Additionally, the predictive modelling approach provides useful insights into the relationship between distance and travel frequency, although further improvements could be achieved using more advanced techniques. The study highlights the importance of combining efficient

References

Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.

Herwin, H., Senen, A., Nurhayati, R., & Dahalan, S. C. (2022, October 10). Improving Student Learning Outcomes through Mobile Assessment: A Trend Analysis. International Journal of Information and Education Technology, 12.

Pappalardo, L., Simini, F., Barlacchi, G., & Pellungrini, R. (2022). scikit-mobility: A Python Library for the Analysis, Generation, and Risk Assessment of Mobility Data. Journal of Statistical Software, 103(4), 1–38.

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Sevtsuk, A., & Ratti, C. (2010, March 30). Does Urban Mobility Have a Daily Routine? Learning from the Aggregate Data of Mobile Networks. Journal of Urban Technology, 17(1), 41- doi:https://doi.org/10.1080/

Appendix

The program begins by loading the dataset and performing data cleaning to remove inconsistencies and missing values. The dataset is then divided into smaller chunks to enable parallel processing. Each chunk is processed independently, after which the results are combined to produce the final output. A regression model is trained using the processed data, and its performance is evaluated using appropriate metrics. Visualisations are generated to support the analysis, and execution times are compared between sequential and parallel approaches to assess performance improvements.
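The chunk, process, and combine flow described in this appendix can be sketched with Python's standard library. This is a hedged illustration under stated assumptions, not the program used in the study: the row layout (a distance category paired with a trip count), every function name, and the synthetic data from `make_rows` are hypothetical, and `ProcessPoolExecutor` is one possible way to spread chunks over worker processes.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def summarise_chunk(chunk):
    """Drop rows with a missing trip count, then total trips per category."""
    totals = {}
    for category, trips in chunk:
        if trips is None:  # cleaning step: skip incomplete records
            continue
        totals[category] = totals.get(category, 0) + trips
    return totals


def combine(partials):
    """Merge the per-chunk totals into one result."""
    merged = {}
    for partial in partials:
        for category, trips in partial.items():
            merged[category] = merged.get(category, 0) + trips
    return merged


def run_sequential(rows):
    """Process the whole dataset in a single pass."""
    return summarise_chunk(rows)


def run_parallel(rows, workers, executor_cls=ProcessPoolExecutor):
    """Process one chunk per worker, then combine the partial results."""
    chunks = split_into_chunks(rows, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(summarise_chunk, chunks))
    return combine(partials)


def timed(func, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def make_rows(n):
    """Synthetic rows for illustration only; every 50th trip count is missing."""
    categories = ["short", "medium", "long"]
    return [(categories[i % 3], None if i % 50 == 0 else i % 7) for i in range(n)]


if __name__ == "__main__":
    rows = make_rows(300_000)
    expected, t_seq = timed(run_sequential, rows)
    print(f"sequential: {t_seq:.3f}s")
    for workers in (10, 20):  # worker counts mirror those tested in the study
        result, t_par = timed(run_parallel, rows, workers)
        assert result == expected  # parallel output must match sequential
        print(f"{workers} workers: {t_par:.3f}s")
```

On a workload this small, process start-up and data transfer can outweigh the gain from extra workers, so the parallel timings may not beat the sequential one. This is the overhead and diminishing-returns effect noted in the discussion.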
