















































This thesis describes the creation of a database to analyze digital data from secondary-storage devices, to find connections between them, and to reduce processing time for forensic analysis. It covers the motivation behind the research, the growing amount of data, current challenges in forensic analysis, and methods for identifying duplicated material. It also explores statistical information and data structures that can be used to improve analysis efficiency.
by Jennifer M. Johnson September 2017
Thesis Advisor: Neil C. Rowe Second Reader: Michael R. McCarrin
Approved for public release; distribution is unlimited
Jennifer M. Johnson Civilian, Department of Defense B.S., San José State University, 2005
Submitted in partial fulfillment of the requirements for the degree of
from the NAVAL POSTGRADUATE SCHOOL September 2017
Approved by: Neil C. Rowe Thesis Advisor
Michael R. McCarrin Second Reader
Peter J. Denning Chair, Department of Computer Science
We have created a database to analyze digital data to find connections between two or more secondary-storage devices. We used MongoDB and created a document for each secondary-storage image and each unique sector. Ingesting the secondary-storage images took so much time that we had to carefully consider the reasons for the slowdown and experiment with different ways to insert the data. Using a partial database, we found the fraction of space that is empty (contains nulls), per secondary-storage image and for the entire database. We also found duplicate images. Future students may continue to grow the database. Rather than making a completed database the goal, students will analyze the current data and add to the database.
List of Figures
Figure 2.1 Partition Table Layout, mmls Command Output
Figure 2.2 Example of SQL table
Figure 3.1 Four pieces of useful information
Figure 3.2 The _id command used to identify each image in MongoDB
Figure 3.3 Sector layer schema for MongoDB
Figure 3.4 MongoDB Command
Figure 3.5 Histogram of times for inserting secondary-storage images smaller than 500 Mb into the database
Figure 3.6 Inserting secondary-storage images that are smaller than approximately 500 Mb
Figure 4.1 A MongoDB Command to find most common MD5 hash
Figure 4.2 Most common hash with about 980 images inserted
List of Acronyms and Abbreviations
B bytes
CPU central processing unit
CS Computer Science
DEEP Digital Evaluation and Exploitation
DoD Department of Defense
EWF Expert Witness Compression Format
FBI Federal Bureau of Investigation
GB gigabytes
GPU graphics processing unit
KiB kibibyte
MB megabytes
MD5 message digest 5
ME Mechanical Engineer
NSF National Science Foundation
NIST National Institute of Standards and Technology
NSRL National Software Reference Library
NTFS New Technology File System
NUS non-United States
RAM random access memory
RDC Real Data Corpus
SHA-1 secure hash algorithm 1
SFS Scholarship For Service
SQL structured query language
TB terabytes
RCFL Regional Computer Forensics Laboratory
TSK The Sleuth Kit
CHAPTER 1: Introduction
1.1 The Problem and Motivation
We address two problems. The first is managing large-scale heterogeneous digital-forensic data. The second is finding a forensic connection between two or more secondary-storage devices. The National Institute of Standards and Technology (NIST) defines digital forensics as “the application of science to the identification, collection, examination, and analysis of data while preserving the integrity of the information and maintaining a strict chain of custody for the data” [1].
The growing amount of data is our motivation. In recent years, the per-gigabyte price of storage has been steadily decreasing [2]. It is now common for consumers to purchase terabytes of digital storage. As a consequence, law enforcement agencies and cyber divisions in the Department of Defense (DoD) have acquired terabytes of data while collecting criminal evidence. According to the annual reports of the Regional Computer Forensics Laboratory (RCFL) program, established by the FBI, the Chicago lab, just one of the 15 labs, collected and processed 580 TB of digital data in one year [3].
Currently, examiners process data on secondary-storage images drive by drive, using forensic tools designed to run on a single workstation. Each drive is considered separately, and little work is done to correlate information across different images. From an analyst’s perspective, this approach means important information may be missed. For example, there is no organized effort to detect collaboration or communication between owners of devices acquired at different times. Likewise, little has been done to study large-scale patterns in acquired data. Studying trends in data may offer insight into longstanding forensic analysis problems. Carving deleted files, for example, is a longstanding forensic problem because it can be time-intensive. File carving is the method of detecting a file signature and then extracting the data associated with it [4].
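As an illustration of the file-carving definition above, the following is a minimal sketch rather than the method used in this thesis. It assumes the target files are JPEGs, which begin with the bytes FF D8 FF and end with FF D9, and that each file is stored contiguously. The function name carve_jpegs is hypothetical. Practical carvers must also handle fragmentation and false matches, which this sketch ignores.

```python
from typing import List

JPEG_HEADER = b"\xff\xd8\xff"  # start-of-image marker plus first byte of the next marker
JPEG_FOOTER = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(data: bytes) -> List[bytes]:
    """Return each byte range that starts at a JPEG header and ends at the next footer.

    Assumption: files are contiguous and unfragmented, so the first footer
    found after a header is treated as the end of that file.
    """
    carved = []
    pos = 0
    while True:
        start = data.find(JPEG_HEADER, pos)
        if start == -1:
            break  # no more signatures in the remaining data
        end = data.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:
            break  # header without a footer: nothing to extract
        end += len(JPEG_FOOTER)
        carved.append(data[start:end])
        pos = end  # resume scanning after the carved range
    return carved


if __name__ == "__main__":
    fake_jpeg = JPEG_HEADER + b"\xe0payload" + JPEG_FOOTER
    blob = bytes(16) + fake_jpeg + b"junk" + fake_jpeg + bytes(8)
    print(len(carve_jpegs(blob)), "candidate files carved")
```

The scan visits every byte of the input, which is one reason carving an entire drive can be time-intensive.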
A tactic that can reduce the processing time required for file carving is matching blocks that reside in allocated space with those blocks in unallocated space. Allocation means the
Category two is comprised of features that require more extensive analysis to measure:
To gather statistical information on all the secondary-storage images in the non-United States (NUS) portion of the RDC, we first need to create a database for our analysis. The work has two steps: step 1a is building the database, and step 1b is the analysis. The NUS portion of the RDC contains 124,104,544,671,744 bytes (B) of data. An important research question is how long it will take to build a database of sector hashes.
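To give a sense of the scale behind this question, the sketch below hashes an image sector by sector and reports how many sectors are entirely null. It is an illustration under stated assumptions, not the thesis implementation: the 512-byte sector size, the function names, and the throughput passed to estimated_days are all hypothetical, and the in-memory Counter stands in for a database of unique sector hashes.

```python
import hashlib
from collections import Counter

NUS_BYTES = 124_104_544_671_744  # size of the NUS portion of the RDC, from the text
SECTOR_SIZE = 512                # assumption: 512-byte sectors


def sector_hashes(image: bytes, sector_size: int = SECTOR_SIZE):
    """Yield the MD5 hex digest of each full sector in the image."""
    for offset in range(0, len(image) - sector_size + 1, sector_size):
        yield hashlib.md5(image[offset:offset + sector_size]).hexdigest()


def summarize(image: bytes, sector_size: int = SECTOR_SIZE):
    """Return (hash -> occurrence count, fraction of sectors that are all null bytes)."""
    counts = Counter(sector_hashes(image, sector_size))
    total = sum(counts.values())
    null_hash = hashlib.md5(bytes(sector_size)).hexdigest()
    null_fraction = counts[null_hash] / total if total else 0.0
    return counts, null_fraction


def estimated_days(total_sectors: int, sectors_per_second: float) -> float:
    """Days needed to process total_sectors at a given (hypothetical) rate."""
    return total_sectors / sectors_per_second / 86_400


if __name__ == "__main__":
    total_sectors = NUS_BYTES // SECTOR_SIZE
    print(f"{total_sectors:,} sectors at {SECTOR_SIZE} bytes each")
    # The rate below is a placeholder, not a measured value.
    print(f"{estimated_days(total_sectors, 100_000):.1f} days at 100,000 sectors/s")
    counts, null_fraction = summarize(bytes(1024) + b"A" * 512)
    print(len(counts), "unique sectors;", f"{null_fraction:.2f} null")
```

Under the 512-byte assumption, the corpus divides evenly into 242,391,688,812 sectors, so even a small per-sector cost adds up to a long build time.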
1.4 Thesis Structure
In Chapter 2, we cover the background and related work. In Chapter 3, we discuss the methodology. In Chapter 4, we discuss our results. In Chapter 5, we discuss our conclusions and future work.