















“Data Analysis Dictionary” is a compact, easy-to-use guide that explains important terms, concepts, and tools used in data analysis. It covers basic definitions such as databases, queries, and visualization as well as advanced topics such as machine learning, predictive analytics, and business intelligence. Written in straightforward language for students, beginners, and professionals, these notes help you quickly understand key concepts, revise before exams or interviews, and build a strong foundation in data analysis.
















A/B testing: The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue
Absolute reference: A reference within a function that is locked so that rows and columns won’t change if the function is copied
Access control: Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet
Accuracy: The degree to which data conforms to the actual entity being measured or described
Action-oriented question: A question whose answers lead to change
Administrative metadata: Metadata that indicates the technical source of a digital asset
Aesthetic (R): A visual property of an object in a plot
Agenda: A list of scheduled appointments
Aggregation: The process of collecting or gathering many separate pieces into a whole
Algorithm: A process or set of rules followed for a specific task
Aliasing: Temporarily naming a table or column in a query to make it easier to read and write
Alternative text: Text that provides an alternative to non-text content, such as images and videos
Analytical skills: Qualities and characteristics associated with using facts to solve problems
Analytical thinking: The process of identifying and defining a problem, then solving it by using data in an organized, step-by-step manner
Annotation: Text that briefly explains data or helps focus the audience on a particular aspect of the data in a visualization
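To illustrate the Aliasing entry above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `customer_purchase_history`, its columns, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchase_history (customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO customer_purchase_history VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# AS gives the table the temporary name h and the summed column
# the temporary name total_spent for the duration of this query.
rows = conn.execute(
    """
    SELECT h.customer_id, SUM(h.amount) AS total_spent
    FROM customer_purchase_history AS h
    GROUP BY h.customer_id
    ORDER BY h.customer_id
    """
).fetchall()
conn.close()

print(rows)  # [(1, 25.5), (2, 12.0)]
```

The aliases exist only inside the query; the table and its columns keep their original names in the database.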
Anscombe’s quartet: Four datasets that have nearly identical summary statistics but contain different plotted values
Area chart: A data visualization that uses individual data points for a changing variable connected by a continuous line with a filled in area underneath
Argument (R): Information needed by a function in R in order to run
Arithmetic operator: An operator used to perform basic math operations such as addition, subtraction, multiplication, and division
Array: A collection of values in spreadsheet cells
Assignment operator: An operator used to assign values to variables and vectors
Attribute: A characteristic or quality of data used to label a column in a table
Audio file: Digitized audio storage usually in an MP3, AAC, or other compressed format
AVERAGE: A spreadsheet function that returns an average of the values from a selected range
AVERAGEIF: A spreadsheet function that returns the average of all cell values from a given range that meet a specified condition

B

Bad data source: A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
Balance: The design principle of creating aesthetic appeal and clarity in a data visualization by evenly distributing visual elements
Bar graph: A data visualization that uses size to contrast and compare two or more values
Bias: A conscious or subconscious preference in favor of or against a person, group of people, or thing
Big data: Large, complex datasets typically involving long periods of time, which enable data analysts to address far-reaching business problems
Boolean data: A data type with only two possible values, usually true or false
Borders: Lines that can be added around two or more cells on a spreadsheet
Box plot: A data visualization that displays the distribution of values along an x-axis
Bubble chart: A data visualization that displays individual data points as bubbles, comparing numeric values by their relative size
Code chunk: A piece of code added in an R Markdown file that is used to process, visualize or analyze data
Coding: The process of writing instructions to a computer in the syntax of a specific programming language
Column chart: A data visualization that uses individual data points for a changing variable, represented as vertical columns
Combo chart: A data visualization that combines more than one visualization type
Compatibility: How well two or more datasets are able to work together
Completeness: The degree to which data contains all desired components or measures
Computer programming: The process of giving instructions to a computer in order to perform an action or set of actions
CONCAT: A SQL function that adds strings together to create new text strings that can be used as unique keys
CONCATENATE: A spreadsheet function that joins together two or more text strings
Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions
Conditional statement: A declaration that if a certain condition holds, then a certain event must take place
Confidence interval: A range of values that conveys how likely a statistical estimate reflects the population
Confidence level: The probability that a sample size accurately reflects the greater population
Confirmation bias: The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Consent: The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it
Consistency: The degree to which data is repeatable from different points of entry or collection
Context: The condition in which something exists or happens
Continuous data: Data that is measured and can have almost any numeric value
CONVERT: A SQL function that changes the unit of measurement of a value in data
Cookie: A small file stored on a computer that contains information about its users
Correlation: The measure of the degree to which two variables change in relationship to each other
COUNT: A spreadsheet function that counts the number of cells within a range that contain numeric values
COUNTA: A spreadsheet function that counts the total number of non-empty values within a specified range
COUNTIF: A spreadsheet function that returns the number of cells within a range that match a specified value
COUNT DISTINCT: A SQL function that returns the number of distinct values in a specified column
CRAN (Comprehensive R Archive Network) (R): An online archive with R packages, source code, manuals, and documentation
CREATE TABLE: A SQL statement that adds a table to a database, where it can be used by multiple people
Cross-field validation: A process that ensures certain conditions for multiple data fields are satisfied
CSS (Cascading Style Sheets): A style sheet language used for web page design that controls graphic elements and page presentation
CSV (comma-separated values) file: A delimited text file that uses a comma to separate values
Currency: The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions

D

Dashboard: A tool that monitors live, incoming data
Data: A collection of facts
Data aggregation: The process of gathering data from multiple sources and combining it into a single, summarized collection
Data analysis: The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making
Data analysis process: The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
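The Correlation entry above does not name a specific measure. One common choice is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below assumes that measure; the function name `pearson_r` and the sample data are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample data.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]      # rises in step with ad_spend
returns = [10, 8, 6, 4, 2]    # falls in step with ad_spend

r_up = pearson_r(ad_spend, sales)
r_down = pearson_r(ad_spend, returns)
print(r_up, r_down)  # 1.0 -1.0
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. Correlation alone does not show that one variable causes the other.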
Data manipulation: The process of changing data to make it more organized and easier to read
Data mapping: The process of matching fields from one data source to another
Data merging: The process of combining two or more datasets into a single dataset
Data model: A tool for organizing data elements and how they relate to one another
Data privacy: Preserving a data subject’s information any time a data transaction occurs
Data range: Numerical values that fall between predefined maximum and minimum values
Data replication: The process of storing data in multiple locations
Data science: A field of study that uses raw data to create new ways of modeling and understanding the unknown
Data security: Protecting data from unauthorized access or corruption by adopting safety measures
Data storytelling: Communicating the meaning of a dataset with visuals and a narrative that are customized for an audience
Data strategy: The management of the people, processes, and tools used in data analysis
Data structure: A format for organizing and storing data
Data transfer: The process of copying data from a storage device to computer memory or from one computer to another
Data type: An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform
Data validation: A tool for checking the accuracy and quality of data
Data validation process: The process of checking and rechecking the quality of data so that it is complete, accurate, secure and consistent
Data visualization: The graphical representation of data
Data warehousing specialist: A professional who develops processes and procedures to effectively store and organize data
Database: A collection of data stored in a computer system
Dataset: A collection of data that can be manipulated or analyzed as one unit
DATEDIF: A spreadsheet function that calculates the number of days, months, or years between two dates
Decision tree: A tool that helps analysts make decisions about critical features of a visualization
Delimiter: A character that indicates the beginning or end of a data item
Density map: A data visualization that represents concentrations, with color representing the number or frequency of data points in a given area on a map
Descriptive metadata: Metadata that describes a piece of data and can be used to identify it at a later point in time
Design thinking: A process used to solve complex problems in a user-centric way
Digital photo: An electronic or computer-based image usually in BMP or JPG format
Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem to be solved
Discrete data: Data that is counted and has a limited number of values
DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only non-duplicate entries
Distribution graph: A data visualization that displays the frequency of various outcomes in a sample
Diverging color palette: A color theme that displays two ranges of data values using two different hues, with color intensity representing the magnitude of the values
Donut chart: A data visualization where segments of a ring represent data values adding up to a whole
dplyr (R): An R package in Tidyverse that offers a consistent set of functions to complete common data-manipulation tasks
DROP TABLE: A SQL statement that removes a table from a database
Duplicate data: Any record that inadvertently shares data with another record
Dynamic visualizations: Data visualizations that are interactive or change over time

E

Elevator pitch: A short statement describing an idea or concept
Emphasis: The design principle of arranging visual elements to focus the audience’s attention on important information in a data visualization
Engagement: Capturing and holding someone’s interest and attention during a data presentation
Function: A preset command that automatically performs a specific process or task using the data in a spreadsheet
Function (R): A body of reusable code for performing specific tasks in R
FWF (fixed-width file): A text file with a specific format, which enables the saving of textual data in an organized fashion

G

GAM (generalized additive model) smoothing (R): A process for smoothing plots with a large number of points
Gantt chart: A data visualization that displays the duration of events or activities on a timeline
Gap analysis: A method for examining and evaluating the current state of a process in order to identify opportunities for improvement in the future
Gauge chart: A data visualization that shows a single result within a progressive range of values
General Data Protection Regulation of the European Union (GDPR): A regulation of the European Union created to help protect people and their data
Geolocation: The geographical location of a person or device by means of digital information
Geom (R): The geometric object used to represent data
ggplot2 (R): An R package in Tidyverse that creates a variety of data visualizations by applying different visual properties to the data variables in R
Good data source: A data source that is reliable, original, comprehensive, current, and cited (ROCCC)
GROUP BY: A SQL clause that groups rows that have the same values from a table into summary rows

H

HAVING: A SQL clause that filters the grouped results of a query rather than the rows of the underlying table, and is typically used with aggregate functions
head() (R): An R function that returns a preview of the column names and the first few rows of a dataset
Header: The first row in a spreadsheet that labels the type of data in each column
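The GROUP BY and HAVING entries above can be seen working together in a short sketch using Python's built-in sqlite3 module. The `orders` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5), ("west", 7)],
)

# GROUP BY collapses the rows into one summary row per region.
# HAVING then keeps only the summary rows whose total exceeds 20.
rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 20
    """
).fetchall()
conn.close()

print(rows)  # [('east', 40)]
```

A WHERE clause would filter individual rows before grouping, whereas HAVING filters the groups after the aggregate has been computed.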
Headline: Text at the top of a visualization that communicates the data being presented
Heat map: A data visualization that uses color contrast to compare categories in a dataset
Highlight table: A data visualization that uses conditional formatting and color on a table
Histogram: A data visualization that shows how often data values fall into certain ranges
HTML (Hypertext Markup Language): The set of markup symbols or codes used to create a webpage
HTML5: A markup language that provides structure for web pages and connects to hosting platforms
Hypothesis: A theory that one might try to prove or disprove with data
Hypothesis testing: A process to determine if a survey or experiment has meaningful results

I

IDE (Integrated Development Environment): A software application that brings together all the tools a data analyst may want to use in a single place
Incomplete data: Data that is missing important fields
Inconsistent data: Data that uses different formats to represent the same thing
Incorrect/inaccurate data: Data that is complete but inaccurate
Inline code: Code that can be inserted directly into the text of an R Markdown file
INNER JOIN: A SQL clause that returns records with matching values in both tables
Inner query: A SQL subquery that is inside of another SQL statement
Internal data: Data that lives within a company’s own systems
Interpretation bias: The tendency to interpret ambiguous situations in a positive or negative way

J

Java: A programming language widely used to create enterprise web applications that can run on multiple clients
JOIN: A SQL clause that is used to combine rows from two or more tables based on a related column
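As a rough illustration of the INNER JOIN entry above, the sketch below joins two hypothetical tables, `customers` and `orders`, with Python's built-in sqlite3 module. One customer has no orders and one order has no matching customer, so neither appears in the result.

```python
import sqlite3

# Hypothetical in-memory tables used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 'lamp'), (11, 2, 'desk'), (12, 4, 'chair');
    """
)

# Only rows whose customer_id matches an existing customers.id are returned:
# Cy (no orders) and the chair order (unknown customer 4) are both dropped.
rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.order_id
    """
).fetchall()
conn.close()

print(rows)  # [('Ana', 'lamp'), ('Ben', 'desk')]
```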
M

Mandatory: A data value that cannot be left blank or empty
Map: A data visualization that organizes data geographically
Mapping (R): The process of matching up a specific variable in a dataset with a specific aesthetic
Margin of error: The maximum amount that sample results are expected to differ from those of the actual population
Markdown (R): A syntax for formatting plain text files
Mark: A visual object in a data visualization such as a point, line, or shape
MATCH: A spreadsheet function used to locate the position of a specific lookup value
Math expression: A calculation that involves addition, subtraction, multiplication, or division (also called an equation)
Math function: A function that is used as part of a mathematical formula
Matrix: A two-dimensional collection of data elements with rows and columns
MAX: A spreadsheet function that returns the largest numeric value from a range of cells
MAXIFS: A spreadsheet function that returns the maximum value from a given range that meets a specified condition
McCandless Method: A method for presenting data visualizations that moves from general to specific information
Measurable question: A question whose answers can be quantified and assessed
Mental model: A data analyst’s thought process and approach to a problem
Mentor: Someone who shares knowledge, skills, and experience to help another grow both professionally and personally
Merger: An agreement that unites two organizations into a single new one
Metadata: Data about data
Metadata repository: A database created to store metadata
Metric: A single, quantifiable type of data that is used for measurement
Metric goal: A measurable goal set by a company and evaluated using metrics
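The Margin of error entry above gives no formula. A common textbook approximation for a sample proportion is z * sqrt(p(1 - p) / n), where z is the critical value for the chosen confidence level. The sketch below assumes that formula, a simple random sample, and a normal approximation; the function name and the survey numbers are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.95):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    if not 0 <= p_hat <= 1:
        raise ValueError("p_hat must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical survey: 50% of 1,000 respondents answered "yes".
moe = margin_of_error(0.5, 1000)
print(round(moe, 3))  # 0.031
```

Under these assumptions, the survey result would be reported as 50% plus or minus about 3.1 percentage points at a 95% confidence level.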
MID: A function that returns a segment from the middle of a text string
MIN: A spreadsheet function that returns the smallest numeric value from a range of cells
MINIFS: A spreadsheet function that returns the minimum value from a given range that meets a specified condition
Modulo: An operator (%) that returns the remainder when one number is divided by another
Movement: The design principle of arranging visual elements to guide the audience’s eyes from one part of a data visualization to another
mutate() (R): An R function that makes changes to a data frame that involve separating and merging columns or creating new variables

N

Naming conventions: Consistent guidelines that describe the content, creation date, and version of a file in its name
Narrative: (Refer to Story)
Nested: Code that performs a particular function and is contained within code that performs a broader function
Nested function: A function that is completely contained within another function
Networking: Building relationships by meeting people both in person and online
Nominal data: A type of qualitative data that is categorized without a set order
Normalized database: A database in which only related data is stored in each table
Notebook: An interactive, editable programming environment for creating data reports and showcasing data skills
Null: An indication that a value does not exist in a dataset

O

Observation: The attributes that describe a piece of data contained in a row of a table
Observer bias: The tendency for different people to observe things differently (also called experimenter bias)
Open data: Data that is available to the public
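A few lines of Python illustrate the Modulo entry above. The variable names are made up for this example. Python is used here only as an illustration; other languages can give a different sign for the remainder when an operand is negative.

```python
# Remainder of 17 divided by 5: 17 = 5 * 3 + 2.
remainder = 17 % 5

# A number is even when dividing it by 2 leaves no remainder.
is_even = (42 % 2 == 0)

# Wrap-around arithmetic: 5 hours after 22:00 on a 24-hour clock.
hour = (22 + 5) % 24

# In Python the result takes the sign of the divisor, so this is 2, not -1.
negative_case = -7 % 3

print(remainder, is_even, hour, negative_case)  # 2 True 3 2
```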
Population: In data analytics, all possible data values in a dataset
Portfolio: A collection of materials that can be shared with potential employers
Pre-attentive attributes: The elements of a data visualization that an audience recognizes automatically without conscious effort
Primary key: An identifier in a database that references a column in which each value is unique (Refer to foreign key)
Problem domain: The area of analysis that encompasses every activity affecting or affected by a problem
Problem types: The various problems that data analysts encounter, including categorizing things, discovering connections, finding patterns, identifying themes, making predictions, and spotting something unusual
Profit margin: A percentage that indicates how many cents of profit have been generated for each dollar of sale
Programming language: A system of words and symbols used to write instructions that computers follow
Proportion: The design principle of using the relative size and arrangement of visual elements to demonstrate information in a data visualization
Python: A general-purpose programming language

Q

Qualitative data: A subjective and explanatory measure of a quality or characteristic
Quantitative data: A specific and objective measure, such as a number, quantity, or range
Query: A request for data or information from a database
Query language: A computer programming language used to communicate with a database

R

R: A programming language used for statistical analysis, visualization, and other data analysis
R Markdown: A file format for making dynamic documents with R
R Notebook: A document for running code and displaying the graphs and charts that visualize the code
Random sampling: A way of selecting a sample from a population so that every possible sample has an equal chance of being chosen
Range: A collection of two or more cells in a spreadsheet
Ranking: A system to position values of a dataset within a scale of achievement or status
readr (R): An R package in Tidyverse used for importing data
Record: A collection of related data in a data table, usually synonymous with row
Redundancy: When the same piece of data is stored in two or more places
Reframing: The process of restating a problem or challenge, then redirecting it toward a potential resolution
Regular expression (RegEx): A rule that says the values in a table must match a prescribed pattern
Relational database: A database that contains a series of tables that can be connected to form relationships
Relational operator: An operator used to compare values, also known as a comparator
Relativity: The process of considering observations in relation or proportion to something else
Relevant question: A question that has significance to the problem to be solved
Remove duplicates: A spreadsheet tool that automatically searches for and eliminates duplicate entries from a spreadsheet
Repetition: The design principle of repeating visual elements to demonstrate meaning in a data visualization
Report: A static collection of data periodically given to stakeholders
Return on investment (ROI): A formula that uses the metrics of investment and profit to evaluate the success of an investment
Revenue: The total amount of income generated by the sale of goods or services
Rhythm: The design principle of creating movement and flow in a data visualization to engage an audience
RIGHT: A function that returns a set number of characters from the right side of a text string
RIGHT JOIN: A SQL clause that will return all records from the right table and only the matching records from the left
Root cause: The reason why a problem occurs
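The Regular expression (RegEx) entry above describes checking values against a prescribed pattern. A minimal sketch with Python's built-in re module follows. The YYYY-MM-DD pattern and the sample values are assumptions chosen for illustration.

```python
import re

# Hypothetical rule: values must look like YYYY-MM-DD
# (four digits, a hyphen, two digits, a hyphen, two digits).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

values = ["2024-01-15", "15/01/2024", "2024-1-5"]

# fullmatch succeeds only when the entire value fits the pattern.
checks = {value: DATE_PATTERN.fullmatch(value) is not None for value in values}

print(checks)
# {'2024-01-15': True, '15/01/2024': False, '2024-1-5': False}
```

The pattern checks only the shape of the text. A value such as 2024-99-99 would still pass, so a separate validation step is needed to confirm that a matching value is a real date.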
Sorting: The process of arranging data into a meaningful order to make it easier to understand, analyze, and visualize
Specific question: A question that is simple, significant, and focused on a single topic or a few closely related ideas
SPLIT: A spreadsheet function that divides text around a specified character and puts each fragment into a new, separate cell
Sponsor: A professional advocate who is committed to moving forward the career of another
Spotlighting: Scanning through data to quickly identify the most important insights
Spreadsheet: A digital worksheet
SQL: (Refer to Structured Query Language)
Stakeholders: People who invest time and resources into a project and are interested in its outcome
Static data: Data that doesn’t change once it has been recorded
Static visualization: A data visualization that does not change over time unless it is edited
Statistical power: The probability that a test of significance will recognize an effect that is present
Statistical significance: The probability that sample results are not due to random chance
Statistics: The study of how to collect, analyze, summarize, and present data
Story: The narrative of a data presentation that makes it meaningful and interesting
String data type: A sequence of characters and punctuation that contains textual information (also called text data type)
Structural metadata: Metadata that indicates how a piece of data is organized and whether it is part of one or more than one data collection
Structured data: Data organized in a certain format such as rows and columns
Structured Query Language: A computer programming language used to communicate with a database
Structured thinking: The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options
Subquery: A SQL query that is nested inside a larger query
SUBSTR: A SQL function that extracts a substring from a string variable
Substring: A subset of a text string
Subtitle: Text that supports a headline by adding context and description
SUM: A spreadsheet function that adds the values of a selected range of cells
SUMIF: A spreadsheet function that adds numeric data based on one condition
Summary table: A table used to summarize statistical information about data
SUMPRODUCT: A function that multiplies arrays and returns the sum of those products
Swift: A programming language for macOS, iOS, watchOS, and tvOS
Symbol map: A data visualization that displays a mark over a given longitude and latitude
Syntax: The predetermined structure of a language that includes all required words, symbols, and punctuation, as well as their proper placement

T

Tableau: A business intelligence and analytics platform that helps people visualize, understand, and make decisions with data
Technical mindset: The ability to break things down into smaller steps or pieces and work with them in an orderly and logical way
Temporary table: A database table that is created and exists temporarily on a database server
Text data type: A sequence of characters and punctuation that contains textual information (also called string data type)
Text string: A group of characters within a cell, most often composed of letters
Third-party data: Data provided from outside sources that did not collect it directly
Tibble (R): A streamlined variation of data frames
Tidy data (R): A way of standardizing the organization of data within R
tidyr (R): An R package in Tidyverse used for data cleaning to make tidy data
Tidyverse (R): A system of packages in R with a common design philosophy for data manipulation, exploration, and visualization
Time-bound question: A question that specifies a timeframe to be studied
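The SUMIF and SUMPRODUCT entries above describe spreadsheet functions. The sketch below shows rough Python equivalents of what they compute, not the spreadsheet functions themselves. The two lists stand in for hypothetical spreadsheet ranges.

```python
# Hypothetical "ranges": units sold and the price per unit for three products.
quantities = [2, 3, 4]
unit_prices = [10, 20, 5]

# Like SUMIF with the condition ">2": add only the quantities greater than 2.
sumif_total = sum(q for q in quantities if q > 2)

# Like SUMPRODUCT: multiply the ranges element by element, then add the products.
sumproduct_total = sum(q * p for q, p in zip(quantities, unit_prices))

print(sumif_total, sumproduct_total)  # 7 100
```

Here the SUMPRODUCT-style total is 2 * 10 + 3 * 20 + 4 * 5 = 100, the total revenue across the three products.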