














COMPTIA DATA+ CERTIFICATION EVALUATION EXAM Q&A: 2026 STUDY GUIDE 100% CORRECT
Typology: Exams















◍ HTML. Answer: Formatting tags enclosed in angle brackets (<>)
◍ Ident & Auth. Answer: Separate steps: identification claims an identity, authentication proves it
◍ JSON. Answer: Key-value pairs
Temp Table - table not kept outside of the session; used for running statistics on it during that session
◍ Query Execution Plan. Answer: SQL is declarative (it states what to do); the execution plan is the sequence of steps for how the DBMS runs the query, with speed varying based on optimization by a human or by the DBMS
◍ Correlation Coefficient. Answer: a statistical index of the relationship between two things (from -1 to +1), computed via statistical apps
◍ Sample. Answer: random subset of a population that is representative of the population
◍ Sample Standard Deviation. Answer: square root of { sum of [(sample value minus sample average) squared] divided by (number of samples less one) }. Used as an approximation of the population standard deviation
◍ T-Test. Answer: Compares mean values of a continuous variable between 2 categories/groups.
◍ Chi-Square Test. Answer: hypothesis testing method for whether your data is as expected. If you have a single categorical variable, you use a Chi-square goodness of fit test. If you have two categorical variables, you use a Chi-square test of independence.
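The sample standard deviation and correlation coefficient cards above can be checked by hand. The following is a minimal sketch using only the Python standard library; the function names and the `hours`/`scores` data are made up for illustration and are not part of the exam material.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

def pearson_r(xs, ys):
    """Pearson correlation coefficient, always between -1 and +1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(var_x * var_y)

# Hypothetical illustration data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

print(sample_std(scores))        # hand-rolled formula
print(statistics.stdev(scores))  # standard library equivalent (also divides by n - 1)
print(pearson_r(hours, scores))  # close to +1: strong positive relationship
```

Dividing by n - 1 rather than n is what distinguishes the sample statistic from the population one; `statistics.pstdev` gives the population version.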
◍ Extract, load, transform (ELT). Answer: An alternative to ETL used with data lakes, where the data is not transformed on entry to the data lake but stored in its original raw format
◍ Delta Load (incremental load). Answer: the delta between target and source data is loaded at regular intervals. The last extract date is stored so that only records added after this date are loaded. Incremental loads come in two flavors that vary based on the volume of data you're loading:
- Streaming incremental load: better for loading small data volumes
- Batch incremental load: better for loading large data volumes
Understand the time available for performing delta loads into your data warehouse. Regardless of how long your batch window is, think carefully about moving current data into the data warehouse without losing history.
◍ Non-Parametric Data. Answer: Data that does not fit a known or well-understood distribution. Usually ordinal or interval data. For real-valued data, nonparametric statistical methods are required in applied machine learning when you are trying to make claims on data that does not fit the familiar Gaussian distribution.
◍ Normalize Ratings Data. Answer: normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging
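The delta load card above describes storing the last extract date and loading only newer records. A minimal sketch of that idea follows, assuming each record carries a `created_at` timestamp; the field name, the `delta_load` function, and the sample rows are hypothetical.

```python
from datetime import datetime

def delta_load(source_rows, target_rows, last_extract):
    """Append only source rows added after last_extract; return the new watermark."""
    new_rows = [row for row in source_rows if row["created_at"] > last_extract]
    # Existing target rows are left untouched, so history is not lost.
    target_rows.extend(new_rows)
    if new_rows:
        last_extract = max(row["created_at"] for row in new_rows)
    return last_extract

# Hypothetical source system and warehouse contents.
source = [
    {"id": 1, "created_at": datetime(2024, 1, 1)},
    {"id": 2, "created_at": datetime(2024, 1, 2)},
    {"id": 3, "created_at": datetime(2024, 1, 3)},
]
warehouse = [source[0]]            # row 1 was loaded in an earlier run
watermark = datetime(2024, 1, 1)   # stored last extract date

watermark = delta_load(source, warehouse, watermark)
print([row["id"] for row in warehouse], watermark)
```

Running the load a second time with the updated watermark adds nothing, which is the property that makes incremental loads cheap compared with a full reload.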
◍ Descriptive Statistical Methods. Answer: • Measures of central tendency
◍ P-Value. Answer: The probability of obtaining results at least as extreme as those observed if the null hypothesis is true, that is, how readily the results could be attributed to chance. The p-value is the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis. When the p-value is less than or equal to the chosen significance level (commonly 0.05), you should reject the null hypothesis. Alternatively, when the p-value is greater than 0.05, you should retain the null hypothesis as there is not enough statistical evidence to accept the alternative hypothesis. Remember this saying: "When the p is low, the null must go. When the p is high, the null must fly!"
◍ Hypothesis Testing. Answer: a decision-making process for evaluating claims about a population. When hypothesis testing, the null and alternative hypotheses describe the effect in terms of the total population. To perform the hypothesis test itself, you need sample data to make inferences about characteristics of the overall population.
◍ Type I error (alpha). Answer: false positive
- is the incorrect rejection of a true null hypothesis
- maximum probability is set in advance as alpha
- is not affected by sample size, as it is set in advance
- increases with the number of tests or end points (i.e. run 20 tests where H0 is true and about 1 is expected to be wrongly significant for alpha = 0.05)
◍ Type II error (beta). Answer: false negative
- is the failure to reject a false null hypothesis (incorrect acceptance of the null hypothesis)
- probability is beta
- beta depends upon sample size and alpha
- can't be estimated except as a function of the true population effect
- beta gets smaller as the sample size gets larger
- beta gets smaller as the number of tests or end points increases
◍ Simple Linear Regression. Answer: linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable and finds a linear function that, as accurately as possible, predicts the dependent variable values as a function of the independent variable.
◍ Correlation. Answer: A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other.
◍ Trend Analysis. Answer: hypothetical extension of a past series of events into the future
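The simple linear regression card above can be made concrete with ordinary least squares, one common way to find the "as accurately as possible" line. This is a sketch under that assumption; the `fit_line` name and the data points are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(slope * 5 + intercept)  # predicted y for a new x of 5
```

Extending the fitted line to x values beyond the observed range is the kind of extrapolation the trend analysis card describes, and it inherits the same caveat: the prediction is hypothetical.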
◍ Report Cover Page. Answer: - Instructions
Consistency - reliability of an attribute. Data consistency typically comes into play in large organizations that store the same data in multiple systems. Considering data consistency is especially important when designing a data warehouse as it sources data from multiple systems.
Integrity/Validity - indicates whether or not an attribute's value is within an expected range. One way to ensure data validity is to enforce referential integrity in the database.
Uniqueness - describes whether or not a data attribute exists in multiple places within your organization. Closely related to data consistency, the more unique your data is, the less you have to worry about
◍ Data Quality Validation Methods. Answer: - Cross-validation
◍ Schema. Answer: an ERD with the additional details needed to create a database.
◍ Column-family databases. Answer: use an index to identify data in groups of related columns. They optimize performance when you need to examine the contents of a column across many rows.
◍ Graph databases. Answer: specialize in exploring relationships between pieces of data. Graph models map relationships between actual pieces of data. Graphs are an optimal choice if you need to create a recommendation engine, as graphs excel at exploring relationships between data.
◍ Normalization Process. Answer: Objective is to ensure that each table conforms to the concept of well-formed relations
design complex surveys without worrying about building a database. Qualtrics is a powerful tool for developing and administering surveys. What makes Qualtrics so compelling is its API, which you can use to integrate survey response data into a data warehouse for additional analysis.
◍ Data manipulation in SQL. Answer: CRUD (Create = INSERT, Read = SELECT, Update = UPDATE, Delete = DELETE)
◍ Common SQL aggregate functions. Answer: COUNT, MIN, MAX, AVG, SUM, STDDEV
◍ Parametrization. Answer: using variables in a query
◍ DB Index. Answer: a database index can point to a single column or multiple columns. When running queries on large tables, it is ideal if all of the columns you are retrieving exist in the index. If that is not feasible, you at least want the first column in your SELECT statement to be covered by an index.
◍ Star and Snowflake Schema. Answer: analytical databases prioritize reading data and follow a denormalized approach. The star and snowflake schema designs are two approaches to structuring data for analysis. Both methods implement dimensional modeling, which organizes quantitative data into facts and qualitative data into dimensions.
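The CRUD, aggregate function, parametrization, and index cards above can be tried together in one short session. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `sales` table, its columns, and the index name are hypothetical. Note that SQLite has no built-in STDDEV, so that aggregate is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
# Index on the column used for filtering below.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Create (INSERT): the ? placeholders are query parameters.
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100.0), ("East", 300.0), ("West", 50.0)],
)

# Update (UPDATE)
conn.execute(
    "UPDATE sales SET amount = ? WHERE region = ? AND amount = ?",
    (150.0, "East", 100.0),
)

# Delete (DELETE)
conn.execute("DELETE FROM sales WHERE region = ?", ("West",))

# Read (SELECT) with aggregate functions
summary = conn.execute(
    "SELECT COUNT(*), MIN(amount), MAX(amount), AVG(amount), SUM(amount) "
    "FROM sales WHERE region = ?",
    ("East",),
).fetchone()
print(summary)
```

Passing values as parameters instead of building the SQL string by hand keeps the query text constant across runs and avoids SQL injection.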
◍ Historical Analysis. Answer: Although an effective date approach is valid, the SQL queries to retrieve a value at a specific point in time are complex. A table design that adds start date and end date columns allows for more straightforward queries. Enhancing the design with a current flag column makes analytical queries even easier to write.
◍ Specification Mismatch. Answer: occurs when an individual component's characteristics are beyond the range of acceptable values. OR specification mismatch occurs when data doesn't conform to its destination data type. For example, you might be loading data from a file into a database. If the destination column is numeric and you have text data, you'll end up with a specification mismatch. To resolve this mismatch, you must validate that the inbound data consistently maps to its target data type.
◍ Recoding Data. Answer: technique you can use to map original values for a variable into new values to facilitate analysis. Recoding groups data into multiple categories, creating a categorical variable. A categorical variable is either nominal or ordinal. Nominal variables are any variables with two or more categories where there is no natural order of the categories, like hair color or eye color. Ordinal variables are categories with an inherent rank.
◍ Derived Variable. Answer: new variable resulting from a calculation on an existing variable.
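The historical analysis card above can be illustrated with a small table that carries start date, end date, and current flag columns. This is a sketch using `sqlite3`; the `product_price` table, the helper names, and the use of `9999-12-31` as an open-ended end date are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_price ("
    "product_id INTEGER, price REAL, "
    "start_date TEXT, end_date TEXT, is_current INTEGER)"
)
# ISO-8601 date strings sort correctly as text.
# '9999-12-31' is an assumed sentinel meaning "no end date yet".
conn.executemany(
    "INSERT INTO product_price VALUES (?, ?, ?, ?, ?)",
    [
        (1, 9.99, "2023-01-01", "2023-06-30", 0),
        (1, 11.49, "2023-07-01", "9999-12-31", 1),
    ],
)

def price_on(conn, product_id, as_of):
    """Point-in-time lookup: the row whose date range contains as_of."""
    row = conn.execute(
        "SELECT price FROM product_price "
        "WHERE product_id = ? AND ? BETWEEN start_date AND end_date",
        (product_id, as_of),
    ).fetchone()
    return row[0] if row else None

def current_price(conn, product_id):
    """The current flag avoids any date arithmetic for the latest value."""
    row = conn.execute(
        "SELECT price FROM product_price WHERE product_id = ? AND is_current = 1",
        (product_id,),
    ).fetchone()
    return row[0] if row else None

print(price_on(conn, 1, "2023-03-15"))
print(current_price(conn, 1))
```

With only a single effective date column, the same point-in-time lookup would need a subquery to find the latest effective date on or before the requested day, which is the complexity the card refers to.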
◍ Reduction. Answer: process of shrinking an extensive dataset without negatively impacting its analytical value. There are a variety of reduction techniques from which you can choose. Selecting a method depends on the type of data you have and what you are trying to analyze. Dimensionality reduction and numerosity reduction are two techniques for data reduction. Dimensionality reduction removes attributes from a dataset. Numerosity reduction reduces the overall volume of data.
◍ Aggregation. Answer: summarization of raw data for analysis. OR also a means of controlling privacy.
◍ Transposition. Answer: Transposing data is when you want to turn rows into columns or columns into rows to facilitate analysis.
◍ Data Profiling. Answer: statistical measures to check for data discrepancies, including values that are missing, that occur either infrequently or too frequently, or that should be eliminated. Profiling can also identify irregular patterns within your data.
◍ Data Audits. Answer: look at your data and help you understand whether or not you have the data you need to operate your business. Data audits use data profiling techniques and can help identify data integrity and security issues.
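The transposition and data profiling cards above can be shown on a tiny table. The sketch below uses plain Python lists; the table contents and the `transpose` and `count_missing` helpers are hypothetical, and counting missing values is only one of the profiling checks the card lists.

```python
def transpose(table):
    """Turn rows into columns (and columns into rows)."""
    return [list(column) for column in zip(*table)]

def count_missing(table):
    """Profile a table with a header row: number of missing (None) values per column."""
    header, *data = table
    return {
        name: sum(1 for row in data if row[i] is None)
        for i, name in enumerate(header)
    }

# Hypothetical quarterly figures; None marks a missing value.
table = [
    ["region", "q1", "q2"],
    ["East", 10, 12],
    ["West", 7, None],
]

print(transpose(table))
print(count_missing(table))
```

The transposed form puts each original column on its own row, which is often the more convenient shape for charting one series at a time.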
◍ Cross-Validation. Answer: statistical technique that evaluates how well predictive models perform. Cross-validation works by dividing data into two subsets. The first subset is the training set, and the second is the testing, or validation, set. You use data from the training set to build a predictive model. You then cross-validate the model using the testing subset to determine how accurate the prediction is. Cross-validation is also helpful in identifying data sampling issues. Cross-validation can help identify sampling bias since predictions using biased data are inaccurate.
◍ Skewed Distribution. Answer: has an asymmetrical shape, with a single peak and a long tail on one side. Skewed distributions have either a right (positive) or left (negative) skew. When the skew is to the right, the mean is typically greater than the median. On the other hand, a distribution with a left skew typically has a mean less than the median.
◍ Bimodal Distribution. Answer: has two distinct modes, whereas a multimodal distribution has multiple distinct modes. When you visualize a bimodal distribution, you see two separate peaks. Suppose you are analyzing the number of customers at a restaurant over time. You would expect to see large numbers of customers at lunch and dinner.
◍ Standard Normal Distribution (Z-distribution). Answer: a special normal distribution with a mean of 0 and a standard deviation of 1. You can standardize any normal distribution by converting its values into z-scores. Converting to the standard normal lets you compare normal distributions with different means and standard deviations.
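The standard normal distribution card above mentions converting values into z-scores. A minimal sketch of that conversion follows, assuming the data set is treated as the whole population (so the population standard deviation is used); the `z_scores` name and the sample values are illustrative.

```python
import statistics

def z_scores(values):
    """Standardize values: z = (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: values are the full population, so pstdev (divide by n) is used.
    std_dev = statistics.pstdev(values)
    return [(x - mean) / std_dev for x in values]

# Hypothetical data with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(values)

print(standardized)
print(statistics.mean(standardized), statistics.pstdev(standardized))
```

After standardizing, the values have a mean of 0 and a standard deviation of 1, so a z-score of 2 means "two standard deviations above the mean" regardless of the original units.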