Cloudera CDP Data Analyst Practice Exam, Exams of Technology

Designed for data analysts, this practice exam includes topics such as BI reporting, SQL analytics, data exploration, dashboarding, Hive/Impala usage, workload optimization, query design, and interpreting large-scale datasets within CDP environments.

Typology: Exams

2025/2026

Available from 01/06/2026

shilpi-jain-1
shilpi-jain-1 🇮🇳

4.2

(5)

29K documents

1 / 89

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Cloudera CDP Data Analyst Practice Exam
**Question 1. Which Cloudera tool is primarily used for creating interactive visual reports and
dashboards?**
A) Apache Hive
B) Cloudera Data Visualization (CDV)
C) Apache Ranger
D) Apache Atlas
Answer: B
Explanation: CDV provides draganddrop charting, report publishing, and dashboard capabilities,
whereas Hive is for SQL, Ranger for security, and Atlas for metadata.
**Question 2. In CDV, which chart type is best suited for showing the distribution of a numeric field
across predefined ranges?**
A) Line chart
B) Pie chart
C) Histogram
D) Scatter plot
Answer: C
Explanation: A histogram bins numeric values into intervals, revealing the distribution; line charts show
trends, pie charts show parts of a whole, and scatter plots show relationships between two variables.
**Question 3. When building a dashboard, what is the purpose of a “filter widget”?**
A) To change the underlying data source
B) To allow endusers to dynamically limit the data displayed in visualizations
C) To export the dashboard as a PDF
D) To increase the refresh rate of the dashboard
Answer: B
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56
pf57
pf58
pf59

Partial preview of the text

Download Cloudera CDP Data Analyst Practice Exam and more Exams Technology in PDF only on Docsity!

Question 1. Which Cloudera tool is primarily used for creating interactive visual reports and dashboards? A) Apache Hive B) Cloudera Data Visualization (CDV) C) Apache Ranger D) Apache Atlas Answer: B Explanation: CDV provides drag‑and‑drop charting, report publishing, and dashboard capabilities, whereas Hive is for SQL, Ranger for security, and Atlas for metadata. Question 2. In CDV, which chart type is best suited for showing the distribution of a numeric field across predefined ranges? A) Line chart B) Pie chart C) Histogram D) Scatter plot Answer: C Explanation: A histogram bins numeric values into intervals, revealing the distribution; line charts show trends, pie charts show parts of a whole, and scatter plots show relationships between two variables. Question 3. When building a dashboard, what is the purpose of a “filter widget”? A) To change the underlying data source B) To allow end‑users to dynamically limit the data displayed in visualizations C) To export the dashboard as a PDF D) To increase the refresh rate of the dashboard Answer: B

Explanation: Filter widgets let users select values (e.g., dates, categories) that are applied to all charts on the dashboard, enabling interactive analysis. Question 4. Which SQL clause in Impala is used to limit the number of rows returned by a query? A) LIMIT B) TOP C) FETCH D) SAMPLE Answer: A Explanation: Impala follows standard SQL syntax where LIMIT specifies the maximum rows to return; TOP is not supported, FETCH is part of FETCH FIRST, and SAMPLE is for random sampling. Question 5. In Hive, which statement creates a new table that does not store data in the Hive warehouse directory? A) CREATE MANAGED TABLE B) CREATE EXTERNAL TABLE C) CREATE TEMPORARY TABLE D) CREATE VIEW Answer: B Explanation: EXTERNAL tables point to data stored outside the Hive warehouse, preserving data when the table is dropped; MANAGED tables store data inside the warehouse. Question 6. Which join type returns all rows from the left table and matching rows from the right table, filling non‑matches with NULLs? A) INNER JOIN B) LEFT OUTER JOIN C) RIGHT OUTER JOIN

B) Tag Sync C) Auditing Service D) Metadata Service Answer: A Explanation: The Policy Engine checks user, group, and resource attributes against defined policies before allowing access. Question 10. In Apache Atlas, what does a “process” entity represent? A) A user who accesses data B) A transformation or job that moves data from one dataset to another C) A storage location on HDFS D) A security policy definition Answer: B Explanation: Process entities capture lineage by linking input and output dataset entities, documenting how data is transformed. Question 11. Which role is primarily responsible for defining and approving data quality rules in a CDP environment? A) Data Engineer B) Data Steward C) System Administrator D) Business Analyst Answer: B Explanation: Data Stewards manage data quality, metadata standards, and compliance, whereas engineers build pipelines, admins manage infrastructure, and analysts consume data.

Question 12. Which DDL statement adds a new column to an existing Impala table? A) ALTER TABLE table_name ADD COLUMNS (new_col INT); B) MODIFY TABLE table_name ADD COLUMN new_col INT; C) UPDATE TABLE table_name SET new_col = NULL; D) INSERT INTO table_name (new_col) VALUES (NULL); Answer: A Explanation: Impala uses ALTER TABLE … ADD COLUMNS to add one or more columns; MODIFY is not valid, UPDATE changes data, INSERT adds rows. Question 13. Which aggregate function returns the number of distinct values in a column? A) COUNT() B) SUM() C) COUNT(DISTINCT column) D) AVG() Answer: C Explanation: COUNT(DISTINCT column) counts unique, non‑NULL values; COUNT() counts rows, SUM adds numeric values, AVG computes the average. Question 14. When should you use the PARQUET file format for a Hive table? A) When you need row‑level updates B) When you require the smallest possible file size for text data C) When you want columnar storage for faster analytical queries D) When you need to store binary large objects (BLOBs) only Answer: C Explanation: Parquet stores data column‑wise, enabling predicate pushdown and compression, ideal for analytics; it does not support row‑level updates and isn’t optimal for raw text.

Explanation: ANALYZE … COMPUTE STATISTICS gathers row counts, distinct values, and column stats; the other commands serve different purposes. Question 18. In HDFS, what does the “replication factor” control? A) Number of partitions a file is split into B) Number of copies of each block stored across the cluster C) Number of times a file can be opened simultaneously D) Size of each block in megabytes Answer: B Explanation: Replication factor defines how many block replicas are kept for fault tolerance; partitions and block size are separate concepts. Question 19. Which statement correctly creates a managed Hive table that is partitioned by the column “event_date”? A) CREATE TABLE events (id INT, value STRING) PARTITIONED BY (event_date DATE); B) CREATE EXTERNAL TABLE events PARTITIONED BY (event_date DATE) LOCATION '/data/events'; C) CREATE TABLE events (id INT, value STRING, event_date DATE) CLUSTERED BY (event_date) INTO 10 BUCKETS; D) CREATE TABLE events (id INT, value STRING) STORED AS PARQUET; Answer: A Explanation: PARTITIONED BY defines a partition column; option B creates an external table, C uses bucketing, and D lacks partitioning. Question 20. When loading data into an external Hive table, which command is used to make Hive aware of newly added files? A) LOAD DATA INPATH '/path/file' INTO TABLE table_name; B) REFRESH table_name;

C) ALTER TABLE table_name RECOVER PARTITIONS; D) TRUNCATE TABLE table_name; Answer: C Explanation: RECOVER PARTITIONS scans the table’s location for new partition directories and adds them to the metastore; LOAD DATA works for managed tables. Question 21. In Cloudera Data Warehouse (CDW), what is a “virtual warehouse”? A) A physical server dedicated to a single user B) An isolated compute cluster that can be scaled up or down on demand C) A storage bucket in S3 used for backup D) A pre‑configured set of dashboards for analytics Answer: B Explanation: Virtual warehouses are elastic compute resources that run SQL workloads; they are not physical servers nor storage buckets. Question 22. Which CDW feature provides a searchable catalog of tables, columns, and policies? A) Data Catalog Service B) Ranger Admin UI C) Atlas Lineage Viewer D) Impala Query Profiler Answer: A Explanation: The Data Catalog Service centralizes metadata, enabling discovery and governance; Ranger and Atlas focus on security and lineage, respectively. Question 23. Which of the following is NOT a valid reason to choose ORC over Parquet for a Hive table?

A) To label datasets for policy inheritance based on metadata B) To encrypt data at rest C) To schedule periodic data backups D) To define network firewall rules Answer: A Explanation: Tag-based policies allow administrators to assign policies to resources that share a common tag, simplifying management. Question 27. Which Hive command removes a managed table and deletes its data from HDFS? A) DROP TABLE table_name; B) DELETE FROM table_name; C) TRUNCATE TABLE table_name; D) ALTER TABLE table_name SET TBLPROPERTIES ('external'='TRUE'); Answer: A Explanation: DROP TABLE on a managed table removes metadata and deletes the underlying data files; DELETE removes rows, TRUNCATE empties the table but keeps files, and ALTER changes properties. Question 28. When should you use a LEFT SEMI JOIN in Impala? A) To return all columns from both tables B) To filter rows from the left table based on existence of matching rows in the right table without returning right‑table columns C) To perform a full outer join efficiently D) To update rows in the left table using values from the right table Answer: B Explanation: LEFT SEMI JOIN behaves like an EXISTS subquery, returning only left‑table rows that have matches, without duplicating right‑table columns.

Question 29. Which of the following statements about Impala’s “INSERT … SELECT” is true? A) It can only insert into external tables. B) It always overwrites the target table. C) It supports partitioned target tables and automatically writes to the correct partitions. D) It cannot be used with views. Answer: C Explanation: INSERT … SELECT can target partitioned tables; Impala determines the correct partition directories based on the partition columns. Question 30. In CDW, what does “auto‑scaling” refer to? A) Automatically increasing the size of HDFS blocks B) Dynamically adding or removing compute nodes for a virtual warehouse based on workload demand C) Changing the file format of tables from TEXT to PARQUET automatically D) Adjusting Ranger policies based on user activity Answer: B Explanation: Auto‑scaling adjusts the number of compute nodes in a virtual warehouse to meet query concurrency and latency requirements. Question 31. Which of the following is a benefit of using materialized views in Hive? A) They store query results physically, improving performance for repetitive queries B) They enforce primary key constraints C) They automatically replicate data across clusters D) They replace the need for partitioning Answer: A

D) Impala cannot query the table Answer: B Explanation: ORC is a columnar, highly compressed format that improves read/write performance for Hive; it does not affect table type or Impala compatibility (Impala can read ORC). Question 35. Which of the following best describes “data lineage” in Apache Atlas? A) The physical location of a file on HDFS B) The sequence of processes that created, transformed, and moved a dataset C) The list of users who accessed a dataset D) The encryption keys used for securing data Answer: B Explanation: Lineage tracks how data flows through processes, showing origins, transformations, and destinations. Question 36. In Ranger, a “policy condition” allows you to A) Encrypt data at rest automatically B) Apply additional attribute‑based rules (e.g., time of day, IP address) to a policy C) Replicate data across zones D) Generate data quality metrics Answer: B Explanation: Policy conditions enable fine‑grained, attribute‑based access control beyond basic user/group/resource definitions. Question 37. Which SQL function would you use to convert a string ‘2023‑ 03 ‑ 15 ’ to a DATE type in Impala? A) CAST('2023‑ 03 ‑15' AS DATE)

B) TO_DATE('2023‑ 03 ‑15')

C) DATE_PARSE('2023‑ 03 ‑15', '%Y-%m-%d') D) PARSE_DATE('2023‑ 03 ‑15') Answer: C Explanation: DATE_PARSE allows specifying the format; CAST works if the string is already in ISO format, but Impala recommends DATE_PARSE for explicit patterns. Question 38. Which of the following is true about “partition pruning”? A) It removes duplicate rows from a partitioned table B) It allows Impala to skip reading partitions that do not satisfy the WHERE clause C) It automatically merges small partitions into larger ones D) It is only applicable to external tables Answer: B Explanation: Partition pruning evaluates filter predicates on partition columns and reads only the matching partitions, reducing I/O. Question 39. In CDP, which service provides centralized authentication and single sign‑on for the platform? A) Apache Ranger B) Cloudera Manager C) Apache Sentry D) Cloudera Identity Management (CIM) Answer: D Explanation: CIM (or Cloudera SSO) handles authentication and SSO across CDP services; Ranger focuses on authorization, Manager on cluster ops, Sentry is deprecated.

Explanation: Hive does NOT manage the lifecycle (deletion) of external table files; the user is responsible. All other statements are true. Question 43. In CDW, what does the “catalog sync” feature do? A) Replicates data across multiple regions B) Synchronizes metadata between CDW and external Hive metastore C) Encrypts all tables automatically D) Generates visual dashboards from SQL queries Answer: B Explanation: Catalog sync keeps the CDW metadata store consistent with an external Hive metastore, enabling unified discovery. Question 44. Which of the following is a correct way to create a Hive table that stores data as compressed Parquet files? A) CREATE TABLE t (id INT) STORED AS TEXTFILE; B) CREATE TABLE t (id INT) STORED AS PARQUET; C) CREATE TABLE t (id INT) LOCATION '/parquet/t' COMPRESSED; D) CREATE TABLE t (id INT) USING ORC; Answer: B Explanation: STORED AS PARQUET specifies the Parquet format; compression is applied automatically based on Hive settings. Option A uses TEXTFILE, C is syntactically wrong, D uses ORC. Question 45. Which Impala function returns the current timestamp in UTC? A) NOW() B) CURRENT_TIMESTAMP() C) UTC_TIMESTAMP()

D) SYSDATE()

Answer: C Explanation: UTC_TIMESTAMP() returns the current time in UTC; NOW() and CURRENT_TIMESTAMP() return the session time zone, SYSDATE() is not an Impala function. Question 46. What is the purpose of the “CLUSTER BY” clause in Hive? A) To sort data within each reducer output and ensure a single reducer B) To define the columns used for bucketing the table C) To create a temporary view of the data D) To enforce a primary key constraint Answer: A Explanation: CLUSTER BY guarantees that rows with the same value go to the same reducer and sorts them, effectively acting like a DISTRIBUTE BY + SORT BY combo. Question 47. Which of the following best describes “row‑level security” in Ranger? A) Controlling which rows a user can see based on column values B) Encrypting each row with a different key C) Replicating rows across multiple nodes for fault tolerance D) Limiting the number of rows returned per query Answer: A Explanation: Row‑level security (or column‑masking) filters results so users only see rows that satisfy a predicate defined in a policy. Question 48. In Hive, which command creates a temporary view that exists only for the duration of the current session? A) CREATE VIEW temp_view AS SELECT …;

Question 51. Which Hive property controls whether INSERT … SELECT on a partitioned table overwrites existing partitions? A) hive.exec.dynamic.partition.mode B) hive.exec.insert.overwrite C) hive.auto.convert.join D) hive.exec.max.dynamic.partitions.pernode Answer: B Explanation: hive.exec.insert.overwrite determines if INSERT … SELECT replaces data in the target partition; the other properties relate to dynamic partitioning and joins. Question 52. Which of the following statements about Impala’s “Kudu” integration is true? A) Kudu tables can only be read, not written, from Impala B) Kudu provides low‑latency random reads and writes, suitable for OLTP workloads C) Kudu stores data in ORC format only D) Impala cannot perform joins on Kudu tables Answer: B Explanation: Kudu is designed for fast random reads/writes and works bidirectionally with Impala; it does not use ORC and joins are supported. Question 53. In Ranger, what does the “masking” feature do? A) Hides entire tables from unauthorized users B) Replaces sensitive column values with a masked representation (e.g., ****) at query time C) Encrypts data at rest automatically D) Deletes rows that contain sensitive information Answer: B

Explanation: Masking rewrites query results to hide sensitive data while still allowing the query to execute. Question 54. Which SQL clause would you use to rename a column in an existing Hive table? A) ALTER TABLE table_name RENAME COLUMN old_name TO new_name; B) MODIFY COLUMN old_name new_name; C) UPDATE TABLE table_name SET column = new_name; D) ALTER TABLE table_name CHANGE old_name new_name STRING; Answer: D Explanation: Hive uses ALTER TABLE … CHANGE to rename a column and optionally change its type; option A is not valid Hive syntax. Question 55. What is the main purpose of the “Data Lineage” feature in Apache Atlas? A) To enforce row‑level security policies B) To visualize the flow of data between datasets and processes for audit and impact analysis C) To compress data files automatically D) To manage user authentication across CDP services Answer: B Explanation: Lineage captures provenance, showing how data moves and transforms, aiding governance and troubleshooting. Question 56. Which of the following best describes “predicate pushdown” in the context of Parquet files? A) Storing predicates as metadata within the file header B) Allowing the query engine to evaluate filter conditions while reading only necessary column chunks C) Converting all predicates to OR conditions automatically