
















































Machine Learning 2026-2027
Typology: Quizzes

















































Question: 1
A Machine Learning Specialist is working with multiple data sources containing billions of records that need to be joined. What feature engineering and model development approach should the Specialist take with a dataset this large?
A. Use an Amazon SageMaker notebook for both feature engineering and model development.
B. Use an Amazon SageMaker notebook for feature engineering and Amazon ML for model development.
C. Use Amazon EMR for feature engineering and Amazon SageMaker SDK for model development.
D. Use Amazon ML for both feature engineering and model development.
Answer(s): C

Question: 2
A Machine Learning Specialist is deciding between building a naive Bayesian model or a full Bayesian network for a classification problem. The Specialist computes the Pearson correlation coefficients between each pair of features and finds that their absolute values range from 0.1 to 0.95. Which model describes the underlying data in this situation?
A. A naive Bayesian model, since the features are all conditionally independent.
B. A full Bayesian network, since the features are all conditionally independent.
C. A naive Bayesian model, since some of the features are statistically dependent.
D. A full Bayesian network, since some of the features are statistically dependent.
Answer(s): D

Question: 3
A Data Scientist is building a linear regression model and will use the resulting p-values to evaluate the statistical significance of each coefficient. Upon inspection of the dataset, the Data Scientist discovers that most of the features are normally distributed. The plot of one feature in the dataset is shown in the graphic. What transformation should the Data Scientist apply to satisfy the statistical assumptions of the linear regression model?
A. Exponential transformation
B. Logarithmic transformation
C. Polynomial transformation
D. Sinusoidal transformation
Answer(s): A

Question: 4
A Machine Learning Specialist is assigned to a Fraud Detection team and must tune an XGBoost model, which is working appropriately for test data. However, with unknown data, it is not working as expected. The existing parameters are provided as follows. Which parameter tuning guidelines should the Specialist follow to avoid overfitting?
A. Increase the max_depth parameter value.
B. Lower the max_depth parameter value.
C. Update the objective to binary:logistic.
D. Lower the min_child_weight parameter value.
Answer(s): B

Question: 5
A data scientist is developing a pipeline to ingest streaming web traffic data. The data scientist needs to implement a process to identify unusual web traffic patterns as part of the pipeline. The patterns will be used downstream for alerting and incident response. The data scientist has access to unlabeled historic data to use, if needed. The solution needs to do the following:
(ML) model and the resources deployed for hosting the model. To test multiple models by distributing traffic between them, specify the percentage of the traffic that gets routed to each model by specifying the weight for each production variant in the endpoint configuration.
Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html#model-testing-target-variant

Question: 8
An agricultural company is interested in using machine learning to detect specific types of weeds in a 100-acre grassland field. Currently, the company uses tractor-mounted cameras to capture multiple images of the field as 10 × 10 grids. The company also has a large training dataset that consists of annotated images of popular weed classes like broadleaf and non-broadleaf docks. The company wants to build a weed detection model that will detect specific types of weeds and the location of each type within the field. Once the model is ready, it will be hosted on Amazon SageMaker endpoints. The model will perform real-time inferencing using the images captured by the cameras. Which approach should a Machine Learning Specialist take to obtain accurate predictions?
A. Prepare the images in RecordIO format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an image classification algorithm to categorize images into various weed classes.
B. Prepare the images in Apache Parquet format and upload them to Amazon S3. Use Amazon SageMaker to train, test, and validate the model using an object-detection single-shot multibox detector (SSD) al
Answer(s): C

Question: 9
A manufacturer is operating a large number of factories with a complex supply chain relationship where unexpected downtime of a machine can cause production to stop at several factories. A data scientist wants to analyze sensor data from the factories to identify equipment in need of preemptive maintenance and then dispatch a service team to prevent unplanned downtime.
The sensor readings from a single machine can include up to 200 data points including temperatures, voltages, vibrations, RPMs, and pressure readings. To collect this sensor data, the manufacturer deployed Wi-Fi and LANs across the factories. Even though many factory locations do not have reliable or high-speed internet connectivity, the manufacturer would like to maintain near-real-time inference capabilities.
Which deployment architecture for the model will address these business requirements?
A. Deploy the model in Amazon SageMaker. Run sensor data through this model to predict which machines need maintenance.
B. Deploy the model on AWS IoT Greengrass in each factory. Run sensor data through this model to infer which machines need maintenance.
C. Deploy the model to an Amazon SageMaker batch transformation
Answer(s): B
Reference: https://aws.amazon.com/blogs/iot/industrial-iot-from-condition-based-monitoring-to-predictive-quality-to-digitize-your-factory-with-aws-iot-services/
https://aws.amazon.com/blogs/iot/using-aws-iot-for-predictive-maintenance/

Question: 10
A Machine Learning Specialist is designing a scalable data storage solution for Amazon SageMaker. There is an existing TensorFlow-based model implemented as a train.py script that relies on static training data that is currently stored as TFRecords. Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead?
A. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data.
B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data.
C. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords.
D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
Answer(s): B
Reference: https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-horovod-inference-pipeline/train.py

Question: 11
The chief editor for a product catalog wants the research and development team to build a
Answer(s): A

Question: 14
An e-commerce company wants to launch a new cloud-based product recommendation feature for its web application. Due to data localization regulations, any sensitive data must not leave its on-premises data center, and the product recommendation model must be trained and tested using nonsensitive data only. Data transfer to the cloud must use IPsec. The web application is hosted on premises with a PostgreSQL database that contains all the data. The company wants the data to be uploaded securely to Amazon S3 each day for model retraining. How should a machine learning specialist meet these requirements?
A. Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest tables without sensitive data through an AWS Site-to-Site VPN connection directly into Amazon S3.
B. Create an AWS Glue job to connect to the PostgreSQL DB instance. Ingest all data through an AWS Site-to-Site VPN connection into Amazon S3 while removing sensitive data using a PySpark job.
C. Use AWS Database Migration Service (AWS DMS) with table mapping to select PostgreSQL tables with no sensitive data through an SSL connection. Replicate data directly into Amazon S3.
D. Use PostgreSQL logical repl
Answer(s): C
Reference: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html

Question: 15
A logistics company needs a forecast model to predict next month's inventory requirements for a single item in 10 warehouses. A machine learning specialist uses Amazon Forecast to develop a forecast model from 3 years of monthly data. There is no missing data. The specialist selects the DeepAR+ algorithm to train a predictor. The predictor's mean absolute percentage error (MAPE) is much larger than the MAPE produced by the current human forecasters. Which changes to the CreatePredictor API call could improve the MAPE? (Choose two.)
A. Set PerformAutoML to true.
B. Set ForecastHorizon to 4.
C. Set ForecastFrequency to W for weekly.
D. Set PerformHPO to true.
E. Set FeaturizationMethodName to filling.
Answer(s): C,D
Reference: https://docs.aws.amazon.com/forecast/latest/dg/forecast.dg.pdf

Question: 16
A data scientist wants to use Amazon Forecast to build a forecasting model for inventory demand for a retail company. The company has provided a dataset of historic inventory demand for its products as a .csv file stored in an Amazon S3 bucket. The table below shows a sample of the dataset. How should the data scientist transform the data?
A. Use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. Upload both datasets as .csv files to Amazon S3.
B. Use a Jupyter notebook in Amazon SageMaker to separate the dataset into a related time series dataset and an item metadata dataset. Upload both datasets as tables in Amazon Aurora.
C. Use AWS Batch jobs to separate the dataset into a target time series dataset, a related time series dataset, and an item metadata dataset. Upload them directly to Forecast from a local machine.
D. Use a Jupyter notebook in Amazon SageMaker to transform the data into the optimized protobuf recordIO format. Upload the dataset in this format to Amazon S3.
Answer(s): A
Reference: https://docs.aws.amazon.com/forecast/latest/dg/dataset-import-guidelines-troubleshooting.html

Question: 17
A machine learning specialist is running an Amazon SageMaker endpoint using the built-in object detection algorithm on a P3 instance for real-time predictions in a company's production application. When evaluating the model's resource utilization, the specialist notices that the model is using only a fraction of the GPU. Which architecture changes would ensure that provisioned resources are being utilized effectively?
A. Redeploy the model as a batch transform job on an M5 instance.
B. Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the
B. Create a FindMatches machine learning transform in AWS Glue.
C. Create an AWS Glue crawler to infer duplicate accounts in the source data.
D. Search for duplicate accounts in the AWS Glue Data Catalog.
Answer(s): B
Reference: https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html

Question: 20
A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations. The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist needs to reduce the number of false negatives. Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)
A. Change the XGBoost eval_metric parameter to optimize based on Root Mean Square Error (RMSE).
B. Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.
C. Increase the XGBoost max_depth parameter because the model is currently underfitting the data.
D. Change the XGBoost eval_metric parameter to optimize based on Area Under the ROC Curve (AUC).
E. Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.
Answer(s): B,D

Question: 21
A data scientist has developed a machine learning translation model for English to Japanese by using Amazon SageMaker's built-in seq2seq algorithm with 500,000 aligned sentence pairs. While testing with sample sentences, the data scientist finds that the translation quality is reasonable for an example as short as five words. However, the quality becomes unacceptable if the sentence is 100 words long. Which action will resolve the problem?
A. Change preprocessing to use n-grams.
B. Add more nodes to the recurrent neural network (RNN) than the largest sentence's word count.
C. Adjust hyperparameters related to the attention mechanism.
D. Choose a different weight initialization type.
Answer(s): C
Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/seq-2-seq-howitworks.html

Question: 22
A financial company is trying to detect credit card fraud. The company observed that, on average, 2% of credit card transactions were fraudulent. A data scientist trained a classifier on a year's worth of credit card transactions data. The model needs to identify the fraudulent transactions (positives) from the regular ones (negatives). The company's goal is to accurately capture as many positives as possible. Which metrics should the data scientist use to optimize the model? (Choose two.)
A. Specificity
B. False positive rate
C. Accuracy
D. Area under the precision-recall curve
E. True positive rate
Answer(s): D,E

Question: 23
A machine learning specialist is developing a proof of concept for government users whose primary concern is security. The specialist is using Amazon SageMaker to train a convolutional neural network (CNN) model for a photo classifier application. The specialist wants to protect the data so that it cannot be accessed and transferred to a remote host by malicious code accidentally installed on the training container. Which action will provide the MOST secure protection?
A. Remove Amazon S3 access permissions from the SageMaker execution role.
B. Encrypt the weights of the CNN model.
C. Encrypt the training and validation dataset.
D. Enable network isolation for training jobs.
Answer(s): D

Question: 24
A machine learning (ML) specialist wants to secure calls to the Amazon SageMaker Service
Answer(s): C
Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private.html

Question: 26
A media company with a very large archive of unlabeled images, text, audio, and video footage wishes to index its assets to allow rapid identification of relevant content by the Research team. The company wants to use machine learning to accelerate the efforts of its in-house researchers who have limited machine learning expertise. Which is the FASTEST route to index the assets?
A. Use Amazon Rekognition, Amazon Comprehend, and Amazon Transcribe to tag data into distinct categories/classes.
B. Create a set of Amazon Mechanical Turk Human Intelligence Tasks to label all footage.
C. Use Amazon Transcribe to convert speech to text. Use the Amazon SageMaker Neural Topic Model (NTM) and Object Detection algorithms to tag data into distinct categories/classes.
D. Use the AWS Deep Learning AMI and Amazon EC2 GPU instances to create custom models for audio transcription and topic modeling, and use object detection to tag data into distinct categories/classes.
Answer(s): A

Question: 27
A Data Scientist is developing a binary classifier to predict whether a patient has a particular disease based on a series of test results. The Data Scientist has data on 400 patients randomly selected from the population. The disease is seen in 3% of the population. Which cross-validation strategy should the Data Scientist adopt?
A. A k-fold cross-validation strategy with k=
B. A stratified k-fold cross-validation strategy with k=
C. A k-fold cross-validation strategy with k=5 and 3 repeats
D. An 80/20 stratified split between training and validation
Answer(s): B

Question: 28
A large company has developed a BI application that generates reports and dashboards using data collected from various operational metrics. The company wants to provide executives with an enhanced experience so they can use natural language to get data from the reports.
The company wants the executives to be able to ask questions using written and spoken interfaces. Which combination of services can be used to build this conversational interface? (Select THREE)
A. Alexa for Business
B. Amazon Connect
C. Amazon Lex
D. Amazon Polly
E. Amazon Comprehend
F. Amazon Transcribe
Answer(s): C,E,F

Question: 29
A Machine Learning Specialist is applying a linear least squares regression model to a dataset with 1,000 records and 50 features. Prior to training, the ML Specialist notices that two features are perfectly linearly dependent. Why could this be an issue for the linear least squares regression model?
A. It could cause the backpropagation algorithm to fail during training.
B. It could create a singular matrix during optimization, which fails to define a unique solution.
C. It could modify the loss function during optimization, causing it to fail during training.
D. It could introduce non-linear dependencies within the data, which could invalidate the linear assumptions of the model.
Answer(s): B

Question: 30
A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS. How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3?
A. Define security group(s) to allow all HTTP inbound/outbound traffic and assign those security group(s) to the Amazon SageMaker notebook instance.
B. Configure the Amazon SageMaker notebook instance to have access to the VPC. Grant permission in the KMS key policy to the notebook's KMS role.
C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role.
D. Assign the same KMS key used to encrypt data in Amazon S3 to the Amazon SageMaker notebook instance.
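
Question 29 above turns on what perfect linear dependence does to a least squares fit. The following is a minimal sketch, not part of the original quiz, using hypothetical toy data in which one feature is exactly twice the other. It shows that the matrix X^T X used by the normal equations has a zero determinant, and that two different weight vectors give identical predictions, so no unique solution exists.

```python
from fractions import Fraction

# Hypothetical toy data (an assumption for illustration):
# x2 is exactly 2 * x1, so the two features are perfectly linearly dependent.
x1 = [Fraction(v) for v in (1, 2, 3, 4, 5)]
x2 = [2 * v for v in x1]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


# Gram matrix X^T X for the two-column design matrix X = [x1, x2].
gram = [
    [dot(x1, x1), dot(x1, x2)],
    [dot(x2, x1), dot(x2, x2)],
]

# A zero determinant means X^T X is singular and cannot be inverted.
det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]


def predict(w1, w2):
    """Linear model without intercept: w1 * x1 + w2 * x2."""
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]


# Two different weight vectors that produce exactly the same predictions.
preds_a = predict(1, 0)
preds_b = predict(-1, 1)

print("determinant of X^T X:", det)
print("identical predictions:", preds_a == preds_b)
```

Exact fractions are used so the determinant is exactly zero rather than a small floating-point residue. In practice, dropping one of the dependent features or adding regularization are common ways to restore a unique solution.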
Question: 33
A Machine Learning team runs its own training algorithm on Amazon SageMaker. The training algorithm requires external assets. The team needs to submit both its own algorithm code and algorithm-specific parameters to Amazon SageMaker. What combination of services should the team use to build a custom algorithm in Amazon SageMaker? (Choose two.)
A. AWS Secrets Manager
B. AWS CodeStar
C. Amazon ECR
D. Amazon ECS
E. Amazon S3
Answer(s): C,E

Question: 34
A company uses a long short-term memory (LSTM) model to evaluate the risk factors of a particular energy sector. The model reviews multi-page text documents to analyze each sentence of the text and categorize it as either a potential risk or no risk. The model is not performing well, even though the Data Scientist has experimented with many different network structures and tuned the corresponding hyperparameters. Which approach will provide the MAXIMUM performance boost?
A. Initialize the words by term frequency-inverse document frequency (TF-IDF) vectors pretrained on a large collection of news articles related to the energy sector.
B. Use gated recurrent units (GRUs) instead of LSTM and run the training process until the validation loss stops decreasing.
C. Reduce the learning rate and run the training process until the training loss stops decreasing.
D. Initialize the words by word2vec embeddings pretrained on a large collection of news articles related to the energy sector.
Answer(s): D

Question: 35
A Machine Learning Specialist previously trained a logistic regression model using scikit-learn on a local machine, and the Specialist now wants to deploy it to production for inference only. What steps should be taken to ensure Amazon SageMaker can host a model that was trained locally?
A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR.
B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3.
C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub.
D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.
Answer(s): A

Question: 36
A trucking company is collecting live image data from its fleet of trucks across the globe. The data is growing rapidly and approximately 100 GB of new data is generated every day. The company wants to explore machine learning use cases while ensuring the data is only accessible to specific IAM users. Which storage option provides the most processing flexibility and will allow access control with IAM?
A. Use a database, such as Amazon DynamoDB, to store the images, and set the IAM policies to restrict access to only the desired IAM users.
B. Use an Amazon S3-backed data lake to store the raw images, and set up the permissions using bucket policies.
C. Set up Amazon EMR with Hadoop Distributed File System (HDFS) to store the files, and restrict access to the EMR instances using IAM policies.
D. Configure Amazon EFS with IAM policies to make the data available to Amazon EC2 instances owned by the IAM users.
Answer(s): B

Question: 37
A credit card company wants to build a credit scoring model to help predict whether a new credit card applicant will default on a credit card payment. The company has collected data from a large number of sources with thousands of raw attributes. Early experiments to train a classification model revealed that many attributes are highly correlated, the large number of features slows down the training speed significantly, and that there are some overfitting issues. The Data Scientist on this project would like to speed up the model training time without losing a lot of information from the original dataset.
Which feature engineering technique should the Data Scientist use to meet the objectives?
A. Run self-correlation on all features and remove highly correlated features
B. Normalize all numerical values to be between 0 and 1
C. Use an autoencoder or principal component analysis (PCA) to replace original features with new features
D. Cluster raw data using k-means and use sample data from each cluster to build a new dataset
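
Option C in Question 37 mentions principal component analysis (PCA) as a way to replace correlated features with fewer new ones. The sketch below is an illustration only, not part of the original quiz: it uses two hypothetical, highly correlated features and the closed-form eigenvalues of their 2 × 2 covariance matrix to show how much of the total variance a single principal component can retain. A real dataset with thousands of attributes would normally be standardized first and handled with a numerical library.

```python
import math

# Hypothetical toy data (an assumption for illustration):
# y is roughly 2 * x, so the two features are highly correlated.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


var_x = covariance(x, x)
var_y = covariance(y, y)
cov_xy = covariance(x, y)

# Pearson correlation between the two features.
correlation = cov_xy / math.sqrt(var_x * var_y)

# Eigenvalues of the symmetric covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
# Each eigenvalue is the variance captured along one principal component.
half_trace = (var_x + var_y) / 2
radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
lambda_1 = half_trace + radius  # variance along the first principal component
lambda_2 = half_trace - radius  # variance along the second principal component

# Share of the total variance kept if only the first component is retained.
explained_ratio = lambda_1 / (lambda_1 + lambda_2)

print(f"correlation: {correlation:.4f}")
print(f"variance explained by the first component: {explained_ratio:.4f}")
```

When features are this strongly correlated, one component carries nearly all of the variance, which is why replacing the original features with a few components can shorten training while losing little information.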
A. Logistic regression
B. Linear regression
C. K-means
D. Principal component analysis (PCA)
Answer(s): B

Question: 41
A Machine Learning Specialist wants to bring a custom algorithm to Amazon SageMaker. The Specialist implements the algorithm in a Docker container supported by Amazon SageMaker. How should the Specialist package the Docker container so that Amazon SageMaker can launch the training correctly?
A. Modify the bash_profile file in the container and add a bash command to start the training program
B. Use CMD config in the Dockerfile to add the training program as a CMD of the image
C. Configure the training program as an ENTRYPOINT named train
D. Copy the training program to directory /opt/ml/train
Answer(s): C

Question: 42
A Data Scientist needs to analyze employment data. The dataset contains approximately 10 million observations on people across 10 different features. During the preliminary analysis, the Data Scientist notices that income and age distributions are not normal. While income levels show a right skew as expected, with fewer individuals having a higher income, the age distribution also shows a right skew, with fewer older individuals participating in the workforce. Which feature transformations can the Data Scientist apply to fix the incorrectly skewed data? (Choose two.)
A. Cross-validation
B. Numerical value binning
C. High-degree polynomial transformation
D. Logarithmic transformation
E. One hot encoding
Answer(s): B,D

Question: 43
A Machine Learning Specialist is given a structured dataset on the shopping habits of a company's customer base. The dataset contains thousands of columns of data and hundreds of numerical columns for each customer. The Specialist wants to identify whether there are natural groupings for these columns across all customers and visualize the results as quickly as possible. What approach should the Specialist take to accomplish these tasks?
A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot.
B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot.
C. Embed the numerical features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a line graph.
D. Run k-means using the Euclidean distance measure for different values of k and create box plots for each numerical column within each cluster.
Answer(s): A

Question: 44
A Machine Learning Specialist is planning to create a long-running Amazon EMR cluster. The EMR cluster will have 1 master node, 10 core nodes, and 20 task nodes. To save on costs, the Specialist will use Spot Instances in the EMR cluster. Which nodes should the Specialist launch on Spot Instances?
A. Master node
B. Any of the core nodes
C. Any of the task nodes
D. Both core and task nodes
Answer(s): C

Question: 45
A company wants to predict the sale prices of houses based on available historical sales data. The target variable in the company's dataset is the sale price. The features include parameters such as the lot size, living area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built, and postal code. The company wants to use multi-variable linear regression to predict house sale prices. Which step should a machine learning specialist take to remove features that are irrelevant for the analysis and reduce the model's complexity?
A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance.
B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance.
C. Build a heatmap showing the correlation of the dataset against itself. Remove features