Main Concepts
- Introduction
- Features
- Classification
- Clustering
- MLlib pipeline concept
- Pros and cons
- Installation
- Demo/usecases
- Conclusion
- Resources
Introduction
MLlib is Apache Spark's machine learning library.
High-level goals:
- Make practical ML scalable and easy
- Simplify the development and deployment of scalable machine learning pipelines
- The primary ML API for Spark is now DataFrame-based, and the MLlib RDD-based API is in maintenance mode.
- What are the implications?
  - MLlib will still support the RDD-based API in spark.mllib with bug fixes
  - MLlib will not add new features to the RDD-based API
  - The RDD-based API is expected to be removed in Spark 3
- Why is MLlib switching to the DataFrame-based API?
  - DataFrames provide a more user-friendly API than RDDs
  - DataFrames facilitate practical ML pipelines, particularly feature transformations
Features
Classification
- MLlib provides algorithms for classification, such as Decision Tree, Naive Bayes, …
- Naive Bayes is a simple probabilistic classifier with independence assumptions between every pair of features
Clustering
- K-means is one of the most commonly used clustering algorithms; it clusters data points into a predefined number of clusters
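To make the K-means bullet concrete, here is a minimal plain-Python sketch of the classic assign-and-update loop (Lloyd's algorithm). It is an illustration only, not MLlib's distributed implementation; the function names `squared_distance`, `assign`, and `kmeans`, the `max_iterations` and `seed` parameters, and the sample points are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def kmeans(points, k, max_iterations=20, seed=0):
    """Cluster points into k groups; k is fixed in advance (hypothetical helper)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the center to the mean of its assigned points.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old center if no point was assigned to it.
                new_centers.append(centers[i])
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels = kmeans(sample, k=2)
    print(centers, labels)
```

The number of clusters `k` is the "predefined number of clusters" from the slide: the algorithm never changes it, it only moves the centers.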
MLlib Pipeline Concept
[Diagram: Load/Clean Data, Transformer, Estimator, Evaluator]
- Evaluate the model performance
- Help with automating the model tuning process
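The two points above (scoring a model, then using that score to automate tuning) can be sketched in plain Python. This is a simplified illustration, not the MLlib API; in Spark the corresponding tools are, as far as I know, the evaluator classes together with `CrossValidator` and `TrainValidationSplit`. The names `accuracy`, `tune`, `build_model`, and the threshold classifier below are hypothetical.

```python
def accuracy(predictions, labels):
    """Evaluator: fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def tune(param_grid, build_model, evaluator, validation):
    """Try every candidate parameter and keep the best-scoring one."""
    xs = [x for x, _ in validation]
    ys = [y for _, y in validation]
    best_param, best_score = None, float("-inf")
    for param in param_grid:
        model = build_model(param)  # a callable: feature -> prediction
        score = evaluator([model(x) for x in xs], ys)
        if score > best_score:
            best_param, best_score = param, score
    return best_param, best_score


def threshold_classifier(threshold):
    """Hypothetical toy model: predict 1 when the feature reaches the threshold."""
    return lambda x: 1 if x >= threshold else 0


if __name__ == "__main__":
    validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
    print(tune([0.2, 0.5, 0.8], threshold_classifier, accuracy, validation))
```

Keeping the evaluator as a separate function is the design point: the tuning loop does not need to know which metric it is optimizing.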
Pipeline
- Represents an ML workflow
- Consists of a sequence of stages
- Leverages the uniform API of transformers and estimators
- Is itself a type of estimator (it provides fit())
- Can be persisted
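The pipeline idea in the list above can be sketched with toy classes in plain Python. This mirrors the transformer/estimator pattern but is not the pyspark.ml API; `Tokenizer`, `VocabularyCounter`, `VocabularyModel`, `Pipeline`, and `PipelineModel` here are small stand-ins written for this example, operating on lists of dicts instead of DataFrames.

```python
class Tokenizer:
    """Transformer: only transform(), nothing to learn from the data."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabularyModel:
    """Fitted model (a transformer) produced by VocabularyCounter.fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            dict(row, features=[row["words"].count(w) for w in self.vocab])
            for row in rows
        ]


class VocabularyCounter:
    """Estimator (hypothetical): fit() learns a vocabulary from the data."""

    def fit(self, rows):
        vocab = sorted({w for row in rows for w in row["words"]})
        return VocabularyModel(vocab)


class PipelineModel:
    """Result of Pipeline.fit(): a chain of transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """A pipeline is itself an estimator: fit() returns a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a fitted model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


if __name__ == "__main__":
    training = [{"text": "spark is fast"}, {"text": "Spark MLlib"}]
    model = Pipeline([Tokenizer(), VocabularyCounter()]).fit(training)
    print(model.transform([{"text": "fast fast spark"}]))
```

Because every stage exposes the same `fit()`/`transform()` methods, the pipeline can run arbitrary stage sequences without knowing what each stage does.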
Extraction
- Extracting features from raw data
- Word2Vec is an estimator which takes sequences of words representing documents
- The model maps each word to a unique fixed-size vector
- This vector can then be used as features for prediction, document similarity calculations, …
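As a rough illustration of how per-word vectors become document features, the sketch below averages the vectors of a document's words and compares two documents by cosine similarity. It is plain Python with made-up two-dimensional vectors, not a trained Word2Vec model, and it simply skips unknown words, which may differ from how MLlib treats them; `document_vector`, `cosine_similarity`, and `word_vectors` are hypothetical names.

```python
import math


def document_vector(words, word_vectors, size):
    """Average the vectors of the known words in a document."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * size  # no known words: fall back to the zero vector
    return [sum(col) / len(known) for col in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up vectors standing in for a trained model's output.
    word_vectors = {"spark": [1.0, 0.0], "mllib": [0.0, 1.0]}
    doc_a = document_vector(["spark", "mllib"], word_vectors, size=2)
    doc_b = document_vector(["spark"], word_vectors, size=2)
    print(doc_a, doc_b, cosine_similarity(doc_a, doc_b))
```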
Pros & Cons
- Pros
  - Scalability
  - Performance
  - User-friendly APIs
  - Integration with Spark SQL, Streaming & GraphX
- Cons
  - Configurability
  - Reliability
  - High memory consumption
Installing…
- spark-2.3.0-bin-hadoop2.7.tgz
- jdk-8u162-macosx-x64.dmg
- scala-2.11.12.tgz
Coding