Presentation about Apache Spark MLlib, graphics and mind maps for Data Analysis (and Programming) with R

2021/2022

Uploaded on 03.12.2022

hadeel-khibaiz

Main Concepts

  • Introduction
  • Features
    • Classification
    • Clustering
    • MLlib pipeline concept
  • Pros and cons
  • Installation
  • Demo/use cases
  • Conclusion
  • Resources

Introduction

MLlib is Apache Spark's machine learning library.

High-Level Goals

  • Make practical ML scalable and easy
  • Simplify the development and deployment of scalable machine learning pipelines

  • The primary ML API for Spark is now DataFrame-based, and the MLlib RDD-based API is in maintenance mode.
  • What are the implications?
    • MLlib will still support the RDD-based API in spark.mllib with bug fixes
    • MLlib will not add new features to the RDD-based API
    • The Spark 2.x documentation expected the RDD-based API to be removed in Spark 3.0; Spark 3.x releases still ship spark.mllib in maintenance mode
  • Why is MLlib switching to the DataFrame-based API?
    • DataFrames provide a more user-friendly API than RDDs
    • DataFrames facilitate practical ML pipelines, particularly feature transformations

Features

Classification

  • MLlib provides algorithms for classification, such as Decision Tree, Naive Bayes, …
  • Naive Bayes is a simple probabilistic classifier with strong independence assumptions between every pair of features

Clustering

  • K-means is one of the most commonly used clustering algorithms; it clusters data points into a predefined number of clusters
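To make the k-means idea concrete, here is a minimal plain-Python sketch of the classic assign-and-update loop (Lloyd's algorithm). It is not the MLlib implementation: the function name `kmeans`, the toy points, and the naive "first k points" initialisation are all assumptions chosen for illustration, and distributed libraries typically use more careful seeding.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, k, iterations=20):
    """Cluster `points` (tuples of floats) into k groups; returns (centroids, assignments)."""
    # Assumption: use the first k points as initial centroids (deterministic but naive).
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assignments


# Made-up toy data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

The number of clusters `k` is fixed up front, which is exactly the "predefined number of clusters" the slide refers to.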

MLlib Pipeline Concept

[Figure: pipeline diagram with the labels Load/Clean Data, Transformer, Estimator, Evaluator]

  • Evaluate the model performance
  • Help with automating the model tuning process
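The two points above (scoring a model, then using that score to automate tuning) can be sketched in plain Python. This is only an illustration of the idea, not MLlib code: `accuracy`, `threshold_classifier`, the toy data, and the parameter grid are hypothetical names and values. In Spark the corresponding roles are played by evaluator classes together with tuning helpers such as `CrossValidator`, which also split the data rather than scoring on a single set as done here.

```python
def accuracy(predictions, labels):
    """Evaluator: fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def threshold_classifier(threshold):
    """A trivial 'model': predicts 1 when the feature reaches the threshold."""
    return lambda xs: [1 if x >= threshold else 0 for x in xs]


# Made-up validation data: one numeric feature and a binary label per example.
features = [0.1, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 1, 1, 1]

# Automated tuning: score every candidate parameter and keep the best one.
param_grid = [0.2, 0.4, 0.7]
scores = {
    t: accuracy(threshold_classifier(t)(features), labels) for t in param_grid
}
best_threshold = max(scores, key=scores.get)
print(scores)
print(best_threshold)
```

The evaluator is what makes the search automatic: once every candidate can be reduced to a single comparable number, picking the best configuration needs no manual inspection.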

Pipeline

  • Represents an ML workflow
  • Consists of a sequence of stages
  • Leverages the uniform API of transformers and estimators
  • Is itself a type of estimator: fit()
  • Can be persisted
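The relationship between transformers, estimators, and a pipeline can be sketched in a few lines of plain Python. This mimics the pattern only and is not the MLlib API: the classes `Squarer`, `Scaler`, `ScalerModel`, `Pipeline`, and `PipelineModel` below are hypothetical stand-ins that operate on lists of numbers instead of DataFrames.

```python
class Squarer:
    """Transformer: maps data to data, nothing to learn."""

    def transform(self, rows):
        return [x * x for x in rows]


class Scaler:
    """Estimator: fit() learns min/max from data and returns a fitted transformer."""

    def fit(self, rows):
        return ScalerModel(min(rows), max(rows))


class ScalerModel:
    """Fitted transformer produced by Scaler.fit()."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform(self, rows):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero on constant data
        return [(x - self.lo) / span for x in rows]


class Pipeline:
    """A pipeline is itself an estimator: fit() returns a model made of fitted stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator stage becomes a fitted transformer
            rows = stage.transform(rows)  # feed the transformed data to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = Pipeline([Squarer(), Scaler()]).fit([1.0, 2.0, 3.0])
print(model.transform([2.0]))
```

Because every stage exposes the same `fit`/`transform` surface, the pipeline can chain stages without knowing what each one does, which is the "uniform API" the slide mentions.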

Extraction

  • Extracting features from raw data
  • Word2Vec is an estimator which takes sequences of words representing documents
  • The model maps each word to a unique fixed-size vector
  • Each document is then represented by the average of its word vectors; this vector can be used as features for prediction, document similarity calculations, …
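As a rough illustration of how word vectors become document features, the sketch below averages per-word vectors into one document vector and compares documents by cosine similarity. The three-dimensional `word_vectors` table is invented for the example; a trained Word2Vec model would learn such vectors from a corpus, and the helper names are assumptions, not MLlib functions.

```python
from math import sqrt

# Hypothetical word vectors (made-up values standing in for a trained model).
word_vectors = {
    "spark": (0.9, 0.1, 0.0),
    "mllib": (0.8, 0.2, 0.1),
    "cluster": (0.7, 0.0, 0.3),
    "recipe": (0.0, 0.9, 0.2),
    "pasta": (0.1, 0.8, 0.3),
}


def document_vector(words):
    """Average the vectors of all known words in a document."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return (0.0, 0.0, 0.0)  # no known words: fall back to the zero vector
    return tuple(sum(axis) / len(known) for axis in zip(*known))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


doc_a = document_vector(["spark", "mllib"])
doc_b = document_vector(["spark", "cluster"])
doc_c = document_vector(["pasta", "recipe"])

print(cosine_similarity(doc_a, doc_b))  # two Spark-related documents
print(cosine_similarity(doc_a, doc_c))  # a Spark document versus a cooking document
```

With these toy vectors the two Spark-related documents score much closer to each other than to the cooking document, which is the kind of signal a downstream similarity or prediction step would consume.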

Pros & Cons

  • Pros
    • Scalability
    • Performance
    • User-friendly APIs
    • Integration with Spark SQL, Streaming & GraphX
  • Cons
    • Configurability
    • Reliability
    • High memory consumption

Installing…

  • spark-2.3.0-bin-hadoop2.7.tgz
  • jdk-8u162-macosx-x64.dmg
  • scala-2.11.12.tgz

Coding