PRACTICAL MACHINE LEARNING, Lecture notes of Machine Learning

Typology: Lecture notes

2016/2017

Uploaded on 09/08/2017

Practical Machine Learning

Tackle the real-world complexities of modern machine learning with innovative and cutting-edge techniques

Sunila Gollapudi

BIRMINGHAM - MUMBAI

Credits

  • Author: Sunila Gollapudi
  • Reviewers: Rahul Agrawal, Rahul Jain, Ryota Kamoshida, Ravi Teja Kankanala, Dr. Jinfeng Yi
  • Commissioning Editor: Akram Hussain
  • Acquisition Editor: Sonali Vernekar
  • Content Development Editor: Sumeet Sawant
  • Technical Editor: Murtaza Tinwala
  • Copy Editor: Yesha Gangani
  • Project Coordinator: Shweta H Birwatkar
  • Proofreader: Safis Editing
  • Indexer: Tejal Daruwale Soni
  • Graphics: Jason Monteiro
  • Production Coordinator: Manu Joseph
  • Cover Work: Manu Joseph

About the Author

Sunila Gollapudi works as Vice President, Technology, with Broadridge Financial Solutions (India) Pvt. Ltd., a wholly owned subsidiary of the US-based Broadridge Financial Solutions Inc. (BR). She has close to 14 years of hands-on experience in the IT services space. She currently runs the Architecture Center of Excellence from India and plays a key role in the big data and data science initiatives. Prior to joining Broadridge, she held key positions at leading global organizations. She specializes in Java, distributed architecture, big data technologies, advanced analytics, Machine learning, semantic technologies, and data integration tools. Sunila represents Broadridge in global technology leadership and innovation forums, the most recent being at IEEE for her work on semantic technologies and their role in business data lakes. Sunila's signature strength is her ability to stay connected with the ever-changing global technology landscape, where new technologies mushroom rapidly, connect the dots, and architect practical solutions for business delivery. A postgraduate in computer science, she published her first book, Getting Started with Greenplum for Big Data Analytics, Packt Publishing, on Greenplum, a big data warehouse solution. She is a noted Indian classical dancer at both national and international levels and a painting artist, in addition to being a mother and a wife.

Acknowledgments

At the outset, I would like to express my sincere gratitude to Broadridge Financial Solutions (India) Pvt. Ltd., for providing the platform to pursue my passion in the field of technology. My heartfelt thanks to Laxmikanth V, my mentor and Managing Director of the firm, for his continued support and the foreword for this book; to Dr. Dakshinamurthy Kolluru, President, International School of Engineering (INSOFE), for helping me discover my love for Machine learning; and to Mr. Nagaraju Pappu, Founder & Chief Architect, Canopus Consulting, for being my mentor in Enterprise Architecture. This acknowledgement is incomplete without a special mention of Packt Publications for giving me the opportunity to outline and conceptualize this book, and for providing complete support in releasing it. This is my second publication with them, and again it is a pleasure to work with a highly professional crew and the expert reviewers. My thanks also go to my husband, family, and friends for their continued support, as always. The one person to whom I owe the most is my lovely and understanding daughter Sai Nikita, who was as excited as I was throughout the journey of writing this book. I only wish there were more than 24 hours in a day, so that I could have spent all that extra time with you, Niki! Lastly, this book is a humble submission to all the restless minds in the technology world for their relentless pursuit to build something new every single day that makes the lives of people better and more exciting.

Ryota Kamoshida is the maintainer of the Python library MALSS (https://github.com/canard0328/malss) and now works as a researcher in computer science at a Japanese company.

Ravi Teja Kankanala is a Machine learning expert who loves making sense of large amounts of data and predicting trends through advanced algorithms. At Xlabs, he leads all research and data product development efforts, addressing the healthcare and market research domains. Prior to that, he developed data science products for various use cases in the telecom sector at Ericsson R&D. Ravi did his BTech in computer science at IIT Madras.

Dr. Jinfeng Yi is a Research Staff Member at IBM's Thomas J. Watson Research Center, concentrating on data analytics for complex real-world applications. His research interests lie in Machine learning and its application to various domains, including recommender systems, crowdsourcing, social computing, and spatio-temporal analysis. Jinfeng is particularly interested in developing theoretically principled and practically efficient algorithms for learning from massive datasets. He has published over 15 papers in top Machine learning and data mining venues, such as ICML, NIPS, KDD, AAAI, and ICDM. He also holds multiple US and international patents related to large-scale data management, electronic discovery, spatio-temporal analysis, and privacy-preserving data sharing.

I dedicate this work of mine to my father G V L N Sastry, and my mother, late G Vijayalakshmi. I wouldn't have been what I am today without your perseverance, love, and confidence in me.

Table of Contents

  • Preface
  • Chapter 1: Introduction to Machine learning
    • Machine learning
      • Definition
      • Core Concepts and Terminology
      • What is learning?
        • Data
        • Labeled and unlabeled data
        • Tasks
        • Algorithms
        • Models
      • Data and inconsistencies in Machine learning
        • Under-fitting
        • Over-fitting
        • Data instability
        • Unpredictable data formats
      • Practical Machine learning examples
      • Types of learning problems
        • Classification
        • Clustering
        • Forecasting, prediction or regression
        • Simulation
        • Optimization
        • Supervised learning
        • Unsupervised learning
        • Semi-supervised learning
        • Reinforcement learning
        • Deep learning
    • Performance measures
      • Is the solution good?
        • Mean squared error (MSE)
        • Mean absolute error (MAE)
        • Normalized MSE and MAE (NMSE and NMAE)
        • Solving the errors: bias and variance
    • Some complementing fields of Machine learning
      • Data mining
      • Artificial intelligence (AI)
      • Statistical learning
      • Data science
    • Machine learning process lifecycle and solution architecture
    • Machine learning algorithms
      • Decision tree based algorithms
      • Bayesian method based algorithms
      • Kernel method based algorithms
      • Clustering methods
      • Artificial neural networks (ANN)
      • Dimensionality reduction
      • Ensemble methods
      • Instance based learning algorithms
      • Regression analysis based algorithms
      • Association rule based learning algorithms
    • Machine learning tools and frameworks
    • Summary
  • Chapter 2: Machine learning and Large-scale datasets
    • Big data and the context of large-scale Machine learning
      • Functional versus Structural – A methodological mismatch
        • Commoditizing information
        • Theoretical limitations of RDBMS
        • Scaling-up versus Scaling-out storage
        • Distributed and parallel computing strategies
      • Machine learning: Scalability and Performance
        • Too many data points or instances
        • Too many attributes or features
        • Shrinking response time windows – need for real-time responses
        • Highly complex algorithm
        • Feed forward, iterative prediction cycles
      • Model selection process
      • Potential issues in large-scale Machine learning
    • Algorithms and Concurrency
      • Developing concurrent algorithms
    • Technology and implementation options for scaling-up Machine learning
      • MapReduce programming paradigm
      • High Performance Computing (HPC) with Message Passing Interface (MPI)
      • Language Integrated Queries (LINQ) framework
      • Manipulating datasets with LINQ
      • Graphics Processing Unit (GPU)
      • Field Programmable Gate Array (FPGA)
      • Multicore or multiprocessor systems
    • Summary
  • Chapter 3: An Introduction to Hadoop's Architecture and Ecosystem
    • Introduction to Apache Hadoop
      • Evolution of Hadoop (the platform of choice)
      • Hadoop and its core elements
    • Machine learning solution architecture for big data (employing Hadoop)
      • The Data Source layer
      • The Ingestion layer
      • The Hadoop Storage layer
      • The Hadoop (Physical) Infrastructure layer – supporting appliance
      • Hadoop platform / Processing layer
      • The Analytics layer
      • The Consumption layer
        • Explaining and exploring data with Visualizations
        • Security and Monitoring layer
        • Hadoop core components framework
        • Writing to and reading from HDFS
        • Handling failures
        • HDFS command line
        • RESTful HDFS
      • MapReduce
        • MapReduce architecture
        • What makes MapReduce cater to the needs of large datasets?
        • MapReduce execution flow and components
        • Developing MapReduce components
    • Hadoop 2.x
      • Hadoop ecosystem components
      • Hadoop installation and setup
        • Installing JDK 1.7
        • Creating a system user for Hadoop (dedicated)
        • Disable IPv6
        • Steps for installing Hadoop 2.6.0
        • Starting Hadoop
      • Hadoop distributions and vendors
    • Summary
  • Chapter 4: Machine Learning Tools, Libraries, and Frameworks
    • Machine learning tools – A landscape
    • Apache Mahout
      • How does Mahout work?
      • Installing and setting up Apache Mahout
        • Setting up Maven
        • Setting up Apache Mahout using Eclipse IDE
        • Setting up Apache Mahout without Eclipse
      • Mahout Packages
      • Implementing vectors in Mahout
    • R
      • Installing and setting up R
      • Integrating R with Apache Hadoop
        • Approach 1 – Using R and Streaming APIs in Hadoop
        • Approach 2 – Using the Rhipe package of R
        • Approach 3 – Using RHadoop
        • Summary of R/Hadoop integration approaches
        • Implementing in R (using examples)
    • Julia
      • Installing and setting up Julia
        • Downloading and using the command line version of Julia
        • Using Juno IDE for running Julia
        • Using Julia via the browser
      • Running the Julia code from the command line
      • Implementing in Julia (with examples)
      • Using variables and assignments
        • Numeric primitives
        • Data structures
        • Working with Strings and String manipulations
        • Packages
        • Interoperability
        • Graphics and plotting
      • Benefits of adopting Julia
      • Integrating Julia and Hadoop
    • Python
      • Toolkit options in Python
      • Implementation of Python (using examples)
        • Installing Python and setting up scikit-learn
    • Apache Spark
      • Scala
      • Programming with Resilient Distributed Datasets (RDD)
    • Spring XD
    • Summary
  • Chapter 5: Decision Tree based learning
    • Decision trees
      • Terminology
      • Purpose and uses
      • Constructing a Decision tree
        • Handling missing values
        • Considerations for constructing Decision trees
        • Decision trees in a graphical representation
        • Inducing Decision trees – Decision tree algorithms
        • Greedy Decision trees
        • Benefits of Decision trees
      • Specialized trees
        • Oblique trees
        • Random forests
        • Evolutionary trees
        • Hellinger trees
    • Implementing Decision trees
      • Using Mahout
      • Using R
      • Using Spark
      • Using Python (scikit-learn)
      • Using Julia
    • Summary
  • Chapter 6: Instance and Kernel Methods Based Learning
    • Instance-based learning (IBL)
      • Nearest Neighbors
        • Value of k in KNN
        • Distance measures in KNN
        • Case-based reasoning (CBR)
        • Locally weighted regression (LWR)
      • Implementing KNN
        • Using Mahout
        • Using R