Machine Learning Using R
by: Karthik Ramasubramanian, Abhishek Singh
Apress, 2016
ISBN: 9781484223345, 580 pages
Format: PDF, online reading
Copy protection: watermark
Price: 42.79 EUR
Contents at a Glance  5
Contents  6
About the Authors  17
About the Technical Reviewer  19
Acknowledgments  20
Chapter 1: Introduction to Machine Learning and R  21
1.1 Understanding the Evolution  22
1.1.1 Statistical Learning  22
1.1.2 Machine Learning (ML)  23
1.1.3 Artificial Intelligence (AI)  23
1.1.4 Data Mining  24
1.1.5 Data Science  25
1.2 Probability and Statistics  26
1.2.1 Counting and Probability Definition  27
1.2.2 Events and Relationships  29
1.2.2.1 Independent Events  29
1.2.2.2 Conditional Independence  30
1.2.2.3 Bayes' Theorem  30
1.2.3 Randomness, Probability, and Distributions  32
1.2.4 Confidence Interval and Hypothesis Testing  33
1.2.4.1 Confidence Interval  34
1.2.4.2 Hypothesis Testing  35
1.3 Getting Started with R  38
1.3.1 Basic Building Blocks  38
1.3.1.1 Calculations  38
1.3.1.2 Statistics with R  39
1.3.1.3 Packages  39
1.3.2 Data Structures in R  39
1.3.2.1 Vectors  40
1.3.2.2 List  40
1.3.2.3 Matrix  40
1.3.2.4 Data Frame  41
1.3.3 Subsetting  41
1.3.3.1 Vectors  41
1.3.3.2 Lists  42
1.3.3.3 Matrices  42
1.3.3.4 Data Frames  43
1.3.4 Functions and Apply Family  43
1.4 Machine Learning Process Flow  46
1.4.1 Plan  46
1.4.2 Explore  46
1.4.3 Build  47
1.4.4 Evaluate  47
1.5 Other Technologies  48
1.6 Summary  48
1.7 References  48
Chapter 2: Data Preparation and Exploration  50
2.1 Planning the Gathering of Data  51
2.1.1 Variable Types  51
2.1.1.1 Categorical Variables  51
2.1.1.2 Continuous Variables  52
2.1.2 Data Formats  52
2.1.2.1 Comma-Separated Values  53
2.1.2.2 Microsoft Excel  53
2.1.2.3 Extensible Markup Language: XML  53
2.1.2.4 Hypertext Markup Language: HTML  55
2.1.2.5 JSON  57
2.1.2.6 Other Formats  59
2.1.3 Data Sources  59
2.1.3.1 Structured  59
2.1.3.2 Semi-Structured  59
2.1.3.3 Unstructured  59
2.2 Initial Data Analysis (IDA)  60
2.2.1 Discerning a First Look  60
2.2.1.1 Function str()  60
2.2.1.2 Naming Convention: make.names()  61
2.2.1.3 table(): Pattern or Trend  62
2.2.2 Organizing Multiple Sources of Data into One  62
2.2.2.1 Merge and dplyr Joins  62
2.2.2.1.1 Using merge  63
2.2.2.1.2 Using dplyr  64
2.2.3 Cleaning the Data  65
2.2.3.1 Correcting Factor Variables  65
2.2.3.2 Dealing with NAs  66
2.2.3.3 Dealing with Dates and Times  67
2.2.3.3.1 Time Zone  68
2.2.3.3.2 Daylight Saving Time  68
2.2.4 Supplementing with More Information  68
2.2.4.1 Derived Variables  69
2.2.4.2 n-day Averages  69
2.2.5 Reshaping  69
2.3 Exploratory Data Analysis  70
2.3.1 Summary Statistics  71
2.3.1.1 Quantile  71
2.3.1.2 Mean  72
2.3.1.3 Frequency Plot  73
2.3.1.4 Boxplot  73
2.3.2 Moments  74
2.3.2.1 Variance  75
2.3.2.2 Skewness  76
2.3.2.3 Kurtosis  78
2.4 Case Study: Credit Card Fraud  80
2.4.1 Data Import  80
2.4.2 Data Transformation  81
2.4.3 Data Exploration  82
2.5 Summary  84
2.6 References  84
Chapter 3: Sampling and Resampling Techniques  85
3.1 Introduction to Sampling  86
3.2 Sampling Terminology  87
3.2.1 Sample  87
3.2.2 Sampling Distribution  88
3.2.3 Population Mean and Variance  88
3.2.4 Sample Mean and Variance  88
3.2.5 Pooled Mean and Variance  88
3.2.6 Sample Point  89
3.2.7 Sampling Error  89
3.2.8 Sampling Fraction  90
3.2.9 Sampling Bias  90
3.2.10 Sampling Without Replacement (SWOR)  90
3.2.11 Sampling with Replacement (SWR)  90
3.3 Credit Card Fraud: Population Statistics  91
3.3.1 Data Description  91
3.3.2 Population Mean  92
3.3.3 Population Variance  92
3.3.4 Pooled Mean and Variance  93
3.4 Business Implications of Sampling  96
3.4.1 Features of Sampling  97
3.4.2 Shortcomings of Sampling  97
3.5 Probability and Non-Probability Sampling  97
3.5.1 Types of Non-Probability Sampling  98
3.5.1.1 Convenience Sampling  98
3.5.1.2 Purposive Sampling  99
3.5.1.3 Quota Sampling  99
3.6 Statistical Theory on Sampling Distributions  99
3.6.1 Law of Large Numbers: LLN  99
3.6.1.1 Weak Law of Large Numbers  100
3.6.1.2 Strong Law of Large Numbers  100
3.6.1.3 Steps in Simulation with R Code  101
3.6.2 Central Limit Theorem  103
3.6.2.1 Steps in Simulation with R Code  103
3.7 Probability Sampling Techniques  107
3.7.1 Population Statistics  107
3.7.2 Simple Random Sampling  111
3.7.3 Systematic Random Sampling  118
3.7.4 Stratified Random Sampling  122
3.7.5 Cluster Sampling  129
3.7.6 Bootstrap Sampling  135
3.8 Monte Carlo Method: Acceptance-Rejection Method  142
3.9 A Qualitative Account of Computational Savings by Sampling  144
3.10 Summary  145
Chapter 4: Data Visualization in R  146
4.1 Introduction to the ggplot2 Package  147
4.2 World Development Indicators  149
4.3 Line Charts  149
4.4 Stacked Column Charts  155
4.5 Scatterplots  161
4.6 Boxplots  162
4.7 Histograms and Density Plots  165
4.8 Pie Charts  169
4.9 Correlation Plots  171
4.10 Heat Maps  173
4.11 Bubble Charts  175
4.12 Waterfall Charts  179
4.13 Dendrograms  182
4.14 Word Clouds  184
4.15 Sankey Plots  186
4.16 Time Series Graphs  187
4.17 Cohort Diagrams  189
4.18 Spatial Maps  191
4.19 Summary  195
4.20 References  196
Chapter 5: Feature Engineering  197
5.1 Introduction to Feature Engineering  198
5.1.1 Filter Methods  200
5.1.2 Wrapper Methods  200
5.1.3 Embedded Methods  200
5.2 Understanding the Working Data  201
5.2.1 Data Summary  202
5.2.2 Properties of the Dependent Variable  202
5.2.3 Feature Availability: Continuous or Categorical  205
5.2.4 Setting Up Data Assumptions  207
5.3 Feature Ranking  207
5.4 Variable Subset Selection  211
5.4.1 Filter Methods  211
5.4.2 Wrapper Methods  215
5.4.3 Embedded Methods  222
5.5 Dimensionality Reduction  226
5.6 Feature Engineering Checklist  231
5.7 Summary  233
5.8 References  233
Chapter 6: Machine Learning Theory and Practices  234
6.1 Machine Learning Types  237
6.1.1 Supervised Learning  237
6.1.2 Unsupervised Learning  238
6.1.3 Semi-Supervised Learning  238
6.1.4 Reinforcement Learning  238
6.2 Groups of Machine Learning Algorithms  239
6.3 Real-World Datasets  244
6.3.1 House Sale Prices  244
6.3.2 Purchase Preference  245
6.3.3 Twitter Feeds and Articles  246
6.3.4 Breast Cancer  246
6.3.5 Market Basket  247
6.3.6 Amazon Food Review  247
6.4 Regression Analysis  248
6.5 Correlation Analysis  250
6.5.1 Linear Regression  253
6.5.1.2 Best Linear Predictors  254
6.5.2 Simple Linear Regression  256
6.5.3 Multiple Linear Regression  259
6.5.4 Model Diagnostics: Linear Regression  262
6.5.4.1 Influential Point Analysis  263
6.5.4.2 Normality of Residuals  267
6.5.4.3 Multicollinearity  269
6.5.4.4 Residual Autocorrelation  271
6.5.4.5 Homoscedasticity  273
6.5.5 Polynomial Regression  276
6.5.6 Logistic Regression  280
6.5.7 Logit Transformation  281
6.5.8 Odds Ratio  282
6.5.8.1 Binomial Logistic Model  284
6.5.9 Model Diagnostics: Logistic Regression  290
6.5.9.1 Wald Test  290
6.5.9.2 Deviance  291
6.5.9.3 Pseudo R-Square  292
6.5.9.4 Bivariate Plots  293
6.5.9.5 Cumulative Gains and Lift Charts  296
6.5.9.6 Concordant and Discordant Ratios  299
6.5.10 Multinomial Logistic Regression  300
6.5.11 Generalized Linear Models  304
6.5.12 Conclusion  305
6.6 Support Vector Machine (SVM)  305
6.6.1 Linear SVM  307
6.6.1.1 Hard Margins  307
6.6.1.2 Soft Margins  307
6.6.2 Binary SVM Classifier  308
6.6.3 Multi-Class SVM  310
6.6.4 Conclusion  312
6.7 Decision Trees  312
6.7.1 Types of Decision Trees  313
6.7.1.1 Regression Trees  314
6.7.1.2 Classification Trees  315
6.7.2 Decision Measures  315
6.7.2.1 Gini Index  315
6.7.2.2 Entropy  316
6.7.2.3 Information Gain  317
6.7.3 Decision Tree Learning Methods  317
6.7.3.1 Iterative Dichotomizer 3  319
6.7.3.2 C5.0 Algorithm  322
6.7.3.3 Classification and Regression Tree: CART  327
6.7.3.4 Chi-Square Automated Interaction Detection: CHAID  330
6.7.4 Ensemble Trees  336
6.7.4.1 Boosting  336
6.7.4.2 Bagging  338
6.7.4.2.1 Bagging CART  339
6.7.4.2.2 Random Forest  341
6.7.5 Conclusion  344
6.8 The Naive Bayes Method  345
6.8.1 Conditional Probability  345
6.8.2 Bayes' Theorem  345
6.8.3 Prior Probability  346
6.8.4 Posterior Probability  346
6.8.5 Likelihood and Marginal Likelihood  346
6.8.6 Naive Bayes Methods  347
6.8.7 Conclusion  352
6.9 Cluster Analysis  352
6.9.1 Introduction to Clustering  353
6.9.2 Clustering Algorithms  354
6.9.2.1 Hierarchical Clustering  356
6.9.2.2 Centroid-Based Clustering  359
6.9.2.3 Distribution-Based Clustering  362
6.9.2.4 Density-Based Clustering  364
6.9.3 Internal Evaluation  366
6.9.3.1 Dunn Index  366
6.9.3.2 Silhouette Coefficient  367
6.9.4 External Evaluation  368
6.9.4.1 Rand Measure  368
6.9.4.2 Jaccard Index  369
6.9.5 Conclusion  369
6.10 Association Rule Mining  369
6.10.1 Introduction to Association Concepts  370
6.10.1.1 Support  370
6.10.1.2 Confidence  371
6.10.1.3 Lift  371
6.10.2 Rule-Mining Algorithms  372
6.10.2.1 Apriori  375
6.10.2.2 Eclat  377
6.10.3 Recommendation Algorithms  379
6.10.3.1 User-Based Collaborative Filtering (UBCF)  380
6.10.3.2 Item-Based Collaborative Filtering (IBCF)  381
6.10.4 Conclusion  387
6.11 Artificial Neural Networks  387
6.11.1 Human Cognitive Learning  387
6.11.2 Perceptron  389
6.11.3 Sigmoid Neuron  392
6.11.4 Neural Network Architecture  392
6.11.5 Supervised versus Unsupervised Neural Nets  394
6.11.6 Neural Network Learning Algorithms  395
6.11.6.1 Evolutionary Methods  396
6.11.6.2 Gene Expression Programming  396
6.11.6.3 Simulated Annealing  396
6.11.6.4 Expectation Maximization  397
6.11.6.5 Non-Parametric Methods  397
6.11.6.6 Particle Swarm Optimization  397
6.11.7 Feed-Forward Back-Propagation  397
6.11.7.1 Purchase Prediction: Neural Network-Based Classification  399
6.11.8 Deep Learning  404
6.11.9 Conclusion  411
6.12 Text-Mining Approaches  411
6.12.1 Introduction to Text Mining  412
6.12.2 Text Summarization  413
6.12.3 TF-IDF  415
6.12.4 Part-of-Speech (POS) Tagging  417
6.12.5 Word Cloud  421
6.12.6 Text Analysis: Microsoft Cognitive Services  422
6.12.7 Conclusion  432
6.13 Online Machine Learning Algorithms  432
6.13.1 Fuzzy C-Means Clustering  434
6.13.2 Conclusion  437
6.14 Model Building Checklist  437
6.15 Summary  438
6.16 References  438
Chapter 7: Machine Learning Model Evaluation  440
7.1 Dataset  441
7.1.1 House Sale Prices  441
7.1.2 Purchase Preference  443
7.2 Introduction to Model Performance and Evaluation  445
7.3 Objectives of Model Performance Evaluation  446
7.4 Population Stability Index  447
7.5 Model Evaluation for Continuous Output  452
7.5.1 Mean Absolute Error  454
7.5.2 Root Mean Square Error  456
7.5.3 R-Square  457
7.6 Model Evaluation for Discrete Output  460
7.6.1 Classification Matrix  461
7.6.2 Sensitivity and Specificity  466
7.6.3 Area Under ROC Curve  467
7.7 Probabilistic Techniques  470
7.7.1 K-Fold Cross Validation  471
7.7.2 Bootstrap Sampling  473
7.8 The Kappa Error Metric  474
7.9 Summary  478
7.10 References  479
Chapter 8: Model Performance Improvement  480
8.1 Machine Learning and Statistical Modeling  481
8.2 Overview of the Caret Package  483
8.3 Introduction to Hyper-Parameters  485
8.4 Hyper-Parameter Optimization  489
8.4.1 Manual Search  490
8.4.2 Manual Grid Search  492
8.4.3 Automatic Grid Search  494
8.4.4 Optimal Search  496
8.4.5 Random Search  498
8.4.6 Custom Searching  500
8.5 The Bias and Variance Tradeoff  503
8.5.1 Bagging or Bootstrap Aggregation  507
8.5.2 Boosting  508
8.6 Introduction to Ensemble Learning  508
8.6.1 Voting Ensembles  509
8.6.2 Advanced Methods in Ensemble Learning  510
8.6.2.1 Bagging  510
8.6.2.2 Boosting  512
8.7 Ensemble Techniques Illustration in R  513
8.7.1 Bagging Trees  513
8.7.2 Gradient Boosting with a Decision Tree  515
8.7.3 Blending KNN and rpart  520
8.7.4 Stacking Using caretEnsemble  521
8.8 Advanced Topic: Bayesian Optimization of Machine Learning Models  526
8.9 Summary  531
8.10 References  532
Chapter 9: Scalable Machine Learning and Related Technologies  533
9.1 Distributed Processing and Storage  534
9.1.1 Google File System (GFS)  534
9.1.2 MapReduce  536
9.1.3 Parallel Execution in R  537
9.1.3.1 Setting the Cores  537
9.1.3.2 Problem Statement  538
9.1.3.3 Building the Model: Serial  539
9.1.3.4 Building the Model: Parallel  539
9.1.3.5 Stopping the Clusters  540
9.2 The Hadoop Ecosystem  540
9.2.1 MapReduce  541
9.2.1.1 MapReduce Example: Word Count  541
9.2.2 Hive  545
9.2.2.1 Creating Tables  546
9.2.2.2 Describing Tables  546
9.2.2.3 Generating Data and Storing It in a Local File  547
9.2.2.4 Loading the Data into the Hive Table  547
9.2.2.5 Selecting a Query  548
9.2.3 Apache Pig  549
9.2.3.1 Connecting to Pig  549
9.2.3.2 Loading the Data  550
9.2.3.3 Tokenizing Each Line  550
9.2.3.4 Flattening the Tokens  551
9.2.3.5 Grouping the Words  551
9.2.3.6 Counting and Sorting  552
9.2.4 HBase  552
9.2.4.1 Starting HBase  553
9.2.4.2 Creating the Table and Putting Data  553
9.2.4.3 Scanning the Data  554
9.2.5 Spark  554
9.3 Machine Learning in R with Spark  555
9.3.1 Setting the Environment Variable  556
9.3.2 Initializing the Spark Session  556
9.3.3 Loading the Data and Running the Pre-Process  556
9.3.4 Creating a SparkDataFrame  557
9.3.5 Building the ML Model  558
9.3.6 Predicting on the Test Data  559
9.3.7 Stopping the SparkR Session  560
9.4 Machine Learning in R with H2O  560
9.4.1 Installation of Packages  561
9.4.2 Initialization of H2O Clusters  561
9.4.3 Deep Learning Demo in R with H2O  562
9.4.3.1 Running the Demo  563
9.4.3.2 Loading the Testing Data  563
9.5 Summary  567
9.6 References  568
Index  569