Machine Learning Using R

By: Karthik Ramasubramanian, Abhishek Singh

Apress, 2016

ISBN: 9781484223345, 580 pages

Format: PDF, online reading

Copy protection: watermark

Supported devices: Mac OS X, Windows PC, all DRM-capable eReaders, Apple iPad, Android tablet PCs. Online reading for: Mac OS X, Linux, Windows PC

Price: 42.79 EUR


Contents at a Glance
Contents
About the Authors
About the Technical Reviewer
Acknowledgments

Chapter 1: Introduction to Machine Learning and R
  1.1 Understanding the Evolution
    1.1.1 Statistical Learning
    1.1.2 Machine Learning (ML)
    1.1.3 Artificial Intelligence (AI)
    1.1.4 Data Mining
    1.1.5 Data Science
  1.2 Probability and Statistics
    1.2.1 Counting and Probability Definition
    1.2.2 Events and Relationships
      1.2.2.1 Independent Events
      1.2.2.2 Conditional Independence
      1.2.2.3 Bayes' Theorem
    1.2.3 Randomness, Probability, and Distributions
    1.2.4 Confidence Interval and Hypothesis Testing
      1.2.4.1 Confidence Interval
      1.2.4.2 Hypothesis Testing
  1.3 Getting Started with R
    1.3.1 Basic Building Blocks
      1.3.1.1 Calculations
      1.3.1.2 Statistics with R
      1.3.1.3 Packages
    1.3.2 Data Structures in R
      1.3.2.1 Vectors
      1.3.2.2 Lists
      1.3.2.3 Matrices
      1.3.2.4 Data Frames
    1.3.3 Subsetting
      1.3.3.1 Vectors
      1.3.3.2 Lists
      1.3.3.3 Matrices
      1.3.3.4 Data Frames
    1.3.4 Functions and the Apply Family
  1.4 Machine Learning Process Flow
    1.4.1 Plan
    1.4.2 Explore
    1.4.3 Build
    1.4.4 Evaluate
  1.5 Other Technologies
  1.6 Summary
  1.7 References

Chapter 2: Data Preparation and Exploration
  2.1 Planning the Gathering of Data
    2.1.1 Variable Types
      2.1.1.1 Categorical Variables
      2.1.1.2 Continuous Variables
    2.1.2 Data Formats
      2.1.2.1 Comma-Separated Values
      2.1.2.2 Microsoft Excel
      2.1.2.3 Extensible Markup Language: XML
      2.1.2.4 Hypertext Markup Language: HTML
      2.1.2.5 JSON
      2.1.2.6 Other Formats
    2.1.3 Data Sources
      2.1.3.1 Structured
      2.1.3.2 Semi-Structured
      2.1.3.3 Unstructured
  2.2 Initial Data Analysis (IDA)
    2.2.1 Discerning a First Look
      2.2.1.1 Function str()
      2.2.1.2 Naming Convention: make.names()
      2.2.1.3 table(): Pattern or Trend
    2.2.2 Organizing Multiple Sources of Data into One
      2.2.2.1 Merge and dplyr Joins
        2.2.2.1.1 Using merge
        2.2.2.1.2 Using dplyr
    2.2.3 Cleaning the Data
      2.2.3.1 Correcting Factor Variables
      2.2.3.2 Dealing with NAs
      2.2.3.3 Dealing with Dates and Times
        2.2.3.3.1 Time Zone
        2.2.3.3.2 Daylight Saving Time
    2.2.4 Supplementing with More Information
      2.2.4.1 Derived Variables
      2.2.4.2 n-day Averages
    2.2.5 Reshaping
  2.3 Exploratory Data Analysis
    2.3.1 Summary Statistics
      2.3.1.1 Quantile
      2.3.1.2 Mean
      2.3.1.3 Frequency Plot
      2.3.1.4 Boxplot
    2.3.2 Moments
      2.3.2.1 Variance
      2.3.2.2 Skewness
      2.3.2.3 Kurtosis
  2.4 Case Study: Credit Card Fraud
    2.4.1 Data Import
    2.4.2 Data Transformation
    2.4.3 Data Exploration
  2.5 Summary
  2.6 References

Chapter 3: Sampling and Resampling Techniques
  3.1 Introduction to Sampling
  3.2 Sampling Terminology
    3.2.1 Sample
    3.2.2 Sampling Distribution
    3.2.3 Population Mean and Variance
    3.2.4 Sample Mean and Variance
    3.2.5 Pooled Mean and Variance
    3.2.6 Sample Point
    3.2.7 Sampling Error
    3.2.8 Sampling Fraction
    3.2.9 Sampling Bias
    3.2.10 Sampling Without Replacement (SWOR)
    3.2.11 Sampling With Replacement (SWR)
  3.3 Credit Card Fraud: Population Statistics
    3.3.1 Data Description
    3.3.2 Population Mean
    3.3.3 Population Variance
    3.3.4 Pooled Mean and Variance
  3.4 Business Implications of Sampling
    3.4.1 Features of Sampling
    3.4.2 Shortcomings of Sampling
  3.5 Probability and Non-Probability Sampling
    3.5.1 Types of Non-Probability Sampling
      3.5.1.1 Convenience Sampling
      3.5.1.2 Purposive Sampling
      3.5.1.3 Quota Sampling
  3.6 Statistical Theory on Sampling Distributions
    3.6.1 Law of Large Numbers: LLN
      3.6.1.1 Weak Law of Large Numbers
      3.6.1.2 Strong Law of Large Numbers
      3.6.1.3 Steps in Simulation with R Code
    3.6.2 Central Limit Theorem
      3.6.2.1 Steps in Simulation with R Code
  3.7 Probability Sampling Techniques
    3.7.1 Population Statistics
    3.7.2 Simple Random Sampling
    3.7.3 Systematic Random Sampling
    3.7.4 Stratified Random Sampling
    3.7.5 Cluster Sampling
    3.7.6 Bootstrap Sampling
  3.8 Monte Carlo Method: Acceptance-Rejection Method
  3.9 A Qualitative Account of Computational Savings by Sampling
  3.10 Summary

Chapter 4: Data Visualization in R
  4.1 Introduction to the ggplot2 Package
  4.2 World Development Indicators
  4.3 Line Charts
  4.4 Stacked Column Charts
  4.5 Scatterplots
  4.6 Boxplots
  4.7 Histograms and Density Plots
  4.8 Pie Charts
  4.9 Correlation Plots
  4.10 Heat Maps
  4.11 Bubble Charts
  4.12 Waterfall Charts
  4.13 Dendrograms
  4.14 Word Clouds
  4.15 Sankey Plots
  4.16 Time Series Graphs
  4.17 Cohort Diagrams
  4.18 Spatial Maps
  4.19 Summary
  4.20 References

Chapter 5: Feature Engineering
  5.1 Introduction to Feature Engineering
    5.1.1 Filter Methods
    5.1.2 Wrapper Methods
    5.1.3 Embedded Methods
  5.2 Understanding the Working Data
    5.2.1 Data Summary
    5.2.2 Properties of the Dependent Variable
    5.2.3 Feature Availability: Continuous or Categorical
    5.2.4 Setting Up Data Assumptions
  5.3 Feature Ranking
  5.4 Variable Subset Selection
    5.4.1 Filter Methods
    5.4.2 Wrapper Methods
    5.4.3 Embedded Methods
  5.5 Dimensionality Reduction
  5.6 Feature Engineering Checklist
  5.7 Summary
  5.8 References

Chapter 6: Machine Learning Theory and Practices
  6.1 Machine Learning Types
    6.1.1 Supervised Learning
    6.1.2 Unsupervised Learning
    6.1.3 Semi-Supervised Learning
    6.1.4 Reinforcement Learning
  6.2 Groups of Machine Learning Algorithms
  6.3 Real-World Datasets
    6.3.1 House Sale Prices
    6.3.2 Purchase Preference
    6.3.3 Twitter Feeds and Articles
    6.3.4 Breast Cancer
    6.3.5 Market Basket
    6.3.6 Amazon Food Review
  6.4 Regression Analysis
  6.5 Correlation Analysis
    6.5.1 Linear Regression
      6.5.1.2 Best Linear Predictors
    6.5.2 Simple Linear Regression
    6.5.3 Multiple Linear Regression
    6.5.4 Model Diagnostics: Linear Regression
      6.5.4.1 Influential Point Analysis
      6.5.4.2 Normality of Residuals
      6.5.4.3 Multicollinearity
      6.5.4.4 Residual Autocorrelation
      6.5.4.5 Homoscedasticity
    6.5.5 Polynomial Regression
    6.5.6 Logistic Regression
    6.5.7 Logit Transformation
    6.5.8 Odds Ratio
      6.5.8.1 Binomial Logistic Model
    6.5.9 Model Diagnostics: Logistic Regression
      6.5.9.1 Wald Test
      6.5.9.2 Deviance
      6.5.9.3 Pseudo R-Square
      6.5.9.4 Bivariate Plots
      6.5.9.5 Cumulative Gains and Lift Charts
      6.5.9.6 Concordant and Discordant Ratios
    6.5.10 Multinomial Logistic Regression
    6.5.11 Generalized Linear Models
    6.5.12 Conclusion
  6.6 Support Vector Machine (SVM)
    6.6.1 Linear SVM
      6.6.1.1 Hard Margins
      6.6.1.2 Soft Margins
    6.6.2 Binary SVM Classifier
    6.6.3 Multi-Class SVM
    6.6.4 Conclusion
  6.7 Decision Trees
    6.7.1 Types of Decision Trees
      6.7.1.1 Regression Trees
      6.7.1.2 Classification Trees
    6.7.2 Decision Measures
      6.7.2.1 Gini Index
      6.7.2.2 Entropy
      6.7.2.3 Information Gain
    6.7.3 Decision Tree Learning Methods
      6.7.3.1 Iterative Dichotomizer 3
      6.7.3.2 C5.0 Algorithm
      6.7.3.3 Classification and Regression Tree: CART
      6.7.3.4 Chi-Square Automatic Interaction Detection: CHAID
    6.7.4 Ensemble Trees
      6.7.4.1 Boosting
      6.7.4.2 Bagging
        6.7.4.2.1 Bagging CART
        6.7.4.2.2 Random Forest
    6.7.5 Conclusion
  6.8 The Naive Bayes Method
    6.8.1 Conditional Probability
    6.8.2 Bayes' Theorem
    6.8.3 Prior Probability
    6.8.4 Posterior Probability
    6.8.5 Likelihood and Marginal Likelihood
    6.8.6 Naive Bayes Methods
    6.8.7 Conclusion
  6.9 Cluster Analysis
    6.9.1 Introduction to Clustering
    6.9.2 Clustering Algorithms
      6.9.2.1 Hierarchical Clustering
      6.9.2.2 Centroid-Based Clustering
      6.9.2.3 Distribution-Based Clustering
      6.9.2.4 Density-Based Clustering
    6.9.3 Internal Evaluation
      6.9.3.1 Dunn Index
      6.9.3.2 Silhouette Coefficient
    6.9.4 External Evaluation
      6.9.4.1 Rand Measure
      6.9.4.2 Jaccard Index
    6.9.5 Conclusion
  6.10 Association Rule Mining
    6.10.1 Introduction to Association Concepts
      6.10.1.1 Support
      6.10.1.2 Confidence
      6.10.1.3 Lift
    6.10.2 Rule-Mining Algorithms
      6.10.2.1 Apriori
      6.10.2.2 Eclat
    6.10.3 Recommendation Algorithms
      6.10.3.1 User-Based Collaborative Filtering (UBCF)
      6.10.3.2 Item-Based Collaborative Filtering (IBCF)
    6.10.4 Conclusion
  6.11 Artificial Neural Networks
    6.11.1 Human Cognitive Learning
    6.11.2 Perceptron
    6.11.3 Sigmoid Neuron
    6.11.4 Neural Network Architecture
    6.11.5 Supervised versus Unsupervised Neural Nets
    6.11.6 Neural Network Learning Algorithms
      6.11.6.1 Evolutionary Methods
      6.11.6.2 Gene Expression Programming
      6.11.6.3 Simulated Annealing
      6.11.6.4 Expectation Maximization
      6.11.6.5 Non-Parametric Methods
      6.11.6.6 Particle Swarm Optimization
    6.11.7 Feed-Forward Back-Propagation
      6.11.7.1 Purchase Prediction: Neural Network-Based Classification
    6.11.8 Deep Learning
    6.11.9 Conclusion
  6.12 Text-Mining Approaches
    6.12.1 Introduction to Text Mining
    6.12.2 Text Summarization
    6.12.3 TF-IDF
    6.12.4 Part-of-Speech (POS) Tagging
    6.12.5 Word Cloud
    6.12.6 Text Analysis: Microsoft Cognitive Services
    6.12.7 Conclusion
  6.13 Online Machine Learning Algorithms
    6.13.1 Fuzzy C-Means Clustering
    6.13.2 Conclusion
  6.14 Model Building Checklist
  6.15 Summary
  6.16 References

Chapter 7: Machine Learning Model Evaluation
  7.1 Datasets
    7.1.1 House Sale Prices
    7.1.2 Purchase Preference
  7.2 Introduction to Model Performance and Evaluation
  7.3 Objectives of Model Performance Evaluation
  7.4 Population Stability Index
  7.5 Model Evaluation for Continuous Output
    7.5.1 Mean Absolute Error
    7.5.2 Root Mean Square Error
    7.5.3 R-Square
  7.6 Model Evaluation for Discrete Output
    7.6.1 Classification Matrix
    7.6.2 Sensitivity and Specificity
    7.6.3 Area Under ROC Curve
  7.7 Probabilistic Techniques
    7.7.1 K-Fold Cross-Validation
    7.7.2 Bootstrap Sampling
  7.8 The Kappa Error Metric
  7.9 Summary
  7.10 References

Chapter 8: Model Performance Improvement
  8.1 Machine Learning and Statistical Modeling
  8.2 Overview of the caret Package
  8.3 Introduction to Hyper-Parameters
  8.4 Hyper-Parameter Optimization
    8.4.1 Manual Search
    8.4.2 Manual Grid Search
    8.4.3 Automatic Grid Search
    8.4.4 Optimal Search
    8.4.5 Random Search
    8.4.6 Custom Searching
  8.5 The Bias and Variance Tradeoff
    8.5.1 Bagging or Bootstrap Aggregation
    8.5.2 Boosting
  8.6 Introduction to Ensemble Learning
    8.6.1 Voting Ensembles
    8.6.2 Advanced Methods in Ensemble Learning
      8.6.2.1 Bagging
      8.6.2.2 Boosting
  8.7 Ensemble Techniques Illustration in R
    8.7.1 Bagging Trees
    8.7.2 Gradient Boosting with a Decision Tree
    8.7.3 Blending KNN and rpart
    8.7.4 Stacking Using caretEnsemble
  8.8 Advanced Topic: Bayesian Optimization of Machine Learning Models
  8.9 Summary
  8.10 References

Chapter 9: Scalable Machine Learning and Related Technologies
  9.1 Distributed Processing and Storage
    9.1.1 Google File System (GFS)
    9.1.2 MapReduce
    9.1.3 Parallel Execution in R
      9.1.3.1 Setting the Cores
      9.1.3.2 Problem Statement
      9.1.3.3 Building the Model: Serial
      9.1.3.4 Building the Model: Parallel
      9.1.3.5 Stopping the Clusters
  9.2 The Hadoop Ecosystem
    9.2.1 MapReduce
      9.2.1.1 MapReduce Example: Word Count
    9.2.2 Hive
      9.2.2.1 Creating Tables
      9.2.2.2 Describing Tables
      9.2.2.3 Generating Data and Storing It in a Local File
      9.2.2.4 Loading the Data into the Hive Table
      9.2.2.5 Running a SELECT Query
    9.2.3 Apache Pig
      9.2.3.1 Connecting to Pig
      9.2.3.2 Loading the Data
      9.2.3.3 Tokenizing Each Line
      9.2.3.4 Flattening the Tokens
      9.2.3.5 Grouping the Words
      9.2.3.6 Counting and Sorting
    9.2.4 HBase
      9.2.4.1 Starting HBase
      9.2.4.2 Creating the Table and Putting Data
      9.2.4.3 Scanning the Data
    9.2.5 Spark
  9.3 Machine Learning in R with Spark
    9.3.1 Setting the Environment Variable
    9.3.2 Initializing the Spark Session
    9.3.3 Loading the Data and Running Pre-Processing
    9.3.4 Creating a SparkDataFrame
    9.3.5 Building the ML Model
    9.3.6 Predicting on the Test Data
    9.3.7 Stopping the SparkR Session
  9.4 Machine Learning in R with H2O
    9.4.1 Installation of Packages
    9.4.2 Initialization of H2O Clusters
    9.4.3 Deep Learning Demo in R with H2O
      9.4.3.1 Running the Demo
      9.4.3.2 Loading the Testing Data
  9.5 Summary
  9.6 References

Index