Machine Learning Using R

By: Karthik Ramasubramanian, Abhishek Singh

Apress, 2016

ISBN: 9781484223345, 580 pages

Format: PDF, online reading

Copy protection: watermark

Supported devices: Mac OS X, Windows PC, all DRM-capable eReaders, Apple iPad, Android tablet PCs. Online reading for: Mac OS X, Linux, Windows PC

Price: 42.79 EUR


Contents at a Glance
Contents
About the Authors
About the Technical Reviewer
Acknowledgments

Chapter 1: Introduction to Machine Learning and R
  1.1 Understanding the Evolution
    1.1.1 Statistical Learning
    1.1.2 Machine Learning (ML)
    1.1.3 Artificial Intelligence (AI)
    1.1.4 Data Mining
    1.1.5 Data Science
  1.2 Probability and Statistics
    1.2.1 Counting and Probability Definition
    1.2.2 Events and Relationships
      1.2.2.1 Independent Events
      1.2.2.2 Conditional Independence
      1.2.2.3 Bayes' Theorem
    1.2.3 Randomness, Probability, and Distributions
    1.2.4 Confidence Interval and Hypothesis Testing
      1.2.4.1 Confidence Interval
      1.2.4.2 Hypothesis Testing
  1.3 Getting Started with R
    1.3.1 Basic Building Blocks
      1.3.1.1 Calculations
      1.3.1.2 Statistics with R
      1.3.1.3 Packages
    1.3.2 Data Structures in R
      1.3.2.1 Vectors
      1.3.2.2 Lists
      1.3.2.3 Matrices
      1.3.2.4 Data Frames
    1.3.3 Subsetting
      1.3.3.1 Vectors
      1.3.3.2 Lists
      1.3.3.3 Matrices
      1.3.3.4 Data Frames
    1.3.4 Functions and the Apply Family
  1.4 Machine Learning Process Flow
    1.4.1 Plan
    1.4.2 Explore
    1.4.3 Build
    1.4.4 Evaluate
  1.5 Other Technologies
  1.6 Summary
  1.7 References

Chapter 2: Data Preparation and Exploration
  2.1 Planning the Gathering of Data
    2.1.1 Variable Types
      2.1.1.1 Categorical Variables
      2.1.1.2 Continuous Variables
    2.1.2 Data Formats
      2.1.2.1 Comma-Separated Values
      2.1.2.2 Microsoft Excel
      2.1.2.3 Extensible Markup Language: XML
      2.1.2.4 Hypertext Markup Language: HTML
      2.1.2.5 JSON
      2.1.2.6 Other Formats
    2.1.3 Data Sources
      2.1.3.1 Structured
      2.1.3.2 Semi-Structured
      2.1.3.3 Unstructured
  2.2 Initial Data Analysis (IDA)
    2.2.1 Discerning a First Look
      2.2.1.1 Function str()
      2.2.1.2 Naming Convention: make.names()
      2.2.1.3 table(): Pattern or Trend
    2.2.2 Organizing Multiple Sources of Data into One
      2.2.2.1 Merge and dplyr Joins
        2.2.2.1.1 Using merge
        2.2.2.1.2 Using dplyr
    2.2.3 Cleaning the Data
      2.2.3.1 Correcting Factor Variables
      2.2.3.2 Dealing with NAs
      2.2.3.3 Dealing with Dates and Times
        2.2.3.3.1 Time Zone
        2.2.3.3.2 Daylight Saving Time
    2.2.4 Supplementing with More Information
      2.2.4.1 Derived Variables
      2.2.4.2 n-day Averages
    2.2.5 Reshaping
  2.3 Exploratory Data Analysis
    2.3.1 Summary Statistics
      2.3.1.1 Quantile
      2.3.1.2 Mean
      2.3.1.3 Frequency Plot
      2.3.1.4 Boxplot
    2.3.2 Moments
      2.3.2.1 Variance
      2.3.2.2 Skewness
      2.3.2.3 Kurtosis
  2.4 Case Study: Credit Card Fraud
    2.4.1 Data Import
    2.4.2 Data Transformation
    2.4.3 Data Exploration
  2.5 Summary
  2.6 References

Chapter 3: Sampling and Resampling Techniques
  3.1 Introduction to Sampling
  3.2 Sampling Terminology
    3.2.1 Sample
    3.2.2 Sampling Distribution
    3.2.3 Population Mean and Variance
    3.2.4 Sample Mean and Variance
    3.2.5 Pooled Mean and Variance
    3.2.6 Sample Point
    3.2.7 Sampling Error
    3.2.8 Sampling Fraction
    3.2.9 Sampling Bias
    3.2.10 Sampling Without Replacement (SWOR)
    3.2.11 Sampling With Replacement (SWR)
  3.3 Credit Card Fraud: Population Statistics
    3.3.1 Data Description
    3.3.2 Population Mean
    3.3.3 Population Variance
    3.3.4 Pooled Mean and Variance
  3.4 Business Implications of Sampling
    3.4.1 Features of Sampling
    3.4.2 Shortcomings of Sampling
  3.5 Probability and Non-Probability Sampling
    3.5.1 Types of Non-Probability Sampling
      3.5.1.1 Convenience Sampling
      3.5.1.2 Purposive Sampling
      3.5.1.3 Quota Sampling
  3.6 Statistical Theory on Sampling Distributions
    3.6.1 Law of Large Numbers: LLN
      3.6.1.1 Weak Law of Large Numbers
      3.6.1.2 Strong Law of Large Numbers
      3.6.1.3 Steps in Simulation with R Code
    3.6.2 Central Limit Theorem
      3.6.2.1 Steps in Simulation with R Code
  3.7 Probability Sampling Techniques
    3.7.1 Population Statistics
    3.7.2 Simple Random Sampling
    3.7.3 Systematic Random Sampling
    3.7.4 Stratified Random Sampling
    3.7.5 Cluster Sampling
    3.7.6 Bootstrap Sampling
  3.8 Monte Carlo Method: Acceptance-Rejection Method
  3.9 A Qualitative Account of Computational Savings by Sampling
  3.10 Summary

Chapter 4: Data Visualization in R
  4.1 Introduction to the ggplot2 Package
  4.2 World Development Indicators
  4.3 Line Charts
  4.4 Stacked Column Charts
  4.5 Scatterplots
  4.6 Boxplots
  4.7 Histograms and Density Plots
  4.8 Pie Charts
  4.9 Correlation Plots
  4.10 Heat Maps
  4.11 Bubble Charts
  4.12 Waterfall Charts
  4.13 Dendrograms
  4.14 Word Clouds
  4.15 Sankey Plots
  4.16 Time Series Graphs
  4.17 Cohort Diagrams
  4.18 Spatial Maps
  4.19 Summary
  4.20 References

Chapter 5: Feature Engineering
  5.1 Introduction to Feature Engineering
    5.1.1 Filter Methods
    5.1.2 Wrapper Methods
    5.1.3 Embedded Methods
  5.2 Understanding the Working Data
    5.2.1 Data Summary
    5.2.2 Properties of the Dependent Variable
    5.2.3 Feature Availability: Continuous or Categorical
    5.2.4 Setting Up Data Assumptions
  5.3 Feature Ranking
  5.4 Variable Subset Selection
    5.4.1 Filter Methods
    5.4.2 Wrapper Methods
    5.4.3 Embedded Methods
  5.5 Dimensionality Reduction
  5.6 Feature Engineering Checklist
  5.7 Summary
  5.8 References

Chapter 6: Machine Learning Theory and Practices
  6.1 Machine Learning Types
    6.1.1 Supervised Learning
    6.1.2 Unsupervised Learning
    6.1.3 Semi-Supervised Learning
    6.1.4 Reinforcement Learning
  6.2 Groups of Machine Learning Algorithms
  6.3 Real-World Datasets
    6.3.1 House Sale Prices
    6.3.2 Purchase Preference
    6.3.3 Twitter Feeds and Articles
    6.3.4 Breast Cancer
    6.3.5 Market Basket
    6.3.6 Amazon Food Review
  6.4 Regression Analysis
  6.5 Correlation Analysis
    6.5.1 Linear Regression
      6.5.1.2 Best Linear Predictors
    6.5.2 Simple Linear Regression
    6.5.3 Multiple Linear Regression
    6.5.4 Model Diagnostics: Linear Regression
      6.5.4.1 Influential Point Analysis
      6.5.4.2 Normality of Residuals
      6.5.4.3 Multicollinearity
      6.5.4.4 Residual Autocorrelation
      6.5.4.5 Homoscedasticity
    6.5.5 Polynomial Regression
    6.5.6 Logistic Regression
    6.5.7 Logit Transformation
    6.5.8 Odds Ratio
      6.5.8.1 Binomial Logistic Model
    6.5.9 Model Diagnostics: Logistic Regression
      6.5.9.1 Wald Test
      6.5.9.2 Deviance
      6.5.9.3 Pseudo R-Square
      6.5.9.4 Bivariate Plots
      6.5.9.5 Cumulative Gains and Lift Charts
      6.5.9.6 Concordant and Discordant Ratios
    6.5.10 Multinomial Logistic Regression
    6.5.11 Generalized Linear Models
    6.5.12 Conclusion
  6.6 Support Vector Machine (SVM)
    6.6.1 Linear SVM
      6.6.1.1 Hard Margins
      6.6.1.2 Soft Margins
    6.6.2 Binary SVM Classifier
    6.6.3 Multi-Class SVM
    6.6.4 Conclusion
  6.7 Decision Trees
    6.7.1 Types of Decision Trees
      6.7.1.1 Regression Trees
      6.7.1.2 Classification Trees
    6.7.2 Decision Measures
      6.7.2.1 Gini Index
      6.7.2.2 Entropy
      6.7.2.3 Information Gain
    6.7.3 Decision Tree Learning Methods
      6.7.3.1 Iterative Dichotomizer 3
      6.7.3.2 C5.0 Algorithm
      6.7.3.3 Classification and Regression Tree: CART
      6.7.3.4 Chi-Square Automatic Interaction Detection: CHAID
    6.7.4 Ensemble Trees
      6.7.4.1 Boosting
      6.7.4.2 Bagging
        6.7.4.2.1 Bagging CART
        6.7.4.2.2 Random Forest
    6.7.5 Conclusion
  6.8 The Naive Bayes Method
    6.8.1 Conditional Probability
    6.8.2 Bayes' Theorem
    6.8.3 Prior Probability
    6.8.4 Posterior Probability
    6.8.5 Likelihood and Marginal Likelihood
    6.8.6 Naive Bayes Methods
    6.8.7 Conclusion
  6.9 Cluster Analysis
    6.9.1 Introduction to Clustering
    6.9.2 Clustering Algorithms
      6.9.2.1 Hierarchical Clustering
      6.9.2.2 Centroid-Based Clustering
      6.9.2.3 Distribution-Based Clustering
      6.9.2.4 Density-Based Clustering
    6.9.3 Internal Evaluation
      6.9.3.1 Dunn Index
      6.9.3.2 Silhouette Coefficient
    6.9.4 External Evaluation
      6.9.4.1 Rand Measure
      6.9.4.2 Jaccard Index
    6.9.5 Conclusion
  6.10 Association Rule Mining
    6.10.1 Introduction to Association Concepts
      6.10.1.1 Support
      6.10.1.2 Confidence
      6.10.1.3 Lift
    6.10.2 Rule-Mining Algorithms
      6.10.2.1 Apriori
      6.10.2.2 Eclat
    6.10.3 Recommendation Algorithms
      6.10.3.1 User-Based Collaborative Filtering (UBCF)
      6.10.3.2 Item-Based Collaborative Filtering (IBCF)
    6.10.4 Conclusion
  6.11 Artificial Neural Networks
    6.11.1 Human Cognitive Learning
    6.11.2 Perceptron
    6.11.3 Sigmoid Neuron
    6.11.4 Neural Network Architecture
    6.11.5 Supervised versus Unsupervised Neural Nets
    6.11.6 Neural Network Learning Algorithms
      6.11.6.1 Evolutionary Methods
      6.11.6.2 Gene Expression Programming
      6.11.6.3 Simulated Annealing
      6.11.6.4 Expectation Maximization
      6.11.6.5 Non-Parametric Methods
      6.11.6.6 Particle Swarm Optimization
    6.11.7 Feed-Forward Back-Propagation
      6.11.7.1 Purchase Prediction: Neural Network-Based Classification
    6.11.8 Deep Learning
    6.11.9 Conclusion
  6.12 Text-Mining Approaches
    6.12.1 Introduction to Text Mining
    6.12.2 Text Summarization
    6.12.3 TF-IDF
    6.12.4 Part-of-Speech (POS) Tagging
    6.12.5 Word Cloud
    6.12.6 Text Analysis: Microsoft Cognitive Services
    6.12.7 Conclusion
  6.13 Online Machine Learning Algorithms
    6.13.1 Fuzzy C-Means Clustering
    6.13.2 Conclusion
  6.14 Model Building Checklist
  6.15 Summary
  6.16 References

Chapter 7: Machine Learning Model Evaluation
  7.1 Datasets
    7.1.1 House Sale Prices
    7.1.2 Purchase Preference
  7.2 Introduction to Model Performance and Evaluation
  7.3 Objectives of Model Performance Evaluation
  7.4 Population Stability Index
  7.5 Model Evaluation for Continuous Output
    7.5.1 Mean Absolute Error
    7.5.2 Root Mean Square Error
    7.5.3 R-Square
  7.6 Model Evaluation for Discrete Output
    7.6.1 Classification Matrix
    7.6.2 Sensitivity and Specificity
    7.6.3 Area Under ROC Curve
  7.7 Probabilistic Techniques
    7.7.1 K-Fold Cross-Validation
    7.7.2 Bootstrap Sampling
  7.8 The Kappa Error Metric
  7.9 Summary
  7.10 References

Chapter 8: Model Performance Improvement
  8.1 Machine Learning and Statistical Modeling
  8.2 Overview of the caret Package
  8.3 Introduction to Hyper-Parameters
  8.4 Hyper-Parameter Optimization
    8.4.1 Manual Search
    8.4.2 Manual Grid Search
    8.4.3 Automatic Grid Search
    8.4.4 Optimal Search
    8.4.5 Random Search
    8.4.6 Custom Searching
  8.5 The Bias and Variance Tradeoff
    8.5.1 Bagging or Bootstrap Aggregation
    8.5.2 Boosting
  8.6 Introduction to Ensemble Learning
    8.6.1 Voting Ensembles
    8.6.2 Advanced Methods in Ensemble Learning
      8.6.2.1 Bagging
      8.6.2.2 Boosting
  8.7 Ensemble Techniques Illustration in R
    8.7.1 Bagging Trees
    8.7.2 Gradient Boosting with a Decision Tree
    8.7.3 Blending KNN and rpart
    8.7.4 Stacking Using caretEnsemble
  8.8 Advanced Topic: Bayesian Optimization of Machine Learning Models
  8.9 Summary
  8.10 References

Chapter 9: Scalable Machine Learning and Related Technologies
  9.1 Distributed Processing and Storage
    9.1.1 Google File System (GFS)
    9.1.2 MapReduce
    9.1.3 Parallel Execution in R
      9.1.3.1 Setting the Cores
      9.1.3.2 Problem Statement
      9.1.3.3 Building the Model: Serial
      9.1.3.4 Building the Model: Parallel
      9.1.3.5 Stopping the Clusters
  9.2 The Hadoop Ecosystem
    9.2.1 MapReduce
      9.2.1.1 MapReduce Example: Word Count
    9.2.2 Hive
      9.2.2.1 Creating Tables
      9.2.2.2 Describing Tables
      9.2.2.3 Generating Data and Storing It in a Local File
      9.2.2.4 Loading the Data into the Hive Table
      9.2.2.5 Running a SELECT Query
    9.2.3 Apache Pig
      9.2.3.1 Connecting to Pig
      9.2.3.2 Loading the Data
      9.2.3.3 Tokenizing Each Line
      9.2.3.4 Flattening the Tokens
      9.2.3.5 Grouping the Words
      9.2.3.6 Counting and Sorting
    9.2.4 HBase
      9.2.4.1 Starting HBase
      9.2.4.2 Creating the Table and Putting Data
      9.2.4.3 Scanning the Data
    9.2.5 Spark
  9.3 Machine Learning in R with Spark
    9.3.1 Setting the Environment Variable
    9.3.2 Initializing the Spark Session
    9.3.3 Loading the Data and Running Pre-Processing
    9.3.4 Creating a SparkDataFrame
    9.3.5 Building the ML Model
    9.3.6 Predicting on the Test Data
    9.3.7 Stopping the SparkR Session
  9.4 Machine Learning in R with H2O
    9.4.1 Installation of Packages
    9.4.2 Initialization of H2O Clusters
    9.4.3 Deep Learning Demo in R with H2O
      9.4.3.1 Running the Demo
      9.4.3.2 Loading the Testing Data
  9.5 Summary
  9.6 References

Index