Machine Learning with R: Expert Techniques for Predictive Modeling

Hello, habrozhiteli! The R language offers a powerful set of machine learning methods that let you quickly carry out non-trivial analysis of your data. This book is a guide that will help you apply machine learning methods to everyday problems. Brett Lantz will teach you everything you need for data analysis, forecasting, and data visualization. Here you will find information on new and improved libraries, advice on the ethical aspects of machine learning and the problem of bias, as well as deep learning.

In this book:
- Fundamentals of machine learning and how a computer learns from examples.
- Preparing data for machine learning with the R language.
- Classifying important outcomes.
- Predicting events using decision trees, rules, and support vector machines.
- Forecasting numeric data and estimating financial values using regression methods.
- Modeling complex processes with neural networks, the foundation of deep learning.
- Evaluating models and improving their performance.
- The latest big data technologies, including R 3.6, Spark, H2O, and TensorFlow.

Who is the book for?


The book is intended for anyone who expects to use data in a specific field. You may already be somewhat familiar with machine learning but have never worked with the R language, or, conversely, you may know a little R but almost nothing about machine learning. Either way, this book will help you get up to speed quickly. A brief refresher on basic math and programming concepts would be helpful, but no prior experience is required. All you need is a desire to learn.

What you will read in this book
Chapter 1. Introducing Machine Learning
Chapter 2. Managing and Understanding Data
Chapter 3. Lazy Learning: Classification Using Nearest Neighbors
Chapter 4. Probabilistic Learning: Classification Using Naive Bayes
Chapter 5. Divide and Conquer: Classification Using Decision Trees and Rules
Chapter 6. Forecasting Numeric Data: Regression Methods
Chapter 7. "Black Box" Methods: Neural Networks and Support Vector Machines
Chapter 8. Finding Patterns: Market Basket Analysis Using Association Rules
Chapter 9. Finding Groups of Data: Clustering with k-means
Chapter 10. Evaluating Model Performance
Chapter 11. Improving Model Performance
Chapter 12. Specialized Machine Learning Topics


Example: modeling concrete strength using a neural network


In civil engineering, it is crucial to have accurate estimates of the performance of building materials. These estimates are needed to develop safety guidelines governing the materials used in the construction of buildings, bridges, and roads.

Estimating the strength of concrete is of particular interest. Concrete is used in almost every construction project, yet its performance varies widely because it consists of a large number of ingredients that interact in complex ways. As a result, it is difficult to predict exactly how strong the finished product will be. A model that could reliably determine concrete strength from the composition of the starting materials could lead to safer construction practices.

Step 1. Data collection


For this analysis, we will use the concrete compressive strength data donated by I-Cheng Yeh to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Since Yeh successfully used neural networks to model these data, we will try to reproduce his work using a simple neural network model in R.

According to the website, this dataset contains 1030 records of different concrete mixes with eight features describing the components used in the mixture. These features are believed to affect the final compressive strength. They include the amount (in kilograms per cubic meter) of cement, water, and various additives, as well as coarse and fine aggregate such as crushed stone and sand used in the finished product, plus the aging time (in days).

To run this example, download the concrete.csv file and save it in the R working directory.

Step 2. Research and data preparation


As usual, we begin the analysis by loading the data into an R object using the read.csv() function and confirming that the result matches the expected structure:

> concrete <- read.csv("concrete.csv")
> str(concrete)
'data.frame':   1030 obs. of  9 variables:
 $ cement      : num  141 169 250 266 155 ...
 $ slag        : num  212 42.2 0 114 183.4 ...
 $ ash         : num  0 124.3 95.7 0 0 ...
 $ water       : num  204 158 187 228 193 ...
 $ superplastic: num  0 10.8 5.5 0 9.1 0 0 6.4 0 9 ...
 $ coarseagg   : num  972 1081 957 932 1047 ...
 $ fineagg     : num  748 796 861 670 697 ...
 $ age         : int  28 14 28 28 28 90 7 56 28 28 ...
 $ strength    : num  29.9 23.5 29.2 45.9 18.3 ...

The nine variables in the data frame correspond to the eight features and one expected outcome, but a problem has become apparent. Neural networks work best when the input data are scaled to a narrow range centered on zero, and here we see values ranging from 0 up to more than 1000.

The usual solution to this problem is to rescale the data with a normalizing or standardization function. If the data follow a bell-shaped curve (a normal distribution, see Chapter 2), then it may make sense to standardize them using the built-in scale() function. If the distribution is close to uniform or very far from normal, normalization to the range from 0 to 1 may be more appropriate. In this case, we will use the latter option.
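For comparison, if the data had called for standardization instead, the built-in scale() function could be applied in much the same way. The following snippet is only an illustration of that alternative and is not used in the rest of this example:

> # z-score standardization (alternative not used in this example)
> concrete_std <- as.data.frame(scale(concrete))
> summary(concrete_std$strength)   # now centered on 0 rather than rescaled to [0, 1]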

In Chapter 3, we created our own normalize() function:

> normalize <- function(x) {
       return((x - min(x)) / (max(x) - min(x)))
}

Once this code has been executed, the normalize() function can be applied to every column of the data frame using the lapply() function:

> concrete_norm <- as.data.frame(lapply(concrete, normalize))

To confirm that the normalization worked, we can check that the minimum and maximum values of the strength attribute are 0 and 1, respectively:

> summary(concrete_norm$strength)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.2664  0.4001  0.4172  0.5457  1.0000

For comparison, the original minimum and maximum values of this attribute were 2.33 and 82.60, respectively:

> summary(concrete$strength)
  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
  2.33   23.71  34.44  35.82   46.14  82.60

Any transformation applied to the data before training the model must later be applied in reverse to convert the predictions back to the original units. To make that rescaling easier, it is wise to save the original data, or at least summary statistics of the original data.
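One simple way to do this, for instance, is to record the original minimum and maximum of the outcome before normalizing. The variable names below are our own and do not appear in the book's code:

> # Save the original range of the outcome for later rescaling
> strength_min <- min(concrete$strength)
> strength_max <- max(concrete$strength)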

Following the precedent set by Yeh in the original paper, we will split the data into a training set containing 75 percent of the examples and a test set with the remaining 25 percent. The CSV file we are using is already sorted in random order, so we can simply divide it into two parts. We will use the training dataset to build the neural network and the test dataset to evaluate how well the model generalizes to future results. Since neural networks are prone to overfitting, this step is very important.

> concrete_train <- concrete_norm[1:773, ]
> concrete_test <- concrete_norm[774:1030, ]




Step 3. Training the model on data


To model the relationship between the ingredients used in concrete production and the strength of the finished product, we will build a multilayer feedforward neural network. The neuralnet package, developed by Stefan Fritsch and Frauke Guenther, provides a standard and easy-to-use implementation of such networks. It also includes a function for plotting the network topology. The neuralnet implementation is a good way to learn more about neural networks, but that does not mean it cannot be used to do real work either; as you will soon see, it is quite a powerful tool.

There are several other R packages for training neural network models, each with its own strengths and weaknesses. The nnet package, which ships with the standard R installation, is perhaps the most frequently cited implementation. Another strong option is the RSNNS package, which offers a more complete suite of neural network functionality, at the cost of being more difficult to learn.

Since the neuralnet package is not included in base R, you will need to install it by typing install.packages("neuralnet") and load it with the library(neuralnet) command. The neuralnet() function in this package can be used to train a neural network for numeric prediction using the syntax described in the box below.
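Spelled out as commands, the installation and loading step is simply:

> install.packages("neuralnet")   # one-time installation
> library(neuralnet)              # load the package in each session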

Neural network syntax


Using the neuralnet () function from the neuralnet package

Building a model:

m <- neuralnet(target ~ predictors, data = mydata,
                       hidden = 1, act.fct = "logistic")

• target - the outcome variable in the mydata data frame to be modeled;

• predictors - an R formula specifying the features from the mydata data frame to use for prediction;

• data - data frame to which target and predictors belong;

• hidden - the number of neurons in the hidden layer (default is 1). Note: to describe several hidden layers, use a vector of integers, for example, c(2, 2);

• act.fct - activation function: "logistic" or "tanh". Note: any other differentiable function can also be used.

The function returns a neural network object that can be used for forecasting.

Prediction:

p <- compute(m, test)

• m - a model trained by the neuralnet() function;

• test - a data frame containing test data with the same features as the training data used to build the classifier.

The function returns a list with two components: $neurons, which stores the neurons for each layer of the network, and $net.result, which stores the values predicted by the model.

Examples:



concrete_model <- neuralnet(strength ~ cement + slag + ash,
      data = concrete, hidden = c(5, 5), act.fct = "tanh")
model_results <- compute(concrete_model, concrete_data)
strength_predictions <- model_results$net.result

Let's begin by training the simplest multilayer feedforward network, using the default settings, with only a single hidden node:

> concrete_model <- neuralnet(strength ~ cement + slag
         + ash + water + superplastic + coarseagg + fineagg + age,
         data = concrete_train)

Then, as shown in Fig. 7.11, you can visualize the network topology using the plot() function, passing it the resulting model object:

> plot(concrete_model)

[Fig. 7.11]

In this simple model, there is one input node for each of the eight features, followed by a single hidden node and a single output node that predicts concrete strength. The diagram also shows the weights for each of the connections and the bias terms indicated by the nodes labeled 1. A bias term is a numeric constant that allows the value at the indicated node to be shifted up or down, much like the intercept in a linear equation.

A neural network with a single hidden node can be thought of as a distant cousin of the linear regression models discussed in Chapter 6. The weights between the input nodes and the hidden node are similar to beta coefficients, and the weight of the bias term is similar to the intercept.
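To make the analogy concrete, one could fit an ordinary linear regression on the same training data and compare its test-set correlation with the network's. This comparison is our own aside and is not part of the original example:

> # Baseline for comparison: ordinary least squares on the same (normalized) features
> lm_model <- lm(strength ~ ., data = concrete_train)
> lm_pred <- predict(lm_model, concrete_test)
> cor(lm_pred, concrete_test$strength)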

At the bottom of the figure, R reports the number of training steps and the magnitude of the error, measured as the sum of squared errors (SSE), which, as you would expect, is the sum of the squared differences between the predicted and actual values. A lower SSE means the model fits the training data more closely, but it says little about how the model will perform on unseen data.
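For example, the sum of squared differences can be recomputed by hand from the training-set predictions. This check is our own illustration; note that the error value plotted by neuralnet may differ by a constant factor depending on the error function it uses:

> # Recompute the sum of squared differences on the training data
> train_pred <- compute(concrete_model, concrete_train[1:8])$net.result
> sum((train_pred - concrete_train$strength)^2)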

Step 4. Assessing the effectiveness of the model


The network topology diagram gives us a peek into the "black box" of the neural network, but it does not provide much information about how well the model will fit future data. To generate predictions on the test dataset, we can use the compute() function:

> model_results <- compute(concrete_model, concrete_test[1:8])

The compute() function works a little differently from the predict() functions we have used so far. It returns a list with two components: $neurons, which stores the neurons for each layer of the network, and $net.result, which stores the predicted values. It is $net.result that we need:

> predicted_strength <- model_results$net.result

Because this is a numeric prediction problem rather than a classification problem, we cannot use a confusion matrix to examine the model's accuracy. Instead, we measure the correlation between the predicted and the true values of concrete strength. If the predicted and actual values are strongly correlated, the model is likely to be useful for estimating concrete strength.

Recall that the cor() function is used to obtain the correlation between two numeric vectors:

> cor(predicted_strength, concrete_test$strength)
                    [,1]
[1,] 0.8064655576

Don't be alarmed if your result differs from ours. Because the neural network begins with random weights, the predictions may vary from model to model. If you want to match the book's results exactly, try running set.seed(12345) before building the neural network.

A correlation close to 1 indicates a strong linear relationship between two variables. Therefore, a correlation of about 0.806 indicates a fairly strong relationship. This means the model performs reasonably well even with a single hidden node. Given that we used only one hidden node, it is likely that we can improve the model's performance, which we will now attempt.

Step 5. Improving Model Efficiency


Since networks with more complex topologies are capable of learning more difficult concepts, let's see what happens when we increase the number of hidden nodes to five. We use the neuralnet() function as before, but add the parameter hidden = 5:

> concrete_model2 <- neuralnet(strength ~ cement + slag +
                                               ash + water + superplastic +
                                               coarseagg + fineagg + age,
                                               data = concrete_train, hidden = 5)

Plotting the network again (Fig. 7.12), we see a sharp increase in the number of connections. How did this affect performance?

> plot(concrete_model2)

Notice that the reported error (again measured as SSE) has decreased from 5.08 in the previous model to 1.63. In addition, the number of training steps increased from 4,882 to 86,849, which is not surprising given how much more complex the model has become. More complex networks require more iterations to find the optimal weights.

Applying the same steps to compare the predicted values with the true values, we obtain a correlation of about 0.92, a considerable improvement over the previous result of 0.80 for the network with a single hidden node:

> model_results2 <- compute(concrete_model2, concrete_test[1:8])
> predicted_strength2 <- model_results2$net.result
> cor(predicted_strength2, concrete_test$strength)
                  [,1]
[1,] 0.9244533426

[Fig. 7.12]

Despite these significant improvements, we can go even further to increase the model's performance. In particular, we can introduce additional hidden layers and change the network's activation function. In making these changes, we lay the groundwork for a simple deep neural network.

The choice of activation function is very important for deep learning. The best function for a particular learning task is usually found experimentally and then becomes widely adopted by the machine learning research community.

Recently, an activation function known as the rectifier has become extremely popular due to its success in complex tasks such as image recognition. A neural network node that uses the rectifier as its activation function is called a rectified linear unit (ReLU). As shown in Fig. 7.13, the rectifier activation function is defined so that it returns x if x is greater than or equal to 0, and 0 otherwise. The significance of this function is that it is nonlinear, yet has simple mathematical properties that make it computationally cheap and highly efficient for gradient descent. Unfortunately, the derivative of the rectifier is undefined at x = 0, so the rectifier cannot be used with the neuralnet() function.
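Written in R, the rectifier is simply the following (shown for illustration only; as noted, it cannot be passed to neuralnet() because its derivative is undefined at zero):

> # Rectifier (ReLU): returns x when x >= 0 and 0 otherwise
> relu <- function(x) { ifelse(x >= 0, x, 0) }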

Instead, we can use a smoothed approximation of ReLU known as softplus or SmoothReLU, an activation function defined as log(1 + e^x). As shown in Fig. 7.13, the softplus function is close to zero for values of x less than 0 and approximately equal to x for x greater than 0.

[Fig. 7.13]

To define the softplus() function in R, we use the following code:

> softplus <- function(x) { log(1 + exp(x)) }

This activation function can be supplied to neuralnet() via the act.fct parameter. In addition, we add a second hidden layer of five nodes by assigning the hidden parameter the integer vector c(5, 5). The result is a two-layer network, each layer of which has five nodes, all using the softplus activation function:

> set.seed(12345)
> concrete_model3 <- neuralnet(strength ~ cement + slag +
                                               ash + water + superplastic +
                                               coarseagg + fineagg + age,
                                               data = concrete_train,
                                               hidden = c(5, 5),
                                               act.fct = softplus)

As before, the network can be visualized (Fig. 7.14):

> plot(concrete_model3)

[Fig. 7.14]

The correlation between the predicted and actual strength of concrete can be calculated as follows:

> model_results3 <- compute(concrete_model3, concrete_test[1:8])
> predicted_strength3 <- model_results3$net.result
> cor(predicted_strength3, concrete_test$strength)
                  [,1]
[1,] 0.9348395359

The correlation between the predicted and actual strength was 0.935, the best result so far. Interestingly, in the original publication, Yeh reported a correlation of 0.885. This means that with relatively little effort we were able to match and even exceed the results of a subject-matter expert. To be fair, Yeh's results were published in 1998, giving us the benefit of more than 20 years of additional neural network research!

Another important detail should be kept in mind: because we normalized the data before training the model, the predictions are also on the normalized scale from 0 to 1. For example, the following code builds a data frame that compares, row by row, the concrete strength values from the original dataset with the corresponding predictions:

> strengths <- data.frame(
      actual = concrete$strength[774:1030],
      pred = predicted_strength3
   )
> head(strengths, n = 3)
    actual         pred
774  30.14 0.2860639091
775  44.40 0.4777304648
776  24.50 0.2840964250


Examining the correlation (shown below), we see that the choice of normalized or unnormalized data does not affect the computed performance statistic: just as before, the correlation is 0.935. However, if we computed a different performance metric, such as the absolute difference between the predicted and actual values, the choice of scale would matter a great deal. With this in mind, we can create an unnormalize() function that reverses the min-max normalization and allows us to convert the normalized predictions back to the original scale:

> cor(strengths$pred, strengths$actual)
[1] 0.9348395359

> unnormalize <- function(x) {
     return(x * (max(concrete$strength) -
           min(concrete$strength)) + min(concrete$strength))
   }

After applying our unnormalize() function to the predictions, we can see that the new predictions are on a scale similar to the original concrete strength values. This allows us to compute a meaningful absolute error value. In addition, the correlation between the unnormalized predictions and the original strength values remains unchanged:

> strengths$pred_new <- unnormalize(strengths$pred)
> strengths$error <- strengths$pred_new - strengths$actual
> head(strengths, n = 3)
    actual         pred    pred_new        error
774  30.14 0.2860639091 23.62887889 -6.511121108
775  44.40 0.4777304648 39.46053639 -4.939463608
776  24.50 0.2840964250 23.46636470 -1.033635298

> cor(strengths$pred_new, strengths$actual)
[1] 0.9348395359
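With the predictions back on the original scale, a meaningful absolute error summary can now be computed, for instance the mean absolute error. This check is our own addition and does not appear in the book's code:

> # Mean absolute error in the original units of strength
> mean(abs(strengths$error))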

When applying neural networks to your own projects, you will need to follow a similar sequence of steps to return the data to their original scale.

You may also find that neural networks quickly become more complicated as they are applied to ever more challenging learning tasks. For example, you may run into the so-called vanishing gradient problem and the closely related exploding gradient problem, in which the backpropagation algorithm fails to find a useful solution because it does not converge in a reasonable time. To address these problems, you can try changing the number of hidden nodes, applying different activation functions such as ReLU, adjusting the learning rate, and so on. The help page for the neuralnet function provides more information on the various parameters that can be adjusted. However, this leads to another problem: testing a large number of parameters becomes the bottleneck in building a high-performing model. This is the price of using neural networks, and of deep learning networks even more so: their enormous potential requires a great deal of time and computing power.
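As a rough illustration (not taken from the book), a small loop can compare a few hidden-layer configurations by their test-set correlation; the configurations and the seed below are arbitrary choices:

> # Compare several hidden-layer configurations by test-set correlation
> for (h in list(1, 5, c(5, 5))) {
     set.seed(12345)
     m <- neuralnet(strength ~ cement + slag + ash + water + superplastic +
                      coarseagg + fineagg + age,
                    data = concrete_train, hidden = h)
     p <- compute(m, concrete_test[1:8])$net.result
     cat("hidden =", paste(h, collapse = ","),
         " correlation =", round(cor(p, concrete_test$strength), 3), "\n")
  }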

Specialized hardware and cloud computing platforms such as Amazon Web Services (AWS) and Microsoft Azure can help offset the computational demands of ML; these options are discussed in Chapter 12.


Support vector machines


The support vector machine (SVM) method can be visualized as a surface that forms a boundary between data points plotted in a multidimensional space representing the examples and the values of their features. The goal of an SVM is to create a flat boundary, a hyperplane, that divides the space so that homogeneous groups form on either side of it. In this way, SVM learning combines aspects of both the instance-based nearest neighbor learning described in Chapter 3 and the linear regression modeling discussed in Chapter 6. This is an extremely powerful combination that allows SVMs to model very complex relationships.

Although the basic mathematics behind SVMs have existed for decades, interest in them grew substantially after they began to be applied to machine learning. Their popularity surged after high-profile successes on difficult learning problems and after the development of award-winning SVM algorithms that were implemented in well-supported libraries across many programming languages, including R. This brought SVM methods to a much wider audience; otherwise, the complex mathematics required to implement an SVM would probably have kept them out of reach. The good news is that although the mathematics may be complex, the basic concepts are understandable.

SVM methods can be adapted to almost any type of learning task, including both classification and numeric prediction. Many of the algorithm's key successes have come in pattern recognition. Notable applications include the following:

  • classification of microarray gene expression data in bioinformatics to detect cancer and other genetic diseases;
  • text categorization, such as identifying the language of a document or classifying documents by topic;
  • detection of rare but important events, such as combustion engine failures, security breaches, or earthquakes.


SVM methods are easiest to understand when used for binary classification, which is how they are traditionally applied. Therefore, in the remaining sections we will focus on SVM classifiers. The same principles presented here are used when adapting SVMs to numeric prediction.
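As a brief preview of what such a classifier looks like in R (this sketch is not part of the excerpt; the kernlab package and the train_data and test_data data frames with a class label column are assumptions made for illustration):

> # Hypothetical binary SVM classifier with a linear kernel (a flat hyperplane)
> library(kernlab)
> svm_model <- ksvm(class ~ ., data = train_data,
                    kernel = "vanilladot",   # linear kernel
                    C = 1)                   # cost of constraint violations
> svm_pred <- predict(svm_model, test_data)
> table(svm_pred, test_data$class)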

About the author


Brett Lantz (@DataSpelunking) has spent more than a decade using innovative data methods to study human behavior. A sociologist by training, Brett first became interested in machine learning while exploring a large database of teenagers' social network profiles. Brett is a DataCamp instructor and frequently speaks at machine learning conferences and workshops around the world. He is a well-known enthusiast for the practical application of data science in sports, autonomous vehicles, foreign language learning, and fashion, among many other fields. Brett hopes to one day write about all of this at dataspelunking.com, a site dedicated to sharing knowledge about finding patterns in data.

About the science editor


Raghav Bali is a senior researcher at one of the world's largest healthcare organizations. He works on research and development of enterprise solutions based on machine learning, deep learning, and natural language processing for healthcare and insurance. In his previous position at Intel, he was involved in proactive, big-data-driven IT initiatives using natural language processing, deep learning, and traditional statistical methods. At American Express, he worked on digital engagement and customer retention.

Raghav has authored several books with leading publishers, the most recent of which covers the latest advances in transfer learning.

Raghav holds a master's degree (with honors) from the International Institute of Information Technology in Bangalore. In the rare moments when he is not busy solving data problems, he enjoys reading and photographing everything in sight.

» More information about the book can be found on the publisher's website
» Contents
» Excerpt

For khabrozhiteli, a 25% discount with this coupon code: Machine Learning

When you purchase the paper version of the book, an electronic copy is sent by e-mail.
