## Statistical Skills For The Machine Learning Engineer

# Machine Learning Is One Of The Topics Of The Day In The Information Technology World Many Users Are Interested In Learning.

However, mastery of existing tools and frameworks and the ability to analyze the problem are only part of the story.

The central part of this story is mastery of mathematics and statistics. More precisely, without familiarity with the mathematical topics of the model and algorithm you intend to implement, it will not be completed successfully.

On this basis, it is essential to think about learning mathematics and especially statistics in addition to learning the basic principles of machine learning.

In this article, you will learn about essential topics from the world of statistics; mastering them not only simplifies the process of implementing models but also allows models to provide more accurate output.

There are so many mathematical equations, formulas, and rules governing the world of machine learning that it may not be possible to master them; however, focusing on learning them will help you perform the assigned tasks better. In this article, you will learn about 16 essential things we suggest you think about learning as a machine learning engineer or data mining engineer.

**1. Permutation test**

The permutation test is a type of statistical hypothesis test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under the retrievals of the labels of the observed points.

In other words, how behaviors are attributed to subjects in an experimental design is reflected in the design analysis. If the tags are interchangeable under the null hypothesis, the resulting tests provide precise significance levels that allow the observation of exchangeable random variables. Confidence intervals can be obtained based on these tests.

**2. Statistical hypothesis test**

Statistical hypothesis testing is a method for checking claims or assumptions about distribution parameters in statistical societies if an engineer wants to make an opinion based on sample data on whether the average life of a specific type of car tire is at least 22,000 miles or not if an agricultural expert wants to make an opinion based on experiments whether a particular kind of agricultural fertilizer produces more beans than another fertilizer.

Does it or not, and if a manufacturer of pharmaceutical products wants to make an opinion based on samples, whether 90% of all patients who take a new drug will recover from a particular disease or not, all these issues can be returned to the language of statistical hypothesis testing.

A statistical hypothesis is a statement or guesses about the distribution of one or more random variables. If a statistical assumption specifies the distribution, it is called a simple assumption. Otherwise, it is called a compound assumption.

**3. Stable statistics**

Robust statistics is a way to achieve basic statistical methods so that estimates are not affected by abnormally large or small values. The robust statistic is a well-performing statistic for data drawn from a wide range of probability distributions, especially for distributions that are not normal.

Robust statistical methods have been developed to solve many common problems, such as location parameter estimation, scale parameter estimation, and linear regression. One of the goals of creating statistical methods is that outliers do not unnecessarily influence the results.

Another goal is to provide methods with good performance when there are few departures from the parametric distributions. For example, robust methods work well for combining two normal distributions with different standard deviations.

**4. Prior probability**

A prior probability is a probability distribution derived from inductive reasoning. One of the ways to derive prior probabilities is to use the principle of indifference. The focus of boredom states that if we have N mutually exclusive and comprehensive events, and each of them is a probable event, then the probability of one occurring is equal to one divided by N.

Similarly, the probability of occurrence of a set of K events is equal to K divided by N. A disadvantage of defining possibilities in the above way is that the above method can only be applied to a limited set of events. In Bayesian inference, the terms ignorant prior distribution or objective prior distribution refers to specific choices of prior probabilities. Note that the preceding chance is a broader concept.

**5. Posterior probability distribution**

The posterior probability for a random variable is the conditional probability that is calculated based on previous evidence about the occurrence of that event. This conditional probability is called Posterior Probability. However, Prior Probability is also likely to obtain such evidence.

Similar to the distinction between the concept of prior and posterior in philosophy, in Bayesian inference, an initial distribution represents general knowledge about the data distribution before making an inference. In contrast, a posterior distribution represents the knowledge that includes the results of making an inference.

In Bayesian statistics, the posterior probability distribution is a probability quantity of the probability distribution after observing the evidence (data). In other words, the posterior probability distribution is the conditional probability of that quantity given the condition of seeing the data.

**6. Accepted fruit**

In Bayesian statistics, an acceptable interval (or acceptable Bayesian interval) is an interval in the domain of a posterior probability distribution used in interval estimation. The generalization to multivariate problems is the Bayesian region.

Credible intervals are similar to confidence intervals in quantitative inference, although their philosophy of existence is different. Bayesian intervals, like the estimated and fixed parameter, are random variables, while confidence intervals treat their bounds as random variables and the parameter as a fixed value.

**7. Bayesian estimator**

In estimation algorithms and decision theory, a Bayesian estimator or a Bayesian operator is an estimator or decision rule that minimizes the posterior mathematical expectation of a loss function. Equivalently, this estimator maximizes the posterior expectation of an optimal position.

Another convenient way to formulate an estimator within Bayesian statistics is the posterior probability cluster estimator. In statistics, the maximum posterior estimation of a parameter is the mode of the posterior probability distribution of that parameter.

**8. T-Student distribution**

When determining the approximate mean of samples taken from a random variable, Student’s t-distribution is used. This distribution is the basis of a test called the t-test, which declares the confidence value of the difference between two random variables based on their samples.

The t-student test is used to evaluate the level of convergence or the sameness of the average of the sample with the average of the population in the case that the standard deviation of the population is unknown because the distribution of t in the case of small samples is adjusted using the degrees of freedom, it can be This test was used for tiny pieces.

**9. Chi-square test**

The chi-squared test is used to evaluate the relationship between nominal variables. Pearson’s chi-2 test is used to determine whether there is a statistically significant difference between the frequency of observations and the expected frequency in one or more groups of the consensus table (two-way). In typical applications of this test, words are divided into separate classes.

**10. U-Mann-Whitney test**

Mann-Whitney test is in the group of nonparametric tests and is used to measure the difference between samples. In this test, ranking and cal calculations are done on the ranks. When reporting descriptive statistics that accompany nonparametric difference results, you should report the median and range of change (not the mean and standard deviation) as central tendency and dispersion measures. Median and range of change are more suitable descriptors for nonparametric tests because these tests do not have a regular and free distribution.

The Mann-Whitney test is a nonparametric equivalent of the independent t-test and is used to compare data obtained from different group designs. When the conditions for using parametric tests are not available in the variables, that is, the variables are not continuous and regular, this test is used.

Two samples should be independent, and both should be smaller than ten items. If it is greater than 10, Z-statistics should be used (in computer calculations, the conversion to Z is done automatically). In this test, the shape of the distribution does not have a default meaning; it can be normal or non-normal.

**11. Wilcoxon test**

Wilcoxon signed-rank test is a nonparametric statistical test used to evaluate the similarity of two dependent samples with a rank scale. This test considers the size of the difference between the ranks so that the variables can have different answers or intervals. This test is suitable for before and after designs (one piece in two different situations) or two samples from the same community.

This test corresponds to the dependent two-sample t-test and is a good substitute for it if there are no t-test conditions. The samples used in this test must be matched (paired) concerning their other attributes.

**12. Kolmogorov-Smirnov test **

Kolmogorov-Smirnov test is a non-parametric statistical test. When choosing a statistical test for research, we must decide whether to use parametric or nonparametric tests. One of the main criteria for this selection is the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test shows the non-normality of the data distribution.

That is, it compares the distribution of an attribute in a sample (for example, age among 100 nurses) with the distribution assumed for society (for example, the age of all nurses). If the Kolmogorov-Smirnov test is rejected, the data has a normal distribution, and it is possible to use parametric statistical tests for research. On the contrary, if the Kolmogorov-Smirnov test is confirmed, the data does not have a normal distribution, so we should use nonparametric tests in the research.

**13. Bayesian probabilities**

Bayesian Probability Bayesian reasoning is a probabilistic method for making inferences. This method is based on the principle that every quantity has a probability distribution. By observing new data and reasoning about its probability distribution, optimal decisions can be made. This theorem is valid because it can be used to calculate the probability of an event conditional on the occurrence or non-occurrence of another event.

In many cases, it isn’t easy to calculate the probability of an event directly. The desired possibility can be computed using this theorem and making the desired event conditional on another occasion.

**14. Test of the standard error of the mean**

Z test is a type of statistical test that the distribution of the test statistic under the null hypothesis can be estimated as a normal distribution. Due to the central limit theorem, most of the test statistics for many samples can be calculated approximately with a normal distribution. For each level of significance, the Z test has a critical value (for example, 1.96 for 5% two-sided), which is more convenient than the t-test because there is a specific critical value for each number of samples in the t-test.

Therefore, in most statistical tests, if the variance of the population is known or the number of samples is large, the Z test can be used as an approximation. If the population variance is unknown (and must be obtained from samples) or the number of pieces is small (less than 30), the t-test is more appropriate than this test.

**15. Sampling distribution**

Sample distribution in statistics refers to the distribution of individual observations in a sample.

**16. Sampling distribution**

The distribution of a statistic is called sampling distribution or sampling distribution. Typically, a sampling distribution is used when we have more than one simple random sample of the same population size of data, and these samples are unrelated and independent. In this case, if an individual is included in one piece, it is likely to be present in the following piece.

**17. Variance**

Variance is a measure of dispersion. The value of variance is calculated by averaging the square of the distance between the probable or observed value and the expected value. Compared to the mean, the mean represents the location of the distribution, while the variance is a measure of how spread out the data are around the standard.

The squared variance unit is the primary quantity unit. The variance’s square root, called the standard deviation, has the same team as the primary variable. A lower conflict means that it is expected that if a sample is selected from the distribution, its value will be closer to the mean.

## last word

As you have seen, in mathematics and statistics, there are many points that most companies and schools do not mention during machine learning training because they are specialized and technical topics. Of course, it is impossible to address them in terms of time frame. Therefore, your responsibility is to think about learning and mastering mathematical topics.