Introduction to Statistics for Machine Learning

Statistics is a subject that matters in any data-driven technology: nearly every use case we work on requires statistical concepts to understand and analyze our data. Here’s an explanation of the important topics and key concepts generally used in Machine Learning.

Once you understand these topics, it will be easy for you to understand machine learning algorithms and other important architectures. We will focus on mathematical intuition, because you never fully understand a topic until you know what is going on under the hood. Now, without wasting time, let’s start by defining what exactly we mean by statistics.

What is statistics?

Statistics is defined as the science of collecting, organizing, and analyzing data. Two major areas of statistics are descriptive statistics and inferential statistics. Descriptive statistics consists of organizing and summarizing the data (what has happened), whereas inferential statistics uses those summaries to test hypotheses, reach conclusions, and make predictions (what you can expect).

Hard to digest? Never mind, let’s understand both areas with the help of an example.

Suppose you want to know the most common height of students in your class. You could calculate the mean, median, or mode, right? This is exactly what we call descriptive statistics: we summarize the data with the help of a statistical measure.
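For example, with Python’s built-in statistics module (the heights here are made up for illustration):

```python
import statistics

# Heights (in cm) of students in a hypothetical class
heights = [160, 162, 165, 165, 168, 170, 170, 170, 172, 175]

print("Mean:  ", statistics.mean(heights))    # arithmetic average
print("Median:", statistics.median(heights))  # middle value of the sorted data
print("Mode:  ", statistics.mode(heights))    # most frequent value
```

For these heights the mean is 167.7, the median 169, and the mode 170 — three different one-number summaries of the same data.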

Now suppose you want to know: are the heights of the students in this classroom similar to what you would expect in a typical class at this university? You might say that yes, the heights of the students in this class are very similar to those of the other students in the university. Here you are trying to draw conclusions from your data. There are various statistical tests for this, such as the z-test, t-test, chi-square test, and ANOVA. We will learn about all these tests in an upcoming article; in this article, we will first build up our base and then move ahead.
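As a small preview, a one-sample z-test can be sketched with the standard library alone. The class and university figures below are made up, and this sketch assumes the population standard deviation is known (which is what distinguishes a z-test from a t-test):

```python
import math

def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z statistic and two-sided p-value for a one-sample z-test."""
    z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF, via the error function
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Is a class with mean height 169 cm (n = 30) consistent with a
# university-wide mean of 167 cm and standard deviation of 6 cm?
z, p = one_sample_z_test(169, 167, 6, 30)
```

Here z ≈ 1.83 and p ≈ 0.07, so at the usual 0.05 level we would fail to reject the idea that this class is a typical sample from the university.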

Let’s understand a few important terms which will be repeated over and over in this article.

Population (N): Suppose you want to know the average age of the people in India. Is it possible? Can you reach out to every single person out there and ask their age? Definitely not. In this case, everyone living in India makes up our population. The population size is usually denoted by a capital ‘N’.

Sample (n): What choice do you have now? How can you know the average age of the people living in India? A better approach is to take a few samples from this population and study them. We definitely cannot reach 1.3 billion (130 crore) people, but we can surely reach out to a few of them and ask their age. This process is called sampling. The sample size is usually denoted by a small ‘n’.

Now you must be wondering: on what basis do we take samples from this huge population? Well, statistics has it covered. There are various sampling methods that we can use according to our needs. Let’s understand some of these techniques now:

1. Simple random sampling: - As the name suggests, the samples are selected randomly, which means that each person (in our case) has an equal probability of being selected.
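A quick sketch with Python’s random module, using a hypothetical population of 100 person IDs:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

population = list(range(1, 101))      # IDs of 100 people
sample = random.sample(population, 10)  # each person is equally likely to be picked
print(sample)
```

`random.sample` draws without replacement, so the 10 IDs are all distinct.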

2. Stratified sampling: - If we can divide the population into smaller groups, we can use stratified sampling. Another term for these groups is “strata”, which you may find in books or other articles. Sampling each stratum separately can improve the quality of the data: for example, surveyors who speak the local language may be deployed to rural areas, whereas in urban areas English speakers may be more effective. We could also divide the population into males and females.

3. Systematic sampling: - This is a probability sampling method where we randomly choose a starting point and then pick every element at a fixed interval thereafter. Say you choose an interval of 4: you ask every 4th person you meet about their age as you move ahead. This has certain disadvantages, though. What if you start near an old age home and then decide to ask the age of every 4th person you meet? You will probably end up with mostly 60+ ages, and the sample will be biased by where you started.
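A sketch of systematic sampling over a hypothetical list of 100 people in visiting order; randomizing the start within the first interval means every person still has a chance of being picked:

```python
import random

random.seed(1)

population = list(range(1, 101))  # 100 people in the order we meet them
k = 4                             # sampling interval: every 4th person

start = random.randrange(k)       # random starting point within the first interval
sample = population[start::k]

print(len(sample))  # 25 people out of 100
```

Whatever the start, the interval of 4 over 100 people always yields 25 sampled people.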

4. Convenience sampling: - It is also called voluntary response sampling. Suppose you are running a survey and have mailed it to 2,000 people. Will everyone fill it out? Obviously not; only the people interested in the survey will fill out the form. Hence, in convenience sampling we essentially end up with the people who are willing to share their age (in our example).

Variable: A variable is any characteristic, number, or quantity that can be measured or counted; in other words, a variable is a property that can take on many values. For example, ‘income’ is a variable that can vary between data units in a population, and it can also vary over time for each data unit.
Let’s see various types of variables:
Quantitative variables- These variables are measured numerically, so we can add, subtract, multiply, and divide them to get a meaningful result. Examples are “age” and “number of students in a class”. Quantitative variables are further classified into two types:

  • Discrete variable
  • Continuous variable

Qualitative/categorical variable- These variables take on names or labels and fit into categories. For example, “gender” can have two labels, ‘male’ and ‘female’; another example is “breed of a dog”, which can take values like ‘bulldog’, ‘poodle’, ‘lab’, etc. In classification problems, which make up many of the machine learning problem statements you’ll encounter, the output is a qualitative variable.
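Since models work with numbers, qualitative variables are usually encoded before training. A minimal sketch of label encoding, using the dog-breed example (the integer codes are arbitrary):

```python
# Map each category label to an integer code
breeds = ["bulldog", "poodle", "lab", "poodle", "bulldog"]

labels = sorted(set(breeds))                            # ['bulldog', 'lab', 'poodle']
encoding = {label: i for i, label in enumerate(labels)}

encoded = [encoding[b] for b in breeds]
print(encoded)  # [0, 2, 1, 2, 0]
```

Note that this imposes an artificial ordering on the categories; for unordered labels, one-hot encoding is often preferred.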


In this article we covered the basic topics of statistics; before we dive into complex topics, we need a strong base. In subsequent articles you will learn more about descriptive and inferential statistics: how and where do we use them, and what are their practical applications?




Frequently Asked Questions

What statistics do I need to know for machine learning?

In order to work with machine learning, you need a basic understanding of descriptive statistics, data distributions, and data visualization. It will help you identify the right methods for each task.

What is the best way to learn statistics for machine learning?

A good way is to use online platforms like Udacity, Coursera, Simplilearn, and others that offer courses on statistics for machine learning.

Why is statistics important in machine learning?

Statistics finds great usage in machine learning for drawing conclusions from data and making informed decisions.

What is statistical learning in machine learning?

Statistical learning is a framework for machine learning that draws on the fields of functional analysis and statistics. It helps in finding a predictive function based on data for statistical inference problems.

Is statistics a prerequisite for machine learning?

Statistics is a prerequisite for machine learning. It is applied in the first phase of a machine learning pipeline, when raw data needs to be analyzed to derive informed conclusions.

What are the basic types of data in machine learning?

There are four basic types into which we divide data from a machine learning perspective:

  • Numerical data
  • Text
  • Categorical data
  • Time series data

