Statistics with Python [Beginner's Guide][Part 1/2]

Igorps
4 min read · Dec 26, 2021


Source of image: Shane Lynn

The main objective of this article is to show useful content in the shortest time possible. It does not replace the more general approach of studying from books.

What you’ll see in the first part of this post:

  • Mean, median, mode
  • Percentiles
  • Dispersion Measures — Range (Max-Min), Variance, Standard deviation
  • Graphics (Histogram and Boxplot) — Using matplotlib

To tie all the topics together, we’ll start with basic examples and then apply them to more complex datasets from Kaggle [Part 2].

Let’s code!

We will work with a small dataframe.

Basically, 10 students and the grade associated with each of them (n = 10).
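As a minimal sketch, a dataframe like this can be built with pandas. The grades below are hypothetical (they are not the original data of this post); they were only picked to agree with what the text says later: 10 students, a lowest grade of 20, a highest grade of 100, and 30.0 and 90.0 as the most frequent values. The column names `student` and `grade` are assumptions too.

```python
import pandas as pd

# Hypothetical grades, chosen only to be consistent with the article's text
df = pd.DataFrame({
    "student": [f"Student {i}" for i in range(1, 11)],
    "grade": [90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0],
})

print(df)
print("n =", len(df))
```

The same hypothetical grades are reused in the other sketches of this post so the numbers stay comparable.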

Mean

Remember: the mean is a measure of central tendency.

Sort the data by grade value:

The sorted data shows that 7 students have lower grades and 3 have higher grades.

  • In this case, the .mean() does not describe the performance of each student very well.
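Here is a small sketch of the sorting and the mean, assuming a pandas dataframe with a `grade` column. The grades are hypothetical values picked to match the description above (7 lower grades, 3 higher ones), not the original data.

```python
import pandas as pd

# Hypothetical grades (assumption): 7 lower grades and 3 higher ones
df = pd.DataFrame({"grade": [90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0]})

# Sort the rows by grade, from lowest to highest
sorted_df = df.sort_values("grade").reset_index(drop=True)
print(sorted_df)

# Arithmetic mean of the grades
mean_grade = df["grade"].mean()
print("mean:", mean_grade)

# Count how many students fall on each side of the mean
below_mean = int((df["grade"] < mean_grade).sum())
above_mean = int((df["grade"] > mean_grade).sum())
print(below_mean, "students below the mean,", above_mean, "above it")
```

With these values the mean sits in a gap where no student actually is, which is why it describes individual performance poorly.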

Median

Let’s calculate the .median(), another measure of central tendency that is less affected by outliers.

Remember:

  • If the data set has an odd number of observations, the middle one is selected.
  • [1, 3, 3, 6, 7, 8, 9] → the median is 6
  • If the data set has an even number of observations, there is no distinct middle value and the median is usually defined to be the arithmetic mean of the two middle values.
  • [1, 2, 3, 4, 5, 6, 8, 9] → the median is (4 + 5) / 2 = 4.5
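The two cases above can be checked quickly. This sketch uses `median` from Python's standard `statistics` module instead of pandas, just to keep it tiny; the variable names are only illustrative.

```python
from statistics import median

odd_sample = [1, 3, 3, 6, 7, 8, 9]       # 7 values: the 4th one is the middle
even_sample = [1, 2, 3, 4, 5, 6, 8, 9]   # 8 values: average the 4th and 5th

median_odd = median(odd_sample)
median_even = median(even_sample)

print(median_odd)   # 6
print(median_even)  # 4.5
```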

Mode

The mode is the most frequent value in our data set. On a histogram or bar chart it corresponds to the highest bar. You can, therefore, sometimes consider the mode as being the most popular option.

The values that appear most frequently are 30.0 and 90.0, so this data set has two modes. When a single value is the most frequent one, only one result is returned.
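A short sketch of both situations with pandas `.mode()`, which returns every value tied for the highest frequency. The grades are hypothetical, chosen so that 30.0 and 90.0 each appear twice as in the text; the second series is just an illustrative example with a single mode.

```python
import pandas as pd

# Hypothetical grades (assumption): 30.0 and 90.0 both appear twice
grades = pd.Series([90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0])

# .mode() returns a Series with all the most frequent values, sorted
modes = grades.mode()
print(modes.tolist())        # [30.0, 90.0]

# When one value clearly wins, the result has a single element
single_mode = pd.Series([1, 2, 2, 3]).mode()
print(single_mode.tolist())  # [2]
```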

Percentiles (Quartiles)

A more detailed way to look at the distribution of data is percentiles.

There is more than one definition for percentiles! Let’s introduce one of the possibilities.

Consider a set of values:

  • v1, v2, …, vn

sorted from smallest to largest, and a value P in (0, 100).

  • The P-th percentile is the value at position P(n+1)/100.

When the position is not an integer, the percentile is calculated from the values at the two nearest integer positions.

The idea is to calculate a position that separates the smallest P% of the values from the rest.

For example, the 25th percentile separates the smallest 25% of the values from the largest 75%.
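To make the position rule concrete, here is a minimal sketch. The function name `percentile_by_position` is hypothetical, the grades are hypothetical too, and linear interpolation between the two neighbouring positions is an assumption: the definition above only says that the nearest integer positions are used.

```python
import math

def percentile_by_position(values, p):
    """P-th percentile taken at 1-based position p * (n + 1) / 100.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring sorted values.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100
    # Keep the position inside the data for very small or very large p
    position = min(max(position, 1), n)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    low_value = ordered[lower - 1]   # convert 1-based position to 0-based index
    high_value = ordered[upper - 1]
    return low_value + fraction * (high_value - low_value)

# Hypothetical grades (n = 10)
grades = [90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0]

p25 = percentile_by_position(grades, 25)  # position 2.75
p50 = percentile_by_position(grades, 50)  # position 5.5
p75 = percentile_by_position(grades, 75)  # position 8.25
print(p25, p50, p75)
```

For P = 50 the position is 5.5, so the result is halfway between the 5th and 6th sorted values, which is the same as the median.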

The 25th, 50th, and 75th percentiles are called quartiles.

The 25th percentile is called the first quartile.
The 50th percentile is called the second quartile.
The 75th percentile is called the third quartile.

To calculate the percentiles, we use the percentile function. See how it works:

  • 25th, 50th and 75th percentiles of the grades
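A sketch of the quartiles, assuming the percentile function meant here is NumPy's `np.percentile` and using hypothetical grades. By default NumPy interpolates linearly using a slightly different position rule than P(n+1)/100, so its results can differ from a hand calculation with that rule; this is the "more than one definition" point in practice.

```python
import numpy as np

# Hypothetical grades (assumption, not the article's original data)
grades = np.array([90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0])

# 25th, 50th and 75th percentiles in one call (NumPy's default linear method)
q1, q2, q3 = np.percentile(grades, [25, 50, 75])

print("Q1 (25th percentile):", q1)
print("Q2 (50th percentile):", q2)
print("Q3 (75th percentile):", q3)
```

The second quartile always equals the median, which is a handy sanity check.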

Dispersion Measures

Dispersion measures are used to evaluate how spread out the data are. The simplest ones come from the minimum and maximum values.

  • .min()
  • .max()
  • .max()-.min() known as Range
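The three items above in a few lines of pandas, using hypothetical grades whose lowest and highest values (20 and 100) match the ones mentioned for the histogram in this post.

```python
import pandas as pd

# Hypothetical grades (assumption)
grades = pd.Series([90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0])

lowest = grades.min()
highest = grades.max()
grade_range = highest - lowest  # Range = Max - Min

print("min:", lowest)
print("max:", highest)
print("range:", grade_range)
```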

Variance

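Variance measures how far a data set is spread out. It is mathematically defined as the average of the squared differences from the mean. Note that pandas' .var() divides by n − 1 instead of n by default (sample variance), so its result is slightly larger than this definition gives.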
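A sketch that follows the definition step by step and compares it with pandas, on hypothetical grades. The `ddof` argument of `.var()` controls the divisor: `ddof=0` divides by n (the definition above), while the pandas default `ddof=1` divides by n − 1.

```python
import pandas as pd

# Hypothetical grades (assumption)
grades = pd.Series([90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0])

# Definition: average of the squared differences from the mean
mean_grade = grades.mean()
manual_variance = ((grades - mean_grade) ** 2).mean()

# Same quantity through pandas (divide by n)
population_variance = grades.var(ddof=0)

# pandas default: sample variance (divide by n - 1)
sample_variance = grades.var()

print("manual (divide by n):", manual_variance)
print("var(ddof=0):         ", population_variance)
print("var() default:       ", sample_variance)
```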

Standard Deviation

The square root of the variance is the standard deviation. Variance is expressed in squared units, so it only gives a rough idea of spread. The standard deviation is in the same units as the data (here, grade points), which makes it easier to read as a typical distance from the mean.

What can you deduce about the grades from the standard deviation obtained? Are the grades homogeneous?
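A small sketch of the relation between the two measures, again with hypothetical grades. As with `.var()`, pandas' `.std()` uses n − 1 by default, so both versions are shown.

```python
import math
import pandas as pd

# Hypothetical grades (assumption)
grades = pd.Series([90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0])

# Standard deviation as the square root of the variance (divide by n)
population_std = math.sqrt(grades.var(ddof=0))

# pandas default (divide by n - 1)
sample_std = grades.std()

print("mean grade:        ", grades.mean())
print("std (divide by n): ", round(population_std, 2))
print("std (pandas default):", round(sample_std, 2))
```

With these values the standard deviation is more than half of the mean, which suggests the grades are far from homogeneous.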

Don’t worry! You can get all of this information with very little code by using the command: .describe()
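For instance, on a series of hypothetical grades, `.describe()` returns the count, mean, standard deviation, minimum, the three quartiles and the maximum in a single call:

```python
import pandas as pd

# Hypothetical grades (assumption)
grades = pd.Series(
    [90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0],
    name="grade",
)

summary = grades.describe()
print(summary)

# Individual entries can be read by label
print("median (50%):", summary["50%"])
```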

Graphics

  • Creating graphs to represent our data can be very useful for quickly getting information.

Histogram

  • Note: We can quickly visualize the students' performance.
  • In our case, the categories (bins) divide the values from 20 (lowest grade) to 100 (highest grade) into 10 intervals, and each category spans 8 grade points.
  • For example, the first category covers grades from 20 to 28. The second goes up to 36 (and so on).
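A sketch of such a histogram with matplotlib, using hypothetical grades that run from 20 to 100. `bins=10` is matplotlib's default and is written out here only to make the 8-point-wide categories explicit; the axis labels are illustrative.

```python
import matplotlib.pyplot as plt

# Hypothetical grades (assumption), from 20 (lowest) to 100 (highest)
grades = [90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0]

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges
counts, edges, _ = ax.hist(grades, bins=10, edgecolor="black")
ax.set_xlabel("Grade")
ax.set_ylabel("Number of students")

print("bin edges:", edges)    # 20, 28, 36, ... up to 100
print("students per bin:", counts)

# plt.show()  # uncomment in an interactive session to display the figure
```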

We can specify the number of categories. See an example with only 2 categories:
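As a sketch with hypothetical grades, passing `bins=2` to matplotlib's `hist` splits the range 20 to 100 into two halves:

```python
import matplotlib.pyplot as plt

# Hypothetical grades (assumption)
grades = [90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0]

fig, ax = plt.subplots()
# Only two categories: 20 to 60 and 60 to 100
counts, edges, _ = ax.hist(grades, bins=2, edgecolor="black")
ax.set_xlabel("Grade")
ax.set_ylabel("Number of students")

print("bin edges:", edges)
print("students per bin:", counts)

# plt.show()  # uncomment in an interactive session to display the figure
```

With these values the two bars reproduce the split seen in the sorted data: 7 students in the lower half and 3 in the upper half.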

Another type of graph commonly used in Statistics is the box plot.

The box plot presents a variety of useful information:

  • The green line inside the box is the median.
  • The bottom edge of the box is the first quartile (Q1), that is, the 25th percentile.
  • The top edge of the box is the third quartile (Q3), that is, the 75th percentile.
  • The lower black line (whisker) ends at the smallest value in the data that is not below Q1 − 1.5·IQR (IQR = Q3 − Q1).
  • The upper black line (whisker) ends at the highest value in the data that is not above Q3 + 1.5·IQR.

IQR stands for Interquartile Range

Points below the lower whisker or above the upper whisker are called outliers.

In our case, there were no outliers. When they exist, they are represented by circles.
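The box plot rules above can be reproduced by hand. This is a sketch with hypothetical grades: the quartiles come from `np.percentile` with its default method, and the plot uses matplotlib's `boxplot` with its default 1.5·IQR whiskers. Colors depend on the plotting library and style, so the median line may not be green in your figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical grades (assumption)
grades = np.array([90.0, 30.0, 45.0, 100.0, 20.0, 35.0, 90.0, 50.0, 30.0, 40.0])

# Box: first and third quartiles, and the interquartile range
q1, q3 = np.percentile(grades, [25, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data values still inside the limits
lower_whisker = grades[grades >= lower_fence].min()
upper_whisker = grades[grades <= upper_fence].max()
outliers = grades[(grades < lower_fence) | (grades > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("whiskers:", lower_whisker, "to", upper_whisker)
print("outliers:", outliers)

fig, ax = plt.subplots()
ax.boxplot(grades)
ax.set_ylabel("Grade")

# plt.show()  # uncomment in an interactive session to display the figure
```

With these values both limits fall outside the range of the data, so the whiskers simply reach the minimum and maximum grades and no outliers are drawn.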

Source of image: Finxter

I hope this article was helpful. More content will be posted soon, including [Part 2]. ;)
