Today we are going over some of the different measures of dispersion in statistics. This includes range, interquartile range (IQR), variance, and standard deviation. Each of these tools has its own application and helps us understand a data set a little bit better. We only look at income this time as our example, but we could just as easily have looked at age or education.
We start off by importing our data and loading 30,000 rows of it into a list. This is the same as in the previous parts of this series.
Range is pretty fast to figure out. We start by making an empty list and then loading it with the income data in a loop. Note that we start the loop one element in so that we skip the header, and we filter out all N/A incomes, which are coded as "9999999". Once the income data is in the list, all we need to do is make variables to hold the minimum income and the maximum income. With those in place, we subtract the minimum from the maximum, which yields our range.
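The range steps can be sketched roughly as follows. This is a minimal illustration, not the code from this series: the `rows` list is a made-up stand-in for the data loaded from the CSV file, and `INCOME_COL` and `load_incomes` are hypothetical names chosen for the example.

```python
# Hypothetical sample standing in for the rows loaded from the CSV file.
# The column layout and the values are invented for illustration.
rows = [
    ["age", "income"],   # header row
    ["34", "52000"],
    ["51", "9999999"],   # N/A income code
    ["29", "18000"],
    ["45", "97000"],
    ["62", "31000"],
]
INCOME_COL = 1  # assumed position of the income column


def load_incomes(rows, col):
    """Collect incomes as integers, skipping the header and N/A codes."""
    incomes = []
    for row in rows[1:]:              # start one element in to skip the header
        if row[col] != "9999999":     # filter out N/A incomes
            incomes.append(int(row[col]))
    return incomes


incomes = load_incomes(rows, INCOME_COL)
min_income = min(incomes)
max_income = max(incomes)
income_range = max_income - min_income
print(income_range)  # 79000 for the sample rows above
```

The incomes are converted to integers before comparing them. Comparing the raw strings would sort them alphabetically, so "97000" would count as larger than "120000".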
Next up is the interquartile range. The concept is similar, but we do not count the lower or upper 25% of our dataset. In other words, we only look at the middle 50% of our data. This makes sense if, for example, your data set has some extreme values that may not be important to the questions you want to answer. So, if you were interested in what an 'average' American income may look like, you may not want to include the extreme ends of the data.
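Here is one way the interquartile range could be computed by hand. There are several conventions for locating quartiles. This sketch assumes the "median of each half" convention, which may not be the one used in this series, and the sample incomes and function names are made up for illustration.

```python
def median(sorted_vals):
    """Median of a list that is already sorted and non-empty."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2


def interquartile_range(values):
    """Return (q1, q3, iqr) using the median-of-halves convention.

    Needs at least two values; with an odd count the overall median
    is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1


# Invented incomes with one extreme value at the top end.
incomes = [18000, 31000, 52000, 97000, 24000, 45000, 60000, 1200000]
q1, q3, iqr = interquartile_range(incomes)
print(q1, q3, iqr)  # 27500.0 78500.0 51000.0
```

For this sample the plain range is 1,182,000 because of the single extreme income, while the IQR is 51,000, which shows how the IQR ignores the ends of the data.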
The last part we want to look at is variance and standard deviation. To do this, we need to add our first Python module beyond the csv module we use to initially access our data. We are adding 'math', which gives us access to the math.sqrt() function. We need it to calculate the standard deviation, which is the square root of the variance.
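A minimal sketch of the calculation is below. It assumes the population variance (dividing by the number of values). If the 30,000 rows are treated as a sample, you would divide by one less than that instead. The sample incomes are made up for illustration.

```python
import math

# Invented incomes standing in for the filtered income list.
incomes = [18000, 31000, 52000, 97000]

# Mean: sum of the values divided by how many there are.
mean = sum(incomes) / len(incomes)

# Variance: average of the squared distances from the mean
# (population form, dividing by n; this choice is an assumption).
squared_diffs = [(x - mean) ** 2 for x in incomes]
variance = sum(squared_diffs) / len(incomes)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean)      # 49500.0
print(variance)  # 899250000.0
print(std_dev)
```

The standard deviation is often easier to interpret than the variance because it is in the same units as the data (dollars here), while the variance is in squared units.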
Our next tutorial will look at what goes into the shape of a distribution. That is, skew, kurtosis, and the central limit theorem.
Find us on social media with questions or feedback!