Rather than a transcript this time, we are going to do a short write-up of each part of the code we discussed in the corresponding video.
Here we open the CSV file we obtained in the previous post and pick up right where that post ended.
In this case we are limiting how much of the file we read. This is optional, and you could load the full CSV file into the data_list list instead. The calculations will take longer, and how much longer depends on your own computer. As this data set was obtained for the purpose of these tutorials, I do not think using the entire thing is needed right now.
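As a rough sketch of that loading step (not the exact code from the video), the snippet below reads rows into data_list and stops after a set number of rows. The column names, the sample values, the header row, and the load_rows helper are all assumptions made for illustration. An in-memory string stands in for the real CSV file so the example runs on its own.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for the real CSV file. The column names and values
# are made up for illustration and do not come from the actual extract.
SAMPLE_CSV = """YEAR,RACE,EDUC,INCOME
2015,1,6,42000
2015,2,7,999999
2015,1,10,58000
2015,3,6,32000
"""


def load_rows(handle, max_rows=None):
    """Read a CSV file handle into a list of rows, skipping the header line.

    max_rows limits how many data rows are kept; None keeps all of them.
    """
    reader = csv.reader(handle)
    next(reader, None)  # assumes the first line is a header
    return list(islice(reader, max_rows))


# With a real file you would use: open("your_file.csv", newline="")
data_list = load_rows(io.StringIO(SAMPLE_CSV), max_rows=3)
print(len(data_list))  # 3
```

Passing max_rows=None loads everything, so switching between the limited and the full data set is a one-argument change.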
Now we are ready to calculate the mean. I chose income for this as it highlights the use of a continuous variable, and there are a couple of "aha" moments in how the data is coded.
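Here is a minimal sketch of the mean calculation, not the original listing. It assumes each row is a list of strings, that income sits in a hypothetical column position INCOME_COL, and that 999999 is the code marking a row with no usable income. The sample rows are invented.

```python
INCOME_COL = 3    # hypothetical position of the income column
NA_CODE = 999999  # code treated as "N/A" for income in this sketch

# Invented sample rows: year, race code, education code, income
data_list = [
    ["2015", "1", "6", "42000"],
    ["2015", "2", "7", "999999"],
    ["2015", "1", "10", "58000"],
    ["2015", "3", "6", "32000"],
]


def mean_income(rows, col=INCOME_COL, na_code=NA_CODE):
    """Average the income column, skipping rows coded as N/A."""
    total = 0
    count = 0
    for row in rows:
        income = int(row[col])  # CSV values arrive as strings
        if income != na_code:
            total += income
            count += 1
    # Avoid dividing by zero when every row was filtered out
    return total / count if count else None


print(mean_income(data_list))  # 44000.0
```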
The most important catch here is that some of the data points carry a numerical code that means "N/A". These codes/exceptions are found on use.ipums.org/usa. In our code above, you will see we filter out any rows of data with 999999 as the income using an "if" statement. Also, pay attention to the use of "int()", as it is important. Values read from a CSV file are strings, so without it we would be comparing a string to an integer and the filter would never match (you will make this mistake at some point).

Next up is the mode. We chose to evaluate the "race" variable as it is categorical in type.
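The counting idea for the mode can be sketched as follows. This is a variant rather than the code from the video: instead of one variable per race code, it keeps the counters in a single Python list indexed by code. The column position, the assumption that codes are small non-negative integers, and the sample rows are all hypothetical.

```python
RACE_COL = 1   # hypothetical position of the race column
MAX_CODE = 9   # assumes race codes are integers from 0 to 9

# Invented sample rows: year, race code, education code, income
data_list = [
    ["2015", "1", "6", "42000"],
    ["2015", "2", "7", "999999"],
    ["2015", "1", "10", "58000"],
    ["2015", "3", "6", "32000"],
    ["2015", "1", "8", "27000"],
    ["2015", "2", "6", "51000"],
]

# counts[code] holds how many rows carry that race code
counts = [0] * (MAX_CODE + 1)
for row in data_list:
    counts[int(row[RACE_COL])] += 1

highest = max(counts)              # the largest count
mode_code = counts.index(highest)  # the code that has that count

print(mode_code, highest)  # 1 3
```

Note that max() returns the count itself, so counts.index() is needed to get back to the code, and the result is still a bare number with no label attached.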
This is pretty simple. All we are doing is adding 1 to a variable named after a given race code each time that code appears. Then at the end we use max() to find the highest count. Unfortunately, the way we did this does not easily allow for a label. We could rewrite it to use a Python dictionary, which would let us tie text strings to the data as labels. For now, though, I want to stay with Python lists, as bringing in too many Python data structures would dilute what these tutorials are for. Down the line, we can look at elaborating on this further.
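For readers who want a preview of the dictionary approach mentioned above, here is one possible sketch. The labels are deliberate placeholders, not the real meanings of the codes; the actual labels would need to come from the data set's codebook. The column position, the mode_with_label helper, and the sample rows are also assumptions.

```python
RACE_COL = 1  # hypothetical position of the race column

# Placeholder labels only. Look up the real ones in the codebook.
RACE_LABELS = {1: "Group A", 2: "Group B", 3: "Group C"}

# Invented sample rows: year, race code, education code, income
data_list = [
    ["2015", "1", "6", "42000"],
    ["2015", "2", "7", "999999"],
    ["2015", "1", "10", "58000"],
    ["2015", "3", "6", "32000"],
    ["2015", "1", "8", "27000"],
    ["2015", "2", "6", "51000"],
]


def mode_with_label(rows, col, labels):
    """Return (code, label, count) for the most frequent code in a column."""
    counts = {}
    for row in rows:
        code = int(row[col])
        counts[code] = counts.get(code, 0) + 1
    # max() with key=counts.get returns the code with the highest count
    top = max(counts, key=counts.get)
    return top, labels.get(top, "unknown"), counts[top]


print(mode_with_label(data_list, RACE_COL, RACE_LABELS))  # (1, 'Group A', 3)
```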
Lastly, we are looking at the median. Mathematically, it is about as simple as can be. All you need to do is find the middlemost data point in an ordered list. In Python, we need to take a couple of extra steps to make this happen. Mainly, because we want list.sort() to order whole rows by a single column, we need to give it a key, and a key is a function that tells list.sort() which value to compare. Then we just sort the data and use len() to get the length of the data, which gives us the middle position when divided by two. Typically you would want to check whether the middle is not a whole number (which happens when there is an even number of data points) and then take the average of the two nearest data points. However, for simplicity's sake, we rounded. In practice this should be fine for everything outside a statistics test. I chose educational attainment as median education arguably has some meaning to it.
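The median steps can be sketched like this. It is an illustration under assumptions rather than the code from the video: the education column position and the sample rows are hypothetical, and sorted() is used in place of list.sort() so the original list is left untouched. The second function shows the textbook handling of an even number of data points for comparison.

```python
EDUC_COL = 2  # hypothetical position of the education column

# Invented sample rows: year, race code, education code, income
data_list = [
    ["2015", "1", "6", "42000"],
    ["2015", "2", "7", "999999"],
    ["2015", "1", "10", "58000"],
    ["2015", "3", "6", "32000"],
]


def educ_key(row):
    """Key function: tells the sort to compare rows by education, as a number."""
    return int(row[EDUC_COL])


def median_simple(rows):
    """Shortcut version: take the row at position len // 2 after sorting."""
    ordered = sorted(rows, key=educ_key)
    middle = len(ordered) // 2  # integer division, so always a valid index
    return educ_key(ordered[middle])


def median_exact(rows):
    """Textbook version: average the two middle values when the count is even."""
    values = sorted(educ_key(row) for row in rows)
    n = len(values)
    middle = n // 2
    if n % 2 == 1:
        return values[middle]
    return (values[middle - 1] + values[middle]) / 2


print(median_simple(data_list))  # 7
print(median_exact(data_list))   # 6.5
```

With an odd number of rows the two functions agree. With an even number they can differ slightly, as the sample shows, and for a coded variable like education the averaged value may not correspond to any real code.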
That does it for our look at measures of central tendency using Python! If you have any questions, please hit us up on our various social media!