Statistics Series Getting Started

Jacob at IPW

Jan 2, 2019 · 5 min read

Updated: Jan 4, 2019

In this tutorial, we go over setting up Python and getting a data set from IPUMS. We will run a little bit of sample code, then read the data set into Python and display the first 100 lines of data. You can find our corresponding YouTube video at https://youtu.be/QRAJtT7on7A.

Resources Used:

7-Zip: https://www.7-zip.org/

IPUMS: usa.ipums.org/usa/

Transcript Follows:

Hello and welcome! I am Jacob from Intrepid Protoworks. I am a human factors engineering consultant, and today we are continuing our dive into statistics using Python.

Before we get started, we need to actually download Python. For the sake of this tutorial series, I will be assuming that you are all on a Windows operating system. However, once we are into the code itself, it should all pretty much work on a Linux system. The only thing that may be different is the Integrated Development Environment (IDE). To keep this simple, we will be using a Python package that includes a Python IDE.

There will be a link in the description, but we will go through it here. The package we will be using is WinPython. Again, this should all still work on Linux, but you will have to go out and set up the Python environment yourself. So, to get started, head to http://winpython.github.io/ and download the most recent package. There should not be a massive difference between versions as they relate to us. Though, if you want to follow the series exactly, I am downloading WinPython 3.6.7.0Qt5-64bit.

Let’s go ahead and download Python. In a browser window, type winpython.github.io and press enter. As we stated before, we are downloading WinPython 3.6.7.0Qt5-64bit. If you are not sure, typically the most frequently downloaded option is the correct one.

Now we just need to wait for the installer to download. I have sped it up for brevity's sake. Now drag the installer somewhere convenient. I will be working from the desktop for this series but you can choose what you prefer.

Then double click the installer and progress through the prompts. WinPython will then begin to unpack itself. This will take several minutes to complete.

Once done, browse to the folder it was installed in and find Spyder. There are other IDEs out there and you may like others more. If that is the case, feel free to use the IDE of your choice. However, to keep the setup of a Python environment as simple as practical, we are just going to use Spyder.

Go ahead and open Spyder for the first time by double clicking spyder.exe. Once Spyder is open, size the window how you like. I am also going to increase the size of the text. This can be done in the preferences menu under tools.
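With Spyder open, one quick way to confirm which interpreter WinPython installed is to run a short check in the console. This is only a sketch; the variable names are our own, and the version reported will match whichever package you downloaded (3.6.7 if you followed along exactly).

```python
import platform
import sys

# Build a short "major.minor.micro" string from the running interpreter.
version_text = "{}.{}.{}".format(*sys.version_info[:3])

# Report whether the interpreter is a 32-bit or 64-bit build.
bitness = platform.architecture()[0]

print("Python version:", version_text)
print("Build:", bitness)
```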

We are mostly done with the setup of Python now! For the moment, let us set aside Spyder and grab a web browser. To extract the data we will retrieve shortly, we will want to grab 7-Zip from https://www.7-zip.org/. This will allow us to open the archive format the data will be downloaded in. I am skipping the restart because I already have 7-Zip installed. Now, back in our web browser, type usa.ipums.org/usa/. This is where we will be getting our data set for the series.

Next, pause the video and create an account. I already have an account, so I will just log in and proceed. Once logged in, select the “Select Data” menu option to proceed. This is where the fun starts! We can put together a wide range of data sets covering a huge range of variables. For our purpose, I am not particularly interested in household data, so we will be looking at person data. We will go ahead and grab several variables that should be interesting. First, we are going to grab race. Also take a second to note the wide assortment of variables here. Going back, we will look under education next. Again, there are a bunch of options. Here we are interested in Educational Attainment. A couple more variables we may be interested in are sex and age. These can be found under demographic. Lastly, let's look at income for each of these people. Here we just want total income.

That does it for selecting our variables! Let’s go to view cart and then select the samples we want to use for our data set. There is a MASSIVE amount of data available here, and we need to be careful, as we do not want to download something several terabytes in size. So, for this series, we will only use ACS 2017. There are some added cataloging variables here, which is nice, but they are not needed for us. So, we are de-selecting all of the variables labeled “[preselected]”. Next, we need to review the codes. We are just taking note of them here. I have them written down as well. They will be essential going forward. You can see the variables are coded in several different ways. Each of these types of variables needs to be handled differently. If we were to apply the wrong tool to a series of data, our results would at best be unintelligible and at worst confounding. Most of the time it should be pretty obvious what makes sense. For example, what would “average sex” mean, compared with average income?
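To make the “average sex” point concrete, here is a minimal sketch of treating a coded categorical variable differently from a numeric one. The records are made up for illustration and are not IPUMS data. The column names SEX and INCTOT and the 1/2 coding are assumptions, so check them against the codes page of your own extract.

```python
from collections import Counter
from statistics import mean

# Made-up records for illustration only; these are NOT real IPUMS data.
# Column names and the 1/2 coding for sex are assumptions to be checked
# against the codes listed for your own extract.
records = [
    {"SEX": 1, "INCTOT": 42000},
    {"SEX": 2, "INCTOT": 51000},
    {"SEX": 2, "INCTOT": 38000},
    {"SEX": 1, "INCTOT": 61000},
]

# Income is a quantity, so an average is meaningful.
average_income = mean(r["INCTOT"] for r in records)

# Sex is a category stored as a number, so we count each code
# instead of averaging the codes.
sex_counts = Counter(r["SEX"] for r in records)

print("Average income:", average_income)
print("Count per sex code:", dict(sex_counts))
```

Averaging the SEX column would run without error, which is exactly why it is worth writing the codes down: the tool will not stop us from asking a question that has no meaning.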

With a look at the codes out of the way, we need to check the size of our data set. Here it is 73 Megabytes. We need to check the format. We will be using .csv for this series. Once done, we will need to wait several minutes for the request to be processed and then we can download our data. With our data set in hand we can minimize the window and select the folder with our python install. In it we will add a folder and name it “MyScripts”. We will do a lot of our work from here. We will get our data ready to be extracted by dragging it over to our new folder.

Let's go back to Spyder and save our Python script so it is ready to run when we need it. I am saving it as “stats series getting started.py”. Scripts for future tutorials will simply be named after the tutorial. Hit save and it should appear in our folder. Now in Spyder, type “import csv” and “print('I worked!')”. Note that Python is case-sensitive, so both must be typed in lowercase. Then hit save. Once done, go to the “Run” menu and select “Configure”. For our example, we will have our script run in a new dedicated IPython console.
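For reference, the two-line test script looks like this as a file. Python is case-sensitive, so the import statement and the module name are written in lowercase.

```python
# Load Python's built-in CSV module; we will use it once the data is in place.
import csv

# If this line shows up in the console, the environment is working.
print('I worked!')
```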

Now we can go back and extract our archive with our data set. Make sure the CSV file is in the same directory as our Python script. With the data set in place, let's open it up and take a look!
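One way to take that first look is sketched below using only the built-in csv module. The function name and the file name are our own placeholders, not something from the video, so substitute the name of the CSV that came out of your archive.

```python
import csv
import os
from itertools import islice


def read_first_rows(path, limit=100):
    """Return the header row plus up to `limit` data rows from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)         # first line holds the variable names
        rows = list(islice(reader, limit))  # stop early; the full file is large
    return header, rows


# Hypothetical file name: replace it with the name of your own extract.
DATA_FILE = "usa_00001.csv"

# The relative path is resolved against the current working directory.
if os.path.exists(DATA_FILE):
    header, rows = read_first_rows(DATA_FILE)
    print(header)
    for row in rows:
        print(row)
else:
    print("Could not find", DATA_FILE, "in the current working directory.")
```

Reading only the first 100 rows keeps the check fast, since the loop stops early instead of loading the whole extract into memory.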

Thank you for watching! That is everything we needed to accomplish for getting started! Like, subscribe, and share! Next time we will go over reducing our data set and make a start on descriptive statistics.
