Nov 8, 2011

Using Postfile in Stata (+2 examples)

Postfile can be used to generate new, computed datasets and to subset data into new datasets in Stata.

There is an awesome command in Stata you may not yet have heard of called postfile. It lets you create Stata datasets. It can be used for a variety of tasks:

  • For creating subsets of your data.
  • For building datasets of generated statistics.
  • For Monte Carlo-type experiments.


If you haven't already, I suggest creating a folder for code snippets or using a program to manage them. You might try Snippely, which is free and available on Windows, Mac, and Linux as an Adobe AIR application.

The following can be the first snippet you create. If you program in multiple languages you'll want to sort them into groups or folders.

Postfile Snippet
tempname memhold
postfile `memhold' str10 stringvar numvar using "C:/Filename.dta" , replace

post `memhold' (your_str_var) (your_num_var)

postclose `memhold'
The snippet above needs to be changed to fit your needs every time you use it, but because it's quite a bit to memorize, it's better to save this text somewhere you can call upon it when needed. Let's break down what goes on.

Line 1 creates a temporary name (a "tempname") that Stata uses as a handle for the file you are writing. Every postfile-related command that follows references this tempname (`memhold').

Line 2 sets the names of the variables that will be in your new dataset. Here, I've specified the tempname (`memhold') followed by the variables I'm going to save. Order is important: the order you specify them in will be their order in your new dataset. Additionally, you have to declare string variables with str followed by the maximum number of characters (str10 here). This can be confusing because a string variable then takes two words and a numeric variable just one.
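Here is a small sketch of how the storage type attaches to the name that follows it. The variable names (name, value, n_obs) and the temporary file are placeholders I made up for illustration, and I use tempfile instead of a C:/ path so the sketch runs anywhere. A name with no type in front of it is stored as the default numeric type.

```stata
* Sketch only: variable names and the temporary file are placeholders.
tempname memhold
tempfile results

* str10 applies to name, double to value, int to n_obs
postfile `memhold' str10 name double value int n_obs using "`results'", replace

post `memhold' ("first")  (3.14159265358979) (1)
post `memhold' ("second") (2.71828182845905) (2)

postclose `memhold'

* Load the new dataset and check the storage types
use "`results'", clear
describe
```

The describe output should show one string variable and two numeric variables, in the order they were declared.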

Notice that I have used the replace option. That is because I tend to have to rerun postfile scripts frequently as they tend to be written through trial and error. This can be useful but be careful - you may overwrite something you want!

Line 4 writes one observation to your dataset. You have to specify the tempname (`memhold') again. The value for each variable is enclosed in parentheses, and the values follow the order specified in line 2. You will have to loop or repeat this command for every observation you want to write.

Line 6 closes the file, which finishes writing your dataset to disk.
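To see the four steps working together, here is a minimal sketch that posts inside a loop. The variable names (tag, number, square) and the temporary file are my own placeholders, not part of the snippet above.

```stata
* Sketch only: the variable names and temporary file are placeholders.
tempname memhold
tempfile squares
postfile `memhold' str10 tag number square using "`squares'", replace

* One post per observation: five iterations give five observations
forvalues i = 1/5 {
    post `memhold' ("row `i'") (`i') (`i'^2)
}

postclose `memhold'

* Load the finished file to inspect the result
use "`squares'", clear
list
```

Each trip through the loop adds one row, so the number of post commands that run is the number of observations you end up with.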

Example 1: Using Postfile to create subsets of data
The easiest way to create a subset of your data is to use a loop. Suppose you had a dataset in which you wanted only a specific set of observations and two of the variables. You could use postfile to write a file containing this pared-down dataset. This example uses the auto dataset (you can load it using line 1 of the code below). Take a moment to familiarize yourself with the dataset before proceeding.

sysuse auto

tempname memhold
postfile `memhold' str18 car_make car_price using "C:/Filename.dta" , replace

foreach j in 2 4 7 18 23 29 33 38 39 47 51 53 67 74 {
    post `memhold' (make[`j']) (price[`j'])
}

postclose `memhold'
Line 1 loads the auto dataset.

Again, line 3 creates the tempname that the postfile commands refer to.

In line 4, you'll have to change the filename to a location on your computer. Notice in line 4 how the variables I named correspond to the items in parentheses in line 7.

Line 6 starts a loop over the observation numbers we want to copy into our new dataset.

Line 7 writes the observation (using subscripting), but because it is looped it will write an observation for 2, 4, 7, etc. So for the first time through the computer would think something along these lines:
  • post `memhold' (make[`j']) (price[`j'])
  • post `memhold' (make[2]) (price[2])
  • post `memhold' ("AMC Pacer") (4749)
This would then be added as an observation, and the loop would move to the next number. After looping through all of the numbers we hit line 10 which saves the file. The result looks like this:
Subset of auto data created using postfile command in Stata
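If you want to check the result yourself, one way is to load the new file and look at it. This sketch repeats Example 1 but, as an assumption on my part, writes to a tempfile (named subset here) rather than a C:/ path so that it runs on any machine.

```stata
sysuse auto, clear

tempname memhold
tempfile subset
postfile `memhold' str18 car_make car_price using "`subset'", replace

foreach j in 2 4 7 18 23 29 33 38 39 47 51 53 67 74 {
    post `memhold' (make[`j']) (price[`j'])
}

postclose `memhold'

* Load the subset: it replaces the auto data in memory
use "`subset'", clear
describe
list in 1/3
```

The loop lists 14 observation numbers, so the new dataset should have 14 observations and the 2 variables named in the postfile line.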
Example 2: Using Postfile to Save Statistics
If you know how to use the return list then you can create tables of returned data (statistics) for further manipulation, or organization. The following example will save the average weight of cars by mpg from the auto dataset. Take a moment to familiarize yourself with the dataset before proceeding.

To simplify writing your postfile scripts, you might want to try running a statistical test and then adapting it to run within a loop and for use with postfile.

I ran a simple statistical test and checked the return list to see what data we could grab for postfile.
I used the return list command to see what values were available. All of them [r(N), r(sum_w), r(mean), etc.] could be inserted into a new dataset using postfile, but we would have to declare a variable for each one in the postfile command. I'm only interested in mean weights by mpg.
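For reference, the exploratory step described above can be sketched like this. Nothing here is specific to postfile; it only shows where the returned values come from.

```stata
sysuse auto, clear

* Run the test once by hand for a single value of mpg
summarize weight if mpg == 12

* Show everything summarize left behind in r()
return list

* Any of these can be used in an expression, for example:
display r(N)
display r(mean)
```

Whatever shows up in the return list is a candidate for a post command later on.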

Notice in the code below how I adapted summarize weight if mpg == 12 for use within a loop and within postfile so that it applies to all of the mpg values, which range from 12 to 41 (lines 6 and 7).

sysuse auto

tempname memhold
postfile `memhold' mpg avg_weight using "C:/Filename.dta" , replace

forvalues mpg =  12/41 {
    summarize weight if mpg == `mpg'
    if "`r(mean)'" != "" {
        post `memhold' (`mpg') (`r(mean)')
    }
}

postclose `memhold'

Remember to change the filename to an appropriate location on your computer (line 4).

Again, notice that the order of the variables following the postfile command corresponds to each piece of data saved under the post command (line 4, line 9).

Line 6 starts a loop which cycles through every whole number from 12 to 41, the range of mpg in the dataset.


Line 7 uses the command summarize to calculate the mean weight for each value of mpg. 

Line 8: I had to use an if command to prevent empty values from being written. If you look closely at the dataset you will see that mpg ranges from 12 to 41 but doesn't include certain values, like 13. No mean is returned for mpg == 13, so `r(mean)' expands to nothing. This if statement is actually necessary to prevent an error: post expects a value in every set of parentheses and without one it will stop working.
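A different way around the empty-value problem, which is not what I did above, is to loop only over the mpg values that actually occur. This sketch uses levelsof for that; the local name mpg_values and the tempfile are placeholders of mine.

```stata
sysuse auto, clear

tempname memhold
tempfile means
postfile `memhold' mpg avg_weight using "`means'", replace

* levelsof stores the distinct values of mpg found in the data
levelsof mpg, local(mpg_values)

foreach m of local mpg_values {
    quietly summarize weight if mpg == `m'
    * Every value in the list has at least one car, so r(mean) exists
    post `memhold' (`m') (r(mean))
}

postclose `memhold'

use "`means'", clear
list
```

Because the loop never visits a value like 13, the if check is no longer needed. The trade-off is one extra command before the loop.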

Line 9: The first variable in our new dataset is the mpg. I used `mpg', which is set by the current iteration of our loop, as the mpg variable. To insert the average weights for each mpg, I inserted (`r(mean)') in the second slot to correspond with its location as the second variable - avg_weight.

Line 13: As usual, closes and saves the file.

The end result of Example 2.
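The Monte Carlo use mentioned at the top follows the same pattern as Example 2: generate data, run a command, post what it returns. Here is a rough sketch under my own assumptions: 200 replications, 50 standard normal draws each, an arbitrary seed, and made-up variable names (rep, sample_mean, sample_sd).

```stata
* Sketch only: seed, sample size, and replication count are arbitrary.
clear
set seed 12345

tempname sim
tempfile simresults
postfile `sim' rep sample_mean sample_sd using "`simresults'", replace

forvalues r = 1/200 {
    quietly {
        * Start each replication with a fresh sample of 50 draws
        drop _all
        set obs 50
        generate x = rnormal()
        summarize x
    }
    post `sim' (`r') (r(mean)) (r(sd))
}

postclose `sim'

* One observation per replication
use "`simresults'", clear
summarize sample_mean sample_sd
```

The data in memory are thrown away on every pass, but the posted results are not affected because they are written to the separate postfile.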

Comments:

  1. Thanks for sharing this. In explaining the second example, I couldn't follow your lines. I think it will be great if you shift numbers up!
    Cheers,
    Eilya.

  2. Perfect! exactly what I needed!