Tuesday 31 May 2011

Stata Snippets: How to export formatted results ready for publication?

This post is all about producing automated quality statistical outputs in Stata. Note that there are so many ways to do this and you don't need to be a geek. Why do we bother ourselves as statisticians writing codes and programs for automated quality outputs? Honestly, I used to write simple do files, save as log files, copying and pasting the contents of the do files to text editor (e.g word). Then creat and edit tables within a text editor. You could tell the pain and complexity of this sequence of events. It's just a sheer waste of time and energy. Imagine if you're working on more than 5 trials say and you need to produce statistical reports every fortnight or monthly. Surely, you'll struggle to find time to see your girlfriend. More so, there is increased risk of making errors during copying and pasting of statistical outputs. The only solution is to invest in time once and write codes and programs to produce outputs straight to the printer. Isn't it that appealing? I don't need to go through a tiresome process each time I need an output but to push the button and make some coffee. I love this approach. It also gives you time to do your own research work.

There are a lot of user written commands in Stata that could be used to produce these outputs (e.g tabout, estout, postfile,parmest, etc). There are no rules of thumb when it comes to writing a code but some few tips will always help. I have my own phrase which goes like "Why should I do it your own way?". Sometimes it works BUT not always. You have to consider some snippets from others here and there. Here are some;
  1. Rule 1: Be smart! I'm not talking about wearing a tie or gold chains "bling bling". A statistician needs to be ORGANISED especially when working on different projects.
  2. Rule 2: Invest in time to produce automated outputs. It's a pain and frustrating at first BUT that's the nature of every learning process. 
  3. Rule 3: Always give the code to your colleague to review. This is important; Of course to err is human, to forgive is divine BUT a statistician should avoid giving out a WRONG output at all cost.
  4. Rule 4: Always check your code on few variables (categorical and continuous) to assertain if it's executing the intended task. Beware of missing values.
  5. Rule 5: Don't be too obsessed with programming and forget the statistical principles in model building.
  6. Rule 6: Push the button and remember to take your loved one for dinner or holiday. LIFE GOES ON!   
 Now we are done with the rules and lets start the real fun. I've tried to prepare a simple program to illustrate my point here. You can copy the contents and paste in Stata do file, and check a simple output. As I've said before, there are many ways of doing it and differ across text editors. For instance, Latex is different and there are special user witten commands in Stata to produce outputs straight in Latex. I'll illustrate this in the coming series. Here I'll concentrate on text editors such as word, notepad etc. Let's suppose we conducted a randomised clinical trials to test the effectiveness of a new intervention using a before and after design, and we need to present baseline characteristics of patients at randomisation and efficacy measures based on ANCOVA model. The code below will show you how to do this and results will be shown on Stata Results window. What you will need to do is to highlight all,copy and paste into notepad or word (use courier new font theme, size 8, and adjust line spacing on paragraph), and print. You could then save as pdf if you want. Done! There is nothing unique about Stata, you could produce outputs in a different way in SAS for example. So enjoy it. 



*****************************************
* Program Author: Munya Dimairo
* Email: mdimairo@gmail.com or m.dimairo@sheffield.ac.uk
* Date: 31-05-2011
****************************************
* Results are automatically exported on stata result window
* formated for publication (copy and paste results in a text editor)
* to be modified a bit
****************************************

cap prog drop exportme
prog define exportme

*set trace on /*for debugging*/
*program syntax [pass the number of observations to generate hypothetical dataset]
syntax [,OBS(integer 4)]
    set more off
    qui drop _all
    version `c(version)'

    confirm integer number `obs'
if _rc!=0{
    di as error "Number of observations should be an integer"
    exit 198
}
else{
 if `obs'<30{
    di as error "Number of Observations should be greater than 30"
    exit 198
 }
 else{

* set number of observations
qui set obs `obs'
    qui gen  id=1000+ _n
    label var id "Patient ID Number"

* now lets simulate hypothetical data set (pre and post) and baseline data: just for illustration
qui gen sex=rbinomial(1,0.35) /*don't ask why i started with SEX :-) */
    qui replace sex=sex+1 /*just want categories to be 1 and 2*/
    label define sex 1"male" 2"female", modify
    label val sex sex
    label var sex "Sex"
qui gen treat=rbinomial(1,0.5)

    qui replace treat=treat+1  /*just want categories to be 1 and 2*/
    label define treat 1"Control" 2"Intervention", modify
    label val treat treat
    label var treat "Intervention Group"

qui gen age=round(rnormal(45,6))
    label var age "Age(years)"

* don't worry about how the variables are generated (that's not the point here)
qui gen qol=round(rnormal(50,5.6))
qui gen qolp= qol + round(rnormal(3,2.5))
    label var qol "Qolife(pre)"
    label var qolp "Qolife(p)"
qui gen dizz=round(rnormal(25,3.5))
qui gen dizzp= dizz - round(rnormal(3,2.5))
    label var dizz "Dizziness(pre)"
    label var dizzp "Dizziness(p)"
qui gen sym=round(rnormal(65,3.5))
qui gen symp= sym - round(rnormal(10,2.5))
    label var sym "Symptoms(pre)"
    label var symp "Symptoms(p)"
qui gen anxi=round(rnormal(72,5.5))
qui gen anxip= anxi - round(rnormal(15,2.5)) 
    label var anxi "Anxiety(pre)"
    label var anxip "Anxiety(p)"
qui gen cd4=round(rnormal(800,40))
qui gen cd4p= cd4+ round(rnormal(50,10)) 
    label var cd4 "CD4 count(pre)"
    label var cd4p "CD4 count(p)"
qui gen total=sym+qol
qui gen totalp=symp+qolp
    label var total "Total(pre)"
    label var totalp "Total(p)"

 * Now we have created our hypothetical data set and let's the fun begins  
***************************************
* the idea is to create baseline characteristics and regression results
* which is ready for publication
* we want to keep editing at its minimum level
***************************************

di _n
di as text _dup(35) "-"
di "Program Author" _col(20) ":" _col(22) "Munya Dimairo"
di "Email"          _col(20) ":" _col(22) "{it:mdimairo@gmail.com}"
di                  _col(20) ":" _col(22) "{it:m.dimairo@sheffield.ac.uk}"
di "Date"           _col(20) ":" _col(22) "`c(current_date)'"
di "Time"           _col(20) ":" _col(22) "`c(current_time)'"
di _dup(35) "-"

di _skip(2)
di _dup(55) "_"
di "Baseline Characteristics of Randomised Participants"
di _dup(55) "_"

qui count
global AN=r(N)
forvalues g=1/2{
    local trt`g': label treat `g'
    qui count if treat==`g'
    global N`g'=r(N)
    global N`g'f: di %3.1f (r(N)/$AN)* 100
    local col=`g'*20
    di _col(`col') as text "`trt`g''" _cont
}
 di _n
 di _col(2) as res "Total[N(%)]" _cont
forvalues g=1/2{
    local col=`g'*20
    di _col(`col') as res "${N`g'}(${N`g'f'}%)" _cont
}
di _n


* Sex
local sex1: variable label sex
di _col(2) "`sex1'[n(%)]"
forvalues i=1/2{
 local sex`i': label sex `i'
    }
forvalues i=1/2{
  forvalues g=1/2{
    qui count if treat==`g' & sex==`i'
    local n`g'`i'=r(N)
    local n`g'`i'f: di %3.1f (`n`g'`i''/${N`g'})*100
    }
if "`i'"=="1"{
    di _col(4) "`sex`i''" _col(20) "`n11'(`n11f'%)" _col(40) "`n21'(`n21f'%)""
}
else if "`i'"=="2"{
    di _col(4) "`sex`i''" _col(20) "`n12'(`n12f'%)" _col(40) "`n22'(`n22f'%)""
}
}
 
di

* Continuous variables
foreach var of varlist age dizz anxi cd4 sym qol total{
local `var'1: variable label `var'
di _col(2) "``var'1'"


* quietly produce summary statistics and save returned results
* tabstat will save into matrix r(Stat1), r(Stat2) and r(StatTotal)
* that is stats in group 1, 2 and all respectively


qui tabstat `var', stats(N mean sd q) by(treat) save
    matrix def treat1=r(Stat1)
    matrix def treat2=r(Stat2)

* extracts the elements of the matrices with respect to position
* that is, row by column position e.g treat1[2,1] means element on row #2 column #1

* save the elements as local macros for later use
* loop over by group to obtain results for group one and two
* this could be extended to include total summaries

forvalues i=1/2{
 local rn`i'=treat`i'[1,1]
 local mu`i': di %3.1f treat`i'[2,1]
 local sd`i': di %3.1f treat`i'[3,1]
 local p25_`i': di %3.1f treat`i'[4,1]
 local p50_`i': di %3.1f treat`i'[5,1]
 local p75_`i': di %3.1f treat`i'[6,1]
}

* now display the results saved above in specified columns
* e.g contents of local macro rn1 will be displayed in col(20)
* contents of local macro rn2 in col(40)

di _col(4) "n" _cont
forvalues i=1/2{
 local col=`i'*(20)
 di  _col(`col') "`rn`i''" _cont
}

* this will dispaly the mean and std deviation as described above
di
di _col(4) "Mean(SD)" _cont
forvalues i=1/2{
 local col=`i'*(20)
 di  _col(`col') "`mu`i''(`sd`i'')" _cont
}

* this will display the median with InterQuartile Range
di
di _col(4) "Median (IQR)" _cont
forvalues i=1/2{
 local col=`i'*(20)-2
 di _col(`col') "`p50_`i''(`p25_`i''-`p75_`i'')" _cont
}

* just skip one row
di _n
}

* just a line of length 55 characters
di _dup(55) "_"

**************Now we want to fit an ancova regression model (say)
*we want to display the following results
*N=Number of Observations used in the model
*Effect=mean difference between the treatment groups adjusted for baseline values (group 1 as reference)
*SE=Standard Error of the Effect
* LCL to UCL=lower and Upper 95% Confidence Limits Respectively

di  _skip(4)
di as text  "Analysis of change in the VBQ (and its subscales) and CD4 count using the ANCOVA model"
di
di _dup(78) "_"
di _col(58) as text  "95% CI"
di "Variable" _col(30) "N" _col(35) "Effect" _col(45) "SE"  _col(55) "LCL to UCL" _col(73) "p value" /*line 168*/
di

*loop over all releveant continuous outcomes
foreach x of varlist dizz anxi cd4 sym qol total{
* count if the baseline and follow-up measurements are not missing (regression set)
    qui count if (`x'!=.&`x'p!=.)
 *save it in the local macro N
    local N=r(N)
 * extract the variable label for the post measurement variables
    * save this in local macro i`x' for later use

    local i`x': variable label `x'p
 * now we quietly and noisly run the ancova model (nothing magic here)
qui regress `x'p `x' i.treat
 * extract the intervention effect and save in macro b
    local b: di %5.2f _b[2.treat]
 * extract the std error of intervention effect and save in macro se
    local se: di %5.2f _se[2.treat]
 * now obtain the t statistic which we will use to calculate the p value and save in macro t
    local t=`b'/`se'
 * obtain the p value[e(df_r)=returned degrees of freedom] under the t-distribution
    local p: di %5.3f 2*ttail(e(df_r),abs(`t'))
 *calculate the lower and upper confidence 95% CI
    local cl: di %5.2f `b'- invttail(e(df_r),0.025)*`se'
    local cu: di %5.2f `b'+ invttail(e(df_r),0.025)*`se'

 * now we're done, we just need to reference the saved macros correponding to each element in LINE 168
di as res "`i`x''" as res _col(30) "`N'" _col(35) "`b'" _col(45) "`se'" _col(55) "`cl' to `cu'" _col(73) "`p'"
di

}
di _dup(78) "_"
}
}
exit

end

exportme,obs(200)
exit