A SAS® Macro for measuring and testing global balance of categorical covariates

(1)

A SAS® Macro for measuring and testing global balance of categorical covariates

Camillo, Furio and D’Attoma,Ida

Dipartimento di Scienze Statistiche, Università di Bologna

via Belle Arti,41- 40126-

Bologna, Italy

November 22, 2010

Professor Furio Camillo is an Associate Professor of Business Statistics and Data Mining at

Department of Statistical Sciences, University of Bologna (Italy). He studies the applications of data

mining in private and public organizations in the areas of marketing, customer relationship

management and policy evaluation.

E-mail: furio.camillo@unibo.it

Dr. Ida D’Attoma is a Research Fellow at Department of Statistical Sciences, University of Bologna

(Italy). Her research interests include micro data mining, causal inference in observational studies,

subgroup analysis, and new methods for public and private program evaluation.

E-mail: Ida.dattoma2@unibo.it

Abstract

We developed a SAS® Macro [1] to simultaneously measure and test global balance of categorical

covariates. The purpose of the %BALANCE macro is to check global balance and test it by subgroups,

no matter how groups are obtained. A subgroup could be a bin of a Propensity Score Analysis or a

group of any classification method.

For each group we measure and test global balance according to the multivariate measure and its test

introduced in Camillo and D’Attoma (2010) [2] and D’Attoma and Camillo (2010) [3].

The %BALANCE Macro use the SAS/Iml matrix language to obtain such measure. It generates as

output the global balance measure and its test by subgroups. The user will choose a set of parameters

that define the balance measure and its multivariate test.

(2)

The use of such Macro can significantly reduce the amount of time required to simultaneously test

balance of any number of categorical covariates by subgroups.

Description

Given a set of pre-treatment categorical covariates involved in the selection processs, the %BALANCE

macro simultaneously checks balance of all categorical covariates and tests its statistically significance

in a multivariate ways accordinding to the GI formula and the multivariate test introduced in D’Attoma

and Camillo (2010) [3].

The macro uses the Sas/iml matrix language that allows to compute the GI formula [3]:



 





T 1 t Q J 1 j _._t _._j 2 tj

1 k

k

b

Q

1 GI

where Q is the number of all categorical covariates introduced in the analysis,

b is the number of

_tj

units with category

j



J

Q

in the treatment group

t

 ,

T

k is the treatment group size and

.t

k is the

.j

number of units with category

j



J

Q

. Furthermore, the %BALANCE macro allows to test the GI

significance according to the following confidence interval [3]:

  











 



  

nQ

,

0 GI

2 , 1 J 1 T

where T is the number of treatment levels, J is the number of the categories of the Q categorical

covariates and n is the number of units.

Usage

The %BALANCE Macro takes the following named parameters. The arguments may be listed within

parentheses in any order, separated by commas. The %BALANCE macro uses the %DUMMY macro

(Friendly,2001) [4] to create the disjunctive table given the Q pre-treatment categorical covariates. The

%DUMMY macro must be downloaded from http://www.datavis.ca/sasmac/dummy.html and saved in

the specified PATH=.

The %DUMMY macro must be parametrized as follows:

%DUMMY(data=_&cluster, out=disj_&cluster, var=&balance_var &treat, base=_last_, prefix=,

format=, name=VAL, fullrank=0 );

(3)

Parameters

LIBRARY= The name of the SAS library

DATA= The name of the input dataset. The input dataset contains the categorical

covariates, the treatment indicator variable, the ID variable and a variable

that indicates the group membership for each unit. A group could be the result

of any classification analysis or a bin of a Propensity Score Analysis,

conducted separately before running such Macro.

OUT= The name of the output dataset. For each group it reports the number of units

within the group (n), the number of units in the treatment group (n_t1) , the

number of units within the control group (n_t2), the group membership

indicator (id_clu), the value of the balance measure (GI), the upper limit of

the confidence interval ( CHI ) and the balance result (BALANCE).

BALANCE=yes if the GI measure is between 0 and the upper limit of the

confidence interval; BALANCE=no otherwise.

FIRSTCLU= The number of the first group to analyze in the GROUP_VAR. It is a numeric

value.

LASTCLU= The number of the last group to analyze in the GROUP_VAR. It is a numeric

value.

GROUP_VAR= The name of the input variable denoting the units group membership.

BALANCE_VAR= The name (s) of the input variable(s) to be balance checked. The name(s)

may be listed in any order. The variable(s) must be numeric. No missing

values are allowed.

Q= The number of categorical variables on which simultaneously check balance.

The variable must be numeric.

TREAT= The name of the treatment indicator variable. The variable must be numeric.

ALPHA= The alpha level for the multivariate imbalance test. It is a numeric value.

PATH= The path where the folder containing datasets and dummy.sas file is located

(4)

/*********************************************************************************/

/* MACRO NAME: Balance */

/* TITLE: Macro to simultaneously test global balance of categorical */

/* covariates */

/*********************************************************************************/ /* AUTHORS: Camillo, Furio furio.camillo@unibo.it */ /* D’Attoma, Ida Ida.dattoma2@unibo.it */ /* CREATED: 07 June 2010 */ /* REVISED: 22 November 2010 */ /*********************************************************************************/

%MACRO Balance(

library= , /* name of the SAS lilbrary */ data= , /* name of input dataset */ out= , /* name of output dataset */ firstclu= , /* the number of the first group to analyze */ lastclu= , /* the number of the last group to analyze */ group_var= , /* the name of the classification variable */ balance_var=, /* the name(s) of variable(s) to be balance

checked */ Q= , /* the number of categorical variables to be balance checked */ treat= , /* the name of the treatment indicator variable*/ alpha= , /* the alpha level for the multivariate test */ path= ; /* the path where the dummy.sas macro is saved */ );

/*********************************************************************************/ /* The loop over groups */ /*********************************************************************************/ /*********************************************************************************/

/* creates a dataset for each group */

/*********************************************************************************/

%do I=&firstclu %to &lastclu ; %let cluster=&I;

data _&cluster; set &library..&data;

if &group_var=&cluster then output; run;

/*********************************************************************************/ /* counts treatment and control units within each group */ /*********************************************************************************/

proc freq data=_&cluster noprint; table &treat/out=freq_&cluster; run;

proc transpose data=freq_&cluster out=trasp_&cluster name=count

prefix=n_t; run;

(5)

set trasp_&cluster;

if count ne "COUNT" then delete; run;

data trasp_&cluster; set trasp_&cluster; id_clu=&cluster; drop count _label_; run;

%end;

%do n=&firstclu %to &lastclu; %let nidcluster=&n;

data nt; set

%do h=1 %to &nidcluster; trasp_&h %end; ; run; %end; ; run; /*********************************************************************************/ /* creates a disjunctive table for each group using the dummy.sas macro */ /* Produces as output for each group a disj_&cluster dataset */ /*********************************************************************************/

%do J=&firstclu %to &lastclu ;

%let cluster=&J;

%include &path;

%end;

run;

%do I=&firstclu %to &lastclu ;

%let cluster=&I;

data disj_&cluster; set disj_&cluster;

drop &balance_var &treat &group_var;

%end;

run;

%do k=&firstclu %to &lastclu ;

%let cluster=&k; data t_&cluster; set disj_&cluster; keep &treat:; %end; run; /*********************************************************************************/ /* Compute the GI measure for each group */ /* Z includes the Q original categorical covariates in disjunctive form */

/* L includes the t treatment levels */

/* Q denotes the number of categorical variables */ /* n denotes the number of rows in the disjunctive table(s) */

(6)

/* c denotes the number of columns of the disjunctive table(s) */ /* j denotes the number of levels of the Q categorical covariates */ /* B is the Burt table */ /* band is the Burt band */ /* kt denotes the number of treatment and comparison cases */ /* Between denotes the final GI formula */ /*********************************************************************************/

%do I=&firstclu %to &lastclu ;

%let cluster=&I; proc iml;

use disj_&cluster; read all into X; use t_&cluster; read all into L; n=nrow(X); C=ncol(X); levelt=ncol(L); j=C-levelt; Z=X[,2:C-levelt]; A=T(Z); B= A*Z; T=X[,c-(levelt-1):c ]; band=A*T; band_2=band#band; kt=T[+,]; burt_diag=VECDIAG(B); inversa=1/burt_diag; frac_t=band_2#inversa; sum_t=frac_t[+,]; invkt=1/kt; bet1=invkt#sum_t; invQ=1/&Q; bet2=invQ*bet1; bet3=bet2[,+]; gdl=(levelt-1)*(j-1); Between=bet3-1; /*********************************************************************************/ /* chi-square quantile */

/* cinv(p,df,nc) returns the pth quantile, 0<p<1, */ /* of the chi-square distribution with degrees of freedom df. */ /* If the optional non-centrality parameter nc is not specified, nc=0. */ /* nc is the sum of the squares of the means. Examples: */ /* y=cinv(0.95,5); z=cinv(0.95,5,3); */ /*********************************************************************************/ chi=cinv(1-&alpha,gdl)/(n*&Q);

if (Between > chi) | (Between < 0) then Balance='Unbalanced'; else Balance='Balanced';

inertia_tot=(j/&Q)-1;

within=inertia_tot-Between;

MIC=(1-(within/inertia_tot))*100; chi=cinv(1-&alpha,gdl)/(n*&Q);

results_&cluster = Between || chi; cname = {"GI" "CHI" };

create GI_&cluster from results_&cluster [ colname=cname ]; append from results_&cluster;

(7)

quit;

data GI_&cluster; set GI_&cluster; length Balance $ 17;

if GI le CHI then Balance="Yes";

else if (GI=0 and CHI='.') then Balance="no common support"; else Balance="No";

run;

data GI_&cluster; set GI_&cluster;

if Balance="no common support" then GI='.';else GI=GI; run;

%END;

run;

/*********************************************************************************/

/* Create output dataset */

/*********************************************************************************/

%do n=&firstclu %to &lastclu;

%let nidcluster=&n; data GI;

set

%do h=1 %to &nidcluster; GI_&h %end; ; run; %end; ; run; /*********************************************************************************/ /* Delete temporary output dataset(s) */ /*********************************************************************************/

%do n=&firstclu %to &lastclu;

%let nidcluster=&n; proc datasets;

delete GI_&nidcluster freq_&nidcluster trasp_&nidcluster t_&nidcluster _&nidcluster _cats_ disj_&nidcluster;

%end; run; data GI; set GI; id_clu=_n_; run;

proc sort data=GI; by id_clu;

run;

proc sort data=Nt; by id_clu;

run;

(8)

merge Nt(in=A) Gi(in=B); by id_clu;

if A and B; run;

data &library..&out; set &library..&out; n=sum(n_t1, n_t2); run;

%MEND Balance;

%Balance;

Example: the output dataset

Suppose you have 18 subgroups on which test balance of covariates among treatment groups before

estimate a treatment effect. The balance macro allows you to understand where an effect could be

estimated in a easy way: the output dataset of %BALANCE Macro gives you information about

balance for each subgroup.

In particular, the output result displays for each group the number of units within treatment group

(n_t1), the number of units in the control group (n_t2), the number of units within each subgroup (n),

the group membership indicator (id_clu), the GI measure (GI), the upper limit of the confidence

interval (CHI) and the balance result (BALANCE). Balance equals ‘yes’ if the GI measure is lower

than the upper limit of the confidence interval; otherwise, it equals ‘no’ if the GI measure is greater

than the upper limit of the confidence interval. Finally, Balance equals ‘no common support’ when the

GI measure and its test cannot computed because there are only treated units or only controls.

(9)