Source code for spreg.diagnostics_tsls

"""
Diagnostics for two stage least squares regression estimations. 
        
"""

__author__ = "Luc Anselin luc.anselin@asu.edu, Nicholas Malizia nicholas.malizia@asu.edu "

from libpysal.common import *
from scipy.stats import pearsonr

__all__ = ["t_stat", "pr2_aspatial", "pr2_spatial"]


def t_stat(reg, z_stat=False):
    """
    Calculates the t-statistics (or z-statistics) and associated p-values.
    :cite:`Greene2003`

    Parameters
    ----------
    reg             : regression object
                      output instance from a regression model
    z_stat          : boolean
                      If True run z-stat instead of t-stat

    Returns
    -------
    ts_result       : list of tuples
                      each tuple includes the value of the t statistic (or z
                      statistic) and its associated p-value

    Examples
    --------

    We first need to import the needed modules. Numpy is needed to convert the
    data we read into arrays that ``spreg`` understands and ``pysal`` to
    perform all the analysis. The ``diagnostics`` module is used for the tests
    we will show here, and OLS and TSLS are required to run the models on
    which we will perform the tests.

    >>> import numpy as np
    >>> import libpysal
    >>> from libpysal import examples
    >>> import spreg
    >>> from spreg import OLS

    Open data on Columbus neighborhood crime (49 areas) using
    libpysal.io.open(). This is the DBF associated with the Columbus
    shapefile. Note that libpysal.io.open() also reads data in CSV format;
    since the actual class requires data to be passed in as numpy arrays, the
    user can read their data in using any method.

    >>> db = libpysal.io.open(examples.get_path("columbus.dbf"),'r')

    Before being able to apply the diagnostics, we have to run a model and,
    for that, we need the input variables. Extract the CRIME column (crime
    rates) from the DBF file and make it the dependent variable for the
    regression. Note that PySAL requires this to be a numpy array of shape
    (n, 1) as opposed to the also common shape of (n, ) that other packages
    accept.

    >>> y = np.array(db.by_col("CRIME"))
    >>> y = np.reshape(y, (49,1))

    Extract the INC (income) and HOVAL (home value) vectors from the DBF to
    be used as independent variables in the regression. Note that PySAL
    requires this to be an nxj numpy array, where j is the number of
    independent variables (not including a constant). By default this model
    adds a vector of ones to the independent variables passed in, but this
    can be overridden by passing constant=False.

    >>> X = []
    >>> X.append(db.by_col("INC"))
    >>> X.append(db.by_col("HOVAL"))
    >>> X = np.array(X).T

    Run an OLS regression. Since it is a non-spatial model, all we need are
    the dependent and the independent variables.

    >>> reg = OLS(y,X)

    Now we can compute the t-statistics for the model:

    >>> testresult = spreg.t_stat(reg)
    >>> print("%12.12f"%testresult[0][0], "%12.12f"%testresult[0][1], "%12.12f"%testresult[1][0], "%12.12f"%testresult[1][1], "%12.12f"%testresult[2][0], "%12.12f"%testresult[2][1])
    14.490373143689 0.000000000000 -4.780496191297 0.000018289595 -2.654408642718 0.010874504910

    We can also use the z-stat. For that, we re-build the model so we
    consider HOVAL as endogenous, instrument for it using DISCBD and carry
    out two stage least squares (TSLS) estimation.

    >>> X = []
    >>> X.append(db.by_col("INC"))
    >>> X = np.array(X).T
    >>> yd = []
    >>> yd.append(db.by_col("HOVAL"))
    >>> yd = np.array(yd).T
    >>> q = []
    >>> q.append(db.by_col("DISCBD"))
    >>> q = np.array(q).T

    Once the variables are read as different objects, we are good to run the
    model.

    >>> reg = spreg.TSLS(y, X, yd, q)

    With the output of the TSLS regression, we can compute the z-statistics:

    >>> testresult = spreg.t_stat(reg, z_stat=True)
    >>> print("%12.10f"%testresult[0][0], "%12.10f"%testresult[0][1], "%12.10f"%testresult[1][0], "%12.10f"%testresult[1][1], "%12.10f"%testresult[2][0], "%12.10f"%testresult[2][1])
    5.8452644705 0.0000000051 0.3676015668 0.7131703463 -1.9946891308 0.0460767956

    """
    k = reg.k            # (scalar) number of ind. vars (includes constant)
    n = reg.n            # (scalar) number of observations
    vm = reg.vm          # (array) variance-covariance matrix (k x k)
    betas = reg.betas    # (array) coefficients of the regressors (k x 1)
    variance = vm.diagonal()
    tStat = betas.reshape(len(betas),) / np.sqrt(variance)
    ts_result = []
    for t in tStat:
        if z_stat:
            ts_result.append((t, stats.norm.sf(abs(t)) * 2))
        else:
            ts_result.append((t, stats.t.sf(abs(t), n - k) * 2))
    return ts_result
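
# A hedged sketch of the logic in ``t_stat`` above, using made-up numbers
# rather than spreg output: each statistic is a coefficient divided by its
# standard error (the square root of the corresponding diagonal element of
# the variance-covariance matrix), and the two-sided p-value doubles the
# upper-tail probability of |t|. The values of ``b``, ``se`` and ``df`` are
# illustration-only assumptions.
#
#     import numpy as np
#     from scipy import stats
#     b = np.array([68.6, -1.60, -0.27])     # hypothetical coefficients
#     se = np.array([4.74, 0.33, 0.10])      # hypothetical standard errors
#     df = 49 - 3                            # n - k degrees of freedom
#     t = b / se
#     p_t = 2 * stats.t.sf(np.abs(t), df)    # default t-test p-values
#     p_z = 2 * stats.norm.sf(np.abs(t))     # z-test p-values (z_stat=True)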
def pr2_aspatial(tslsreg):
    """
    Calculates the pseudo r^2 for the two stage least squares regression.

    Parameters
    ----------
    tslsreg         : two stage least squares regression object
                      output instance from a two stage least squares
                      regression model

    Returns
    -------
    pr2_result      : float
                      value of the squared pearson correlation between
                      the y and tsls-predicted y vectors

    Examples
    --------

    We first need to import the needed modules. Numpy is needed to convert the
    data we read into arrays that ``spreg`` understands and ``pysal`` to
    perform all the analysis. TSLS is required to run the model on which we
    will perform the test.

    >>> import numpy as np
    >>> from spreg import TSLS, pr2_aspatial
    >>> import libpysal
    >>> from libpysal import examples

    Open data on Columbus neighborhood crime (49 areas) using
    libpysal.io.open(). This is the DBF associated with the Columbus
    shapefile. Note that libpysal.io.open() also reads data in CSV format;
    since the actual class requires data to be passed in as numpy arrays, the
    user can read their data in using any method.

    >>> db = libpysal.io.open(examples.get_path("columbus.dbf"),'r')

    Before being able to apply the diagnostics, we have to run a model and,
    for that, we need the input variables. Extract the CRIME column (crime
    rates) from the DBF file and make it the dependent variable for the
    regression. Note that PySAL requires this to be a numpy array of shape
    (n, 1) as opposed to the also common shape of (n, ) that other packages
    accept.

    >>> y = np.array(db.by_col("CRIME"))
    >>> y = np.reshape(y, (49,1))

    Extract the INC (income) vector from the DBF to be used as the exogenous
    independent variable in the regression. Note that PySAL requires this to
    be an nxj numpy array, where j is the number of independent variables
    (not including a constant). By default this model adds a vector of ones
    to the independent variables passed in, but this can be overridden by
    passing constant=False.

    >>> X = []
    >>> X.append(db.by_col("INC"))
    >>> X = np.array(X).T

    In this case, we consider HOVAL (home value) as an endogenous regressor,
    so we acknowledge that by reading it in as a separate object.

    >>> yd = []
    >>> yd.append(db.by_col("HOVAL"))
    >>> yd = np.array(yd).T

    In order to properly account for the endogeneity, we have to pass in the
    instruments. Let us assume DISCBD (distance to the CBD) is a good one:

    >>> q = []
    >>> q.append(db.by_col("DISCBD"))
    >>> q = np.array(q).T

    Now we are good to run the model. It is an easy one line task.

    >>> reg = TSLS(y, X, yd, q=q)

    To compute the pseudo R^2, we pass the regression object to the function
    and we are done!

    >>> result = pr2_aspatial(reg)
    >>> print("%1.6f"%result)
    0.279361

    """
    y = tslsreg.y
    predy = tslsreg.predy
    pr = pearsonr(y.flatten(), predy.flatten())[0]
    pr2_result = float(pr ** 2)
    return pr2_result
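
# A hedged sketch of what ``pr2_aspatial`` computes, on made-up arrays that
# stand in for ``tslsreg.y`` and ``tslsreg.predy``: the pseudo R^2 is simply
# the squared Pearson correlation between observed and predicted y.
#
#     import numpy as np
#     from scipy.stats import pearsonr
#     y = np.array([15.7, 18.8, 30.6, 32.4])       # hypothetical observed y
#     predy = np.array([17.2, 21.0, 28.1, 30.9])   # hypothetical predicted y
#     pr2 = float(pearsonr(y, predy)[0] ** 2)      # squared correlation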
def pr2_spatial(tslsreg):
    """
    Calculates the pseudo r^2 for the spatial two stage least squares
    regression.

    Parameters
    ----------
    tslsreg         : spatial two stage least squares regression object
                      output instance from a spatial two stage least squares
                      regression model

    Returns
    -------
    pr2_result      : float
                      value of the squared pearson correlation between
                      the y and stsls-predicted y vectors

    Examples
    --------

    We first need to import the needed modules. Numpy is needed to convert the
    data we read into arrays that ``spreg`` understands and ``pysal`` to
    perform all the analysis. GM_Lag is required to run the model on which we
    will perform the test, and this module contains the function with the
    test.

    >>> import numpy as np
    >>> import libpysal
    >>> from libpysal import examples
    >>> import spreg as D
    >>> from spreg import GM_Lag

    Open data on Columbus neighborhood crime (49 areas) using
    libpysal.io.open(). This is the DBF associated with the Columbus
    shapefile. Note that libpysal.io.open() also reads data in CSV format;
    since the actual class requires data to be passed in as numpy arrays, the
    user can read their data in using any method.

    >>> db = libpysal.io.open(examples.get_path("columbus.dbf"),'r')

    Extract the HOVAL column (home value) from the DBF file and make it the
    dependent variable for the regression. Note that PySAL requires this to
    be a numpy array of shape (n, 1) as opposed to the also common shape of
    (n, ) that other packages accept.

    >>> y = np.array(db.by_col("HOVAL"))
    >>> y = np.reshape(y, (49,1))

    Extract the INC (income) vector from the DBF to be used as the exogenous
    independent variable in the regression. Note that PySAL requires this to
    be an nxj numpy array, where j is the number of independent variables
    (not including a constant). By default this model adds a vector of ones
    to the independent variables passed in, but this can be overridden by
    passing constant=False.

    >>> X = np.array(db.by_col("INC"))
    >>> X = np.reshape(X, (49,1))

    In this case, we consider CRIME (crime rates) as an endogenous regressor,
    so we acknowledge that by reading it in as a separate object.

    >>> yd = np.array(db.by_col("CRIME"))
    >>> yd = np.reshape(yd, (49,1))

    In order to properly account for the endogeneity, we have to pass in the
    instruments. Let us assume DISCBD (distance to the CBD) is a good one:

    >>> q = np.array(db.by_col("DISCBD"))
    >>> q = np.reshape(q, (49,1))

    Since this test has a spatial component, we need to specify the spatial
    weights matrix that includes the spatial configuration of the
    observations in the error component of the model. To do that, we can
    open an already existing gal file or create a new one. In this case, we
    will create one from ``columbus.shp``.

    >>> w = libpysal.weights.Rook.from_shapefile(examples.get_path("columbus.shp"))

    Unless there is a good reason not to, the weights should be
    row-standardized so every row of the matrix sums to one. Among other
    things, this allows the spatial lag of a variable to be interpreted as
    the average value of the neighboring observations. In PySAL, this can be
    easily performed in the following way:

    >>> w.transform = 'r'

    Now we are good to run the spatial lag model. Make sure you pass all the
    parameters correctly and, if desired, pass the names of the variables as
    well so that they are included when you print the summary (reg.summary):

    >>> reg = GM_Lag(y, X, w=w, yend=yd, q=q, w_lags=2, name_x=['inc'], name_y='hoval', name_yend=['crime'], name_q=['discbd'], name_ds='columbus')

    Once we have a regression object, we can compute the spatial version of
    the pseudo R^2. It is as simple as one line!

    >>> result = pr2_spatial(reg)
    >>> print("%1.6f"%result)
    0.299649

    """
    y = tslsreg.y
    predy_e = tslsreg.predy_e
    pr = pearsonr(y.flatten(), predy_e.flatten())[0]
    pr2_result = float(pr ** 2)
    return pr2_result


def _test():
    import doctest
    doctest.testmod()


if __name__ == '__main__':
    _test()
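
# A hedged closing note on the two pseudo R^2 functions: ``pr2_spatial``
# uses ``predy_e``, the alternative set of predicted values that spreg's
# spatial lag models expose, where ``pr2_aspatial`` uses ``predy``.
# Reusing the GM_Lag doctest objects above, the two could be compared as
# in this sketch:
#
#     from scipy.stats import pearsonr
#     r2_spatial = pearsonr(reg.y.flatten(), reg.predy_e.flatten())[0] ** 2
#     r2_aspatial = pearsonr(reg.y.flatten(), reg.predy.flatten())[0] ** 2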