Source code for spreg.diagnostics_tsls

"""
Diagnostics for two stage least squares regression estimations. 
        
"""

__author__ = "Luc Anselin luc.anselin@asu.edu, Nicholas Malizia nicholas.malizia@asu.edu "

from libpysal.common import *
from scipy.stats import pearsonr

__all__ = ["t_stat", "pr2_aspatial", "pr2_spatial"]


[docs]def t_stat(reg, z_stat=False):
    """
    Calculates the t-statistics (or z-statistics) and associated p-values.
    :cite:`Greene2003`

    Parameters
    ----------
    reg             : regression object
                      output instance from a regression model
    z_stat          : boolean
                      If True run z-stat instead of t-stat

    Returns
    -------    
    ts_result       : list of tuples
                      each tuple includes value of t statistic (or z
                      statistic) and associated p-value

    Examples
    --------

    We first need to import the needed modules. Numpy is needed to convert the
    data we read into arrays that ``spreg`` understands and ``pysal`` to
    perform all the analysis. The ``diagnostics`` module is used for the tests
    we will show here and the OLS and TSLS are required to run the models on
    which we will perform the tests.

    >>> import numpy as np
    >>> import libpysal
    >>> from libpysal import examples
    >>> import spreg
    >>> from spreg import OLS

    Open data on Columbus neighborhood crime (49 areas) using libpysal.io.open().
    This is the DBF associated with the Columbus shapefile.  Note that
    libpysal.io.open() also reads data in CSV format; since the actual class
    requires data to be passed in as numpy arrays, the user can read their
    data in using any method.  

    >>> db = libpysal.io.open(examples.get_path("columbus.dbf"),'r')

    Before being able to apply the diagnostics, we have to run a model and,
    for that, we need the input variables. Extract the CRIME column (crime
    rates) from the DBF file and make it the dependent variable for the
    regression. Note that PySAL requires this to be an numpy array of shape
    (n, 1) as opposed to the also common shape of (n, ) that other packages
    accept.

    >>> y = np.array(db.by_col("CRIME"))
    >>> y = np.reshape(y, (49,1))

    Extract INC (income) and HOVAL (home value) vector from the DBF to be used as
    independent variables in the regression.  Note that PySAL requires this to
    be an nxj numpy array, where j is the number of independent variables (not
    including a constant). By default this model adds a vector of ones to the
    independent variables passed in, but this can be overridden by passing
    constant=False.

    >>> X = []
    >>> X.append(db.by_col("INC"))
    >>> X.append(db.by_col("HOVAL"))
    >>> X = np.array(X).T

    Run an OLS regression. Since it is a non-spatial model, all we need is the
    dependent and the independent variable.

    >>> reg = OLS(y,X)

    Now we can perform a t-statistic on the model:

    >>> testresult = spreg.t_stat(reg)
    >>> print("%12.12f"%testresult[0][0], "%12.12f"%testresult[0][1], "%12.12f"%testresult[1][0], "%12.12f"%testresult[1][1], "%12.12f"%testresult[2][0], "%12.12f"%testresult[2][1])
    14.490373143689 0.000000000000 -4.780496191297 0.000018289595 -2.654408642718 0.010874504910

    We can also use the z-stat. For that, we re-build the model so we consider
    HOVAL as endogenous, instrument for it using DISCBD and carry out two
    stage least squares (TSLS) estimation.

    >>> X = []
    >>> X.append(db.by_col("INC"))
    >>> X = np.array(X).T    
    >>> yd = []
    >>> yd.append(db.by_col("HOVAL"))
    >>> yd = np.array(yd).T
    >>> q = []
    >>> q.append(db.by_col("DISCBD"))
    >>> q = np.array(q).T

    Once the variables are read as different objects, we are good to run the
    model.

    >>> reg = spreg.TSLS(y, X, yd, q)

    With the output of the TSLS regression, we can perform a z-statistic:

    >>> testresult = spreg.t_stat(reg, z_stat=True)
    >>> print("%12.10f"%testresult[0][0], "%12.10f"%testresult[0][1], "%12.10f"%testresult[1][0], "%12.10f"%testresult[1][1], "%12.10f"%testresult[2][0], "%12.10f"%testresult[2][1])
    5.8452644705 0.0000000051 0.3676015668 0.7131703463 -1.9946891308 0.0460767956
    """

    k = reg.k           # (scalar) number of ind. vas (includes constant)
    n = reg.n           # (scalar) number of observations
    vm = reg.vm         # (array) coefficients of variance matrix (k x k)
    betas = reg.betas   # (array) coefficients of the regressors (1 x k)
    variance = vm.diagonal()
    tStat = betas.reshape(len(betas),) / np.sqrt(variance)
    ts_result = []
    for t in tStat:
        if z_stat:
            ts_result.append((t, stats.norm.sf(abs(t)) * 2))
        else:
            ts_result.append((t, stats.t.sf(abs(t), n - k) * 2))
    return ts_result


def pr2_aspatial(tslsreg):
    """
    Calculates the pseudo r^2 for the two stage least squares regression.

    Parameters
    ----------
    tslsreg             : two stage least squares regression object
                          output instance from a two stage least squares
                          regression model

    Returns
    -------
    pr2_result          : float
                          value of the squared pearson correlation between
                          the y and tsls-predicted y vectors

    Examples
    --------

    We first need to import the needed modules. Numpy is needed to convert the
    data we read into arrays that ``spreg`` understands and ``pysal`` to
    perform all the analysis. The TSLS is required to run the model on
    which we will perform the tests.

    >>> import numpy as np
    >>> from spreg import TSLS, pr2_aspatial
    >>> import libpysal
    >>> from libpysal import examples

    Open data on Columbus neighborhood crime (49 areas) using libpysal.io.open().
    This is the DBF associated with the Columbus shapefile.  Note that
    libpysal.io.open() also reads data in CSV format; since the actual class
    requires data to be passed in as numpy arrays, the user can read their
    data in using any method.  

    >>> db = libpysal.io.open(examples.get_path("columbus.dbf"),'r')

    Before being able to apply the diagnostics, we have to run a model and,
    for that, we need the input variables. Extract the CRIME column (crime
    rates) from the DBF file and make it the dependent variable for the
    regression. Note that PySAL requires this to be an numpy array of shape
    (n, 1) as opposed to the also common shape of (n, ) that other packages
    accept.

    >>> y = np.array(db.by_col("CRIME"))
    >>> y = np.reshape(y, (49,1))

    Extract INC (income) vector from the DBF to be used as
    independent variables in the regression.  Note that PySAL requires this to
    be an nxj numpy array, where j is the number of independent variables (not
    including a constant). By default this model adds a vector of ones to the
    independent variables passed in, but this can be overridden by passing
    constant=False.

    >>> X = []
    >>> X.append(db.by_col("INC"))
    >>> X = np.array(X).T

    In this case, we consider HOVAL (home value) as an endogenous regressor,
    so we acknowledge that by reading it in a different category.

    >>> yd = []
    >>> yd.append(db.by_col("HOVAL"))
    >>> yd = np.array(yd).T

    In order to properly account for the endogeneity, we have to pass in the
    instruments. Let us consider DISCBD (distance to the CBD) is a good one:

    >>> q = []
    >>> q.append(db.by_col("DISCBD"))
    >>> q = np.array(q).T

    Now we are good to run the model. It is an easy one line task.

    >>> reg = TSLS(y, X, yd, q=q)

    In order to perform the pseudo R^2, we pass the regression object to the
    function and we are done!

    >>> result = pr2_aspatial(reg)
    >>> print("%1.6f"%result)    
    0.279361

    """

    y = tslsreg.y
    predy = tslsreg.predy
    pr = pearsonr(y.flatten(), predy.flatten())[0]
    pr2_result = float(pr ** 2)
    return pr2_result


def pr2_spatial(tslsreg):
    """
    Calculates the pseudo r^2 for the spatial two stage least squares 
    regression.

    Parameters
    ----------
    stslsreg            : spatial two stage least squares regression object
                          output instance from a spatial two stage least 
                          squares regression model

    Returns
    -------    
    pr2_result          : float
                          value of the squared pearson correlation between
                          the y and stsls-predicted y vectors

    Examples
    --------

    We first need to import the needed modules. Numpy is needed to convert the
    data we read into arrays that ``spreg`` understands and ``pysal`` to
    perform all the analysis. The GM_Lag is required to run the model on
    which we will perform the tests and the ``spreg.diagnostics`` module
    contains the function with the test.

    >>> import numpy as np
    >>> import libpysal
    >>> from libpysal import examples
    >>> import spreg as D
    >>> from spreg import GM_Lag

    Open data on Columbus neighborhood crime (49 areas) using libpysal.io.open().
    This is the DBF associated with the Columbus shapefile.  Note that
    libpysal.io.open() also reads data in CSV format; since the actual class
    requires data to be passed in as numpy arrays, the user can read their
    data in using any method.  

    >>> db = libpysal.io.open(examples.get_path("columbus.dbf"),'r')

    Extract the HOVAL column (home value) from the DBF file and make it the
    dependent variable for the regression. Note that PySAL requires this to be
    an numpy array of shape (n, 1) as opposed to the also common shape of (n, )
    that other packages accept.

    >>> y = np.array(db.by_col("HOVAL"))
    >>> y = np.reshape(y, (49,1))

    Extract INC (income) vectors from the DBF to be used as
    independent variables in the regression.  Note that PySAL requires this to
    be an nxj numpy array, where j is the number of independent variables (not
    including a constant). By default this model adds a vector of ones to the
    independent variables passed in, but this can be overridden by passing
    constant=False.

    >>> X = np.array(db.by_col("INC"))
    >>> X = np.reshape(X, (49,1))

    In this case, we consider CRIME (crime rates) as an endogenous regressor,
    so we acknowledge that by reading it in a different category.

    >>> yd = np.array(db.by_col("CRIME"))
    >>> yd = np.reshape(yd, (49,1))

    In order to properly account for the endogeneity, we have to pass in the
    instruments. Let us consider DISCBD (distance to the CBD) is a good one:

    >>> q = np.array(db.by_col("DISCBD"))
    >>> q = np.reshape(q, (49,1))

    Since this test has a spatial component, we need to specify the spatial
    weights matrix that includes the spatial configuration of the observations
    into the error component of the model. To do that, we can open an already
    existing gal file or create a new one. In this case, we will create one
    from ``columbus.shp``.

    >>> w = libpysal.weights.Rook.from_shapefile(examples.get_path("columbus.shp")) 

    Unless there is a good reason not to do it, the weights have to be
    row-standardized so every row of the matrix sums to one. Among other
    things, this allows to interpret the spatial lag of a variable as the
    average value of the neighboring observations. In PySAL, this can be
    easily performed in the following way:

    >>> w.transform = 'r'

    Now we are good to run the spatial lag model. Make sure you pass all the
    parameters correctly and, if desired, pass the names of the variables as
    well so when you print the summary (reg.summary) they are included:

    >>> reg = GM_Lag(y, X, w=w, yend=yd, q=q, w_lags=2, name_x=['inc'], name_y='hoval', name_yend=['crime'], name_q=['discbd'], name_ds='columbus')

    Once we have a regression object, we can perform the spatial version of
    the pesudo R^2. It is as simple as one line!

    >>> result = pr2_spatial(reg)
    >>> print("%1.6f"%result)
    0.299649

    """

    y = tslsreg.y
    predy_e = tslsreg.predy_e
    pr = pearsonr(y.flatten(), predy_e.flatten())[0]
    pr2_result = float(pr ** 2)
    return pr2_result


def _test():
    import doctest
    doctest.testmod()


if __name__ == '__main__':
    _test()