rspeare/p_values_for_logreg.py

rizzomichaelg · 2017-11-24T16:24:54Z

Thanks for posting this! I'm wondering if, for the Fisher info matrix calculation, the denom should be tiled first because I'm running into problems of division on arrays of different shape. Was thinking something like this:

denom = (2.0*(1.0+np.cosh(self.model.decision_function(X))))
denom = np.tile(denom,(X.shape[1],1)).T
F_ij = np.dot((X/denom).T,X) ## Fisher Information Matrix
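
For what it's worth, NumPy broadcasting should give the same matrix without the explicit tile. A minimal sketch, where the random X and scores are stand-ins (assumptions for illustration) for the real design matrix and the output of decision_function:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # stand-in for the design matrix
scores = rng.normal(size=50)      # stand-in for model.decision_function(X)

denom = 2.0 * (1.0 + np.cosh(scores))            # shape (50,)

# Tiled version from the snippet above: expand denom to X's shape.
denom_tiled = np.tile(denom, (X.shape[1], 1)).T  # shape (50, 3)
F_tiled = np.dot((X / denom_tiled).T, X)

# Broadcasting version: a trailing axis lets NumPy divide row by row.
F_broadcast = (X / denom[:, None]).T @ X

same = np.allclose(F_tiled, F_broadcast)
```

Either form avoids the shape mismatch; the broadcasting one just skips allocating the tiled copy.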

I believe sigma_estimates can be condensed to: sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao)).
I also extended this to include confidence intervals for each of the params (similar to how statsmodels does it):

alpha = 0.05
q = stats.norm.ppf(1 - alpha / 2)
lower = self.model.coef_[0] - q * sigma_estimates
upper = self.model.coef_[0] + q * sigma_estimates
self.conf_int = np.dstack((lower, upper))[0]
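
To make the interval calculation concrete, here is a small standalone sketch. The coefficient and standard-error values are made up, and `statistics.NormalDist().inv_cdf` from the standard library is used in the same role as `stats.norm.ppf` so the example runs without SciPy:

```python
import numpy as np
from statistics import NormalDist

coef = np.array([0.8, -1.2, 0.05])            # hypothetical fitted coefficients
sigma_estimates = np.array([0.2, 0.4, 0.1])   # hypothetical standard errors

alpha = 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96, same role as stats.norm.ppf

lower = coef - q * sigma_estimates
upper = coef + q * sigma_estimates
conf_int = np.column_stack((lower, upper))    # one (lower, upper) row per coefficient

# A coefficient is significant at level alpha when its interval excludes zero.
excludes_zero = (lower > 0) | (upper < 0)
```

`np.column_stack((lower, upper))` produces the same (n_features, 2) array as `np.dstack((lower, upper))[0]`, and reads a little more directly.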

Thanks again for this!

MiloVentimiglia · 2018-09-28T09:04:40Z

Hi,

Would you be able to point to any documentation, book, or article that served as the basis for this code?
I don't quite understand why you apply a hyperbolic cosine to the scores to calculate "denom".

Thanks in advance!

rspeare · 2019-01-13T14:33:59Z

Hey @rizzomichaelg, thanks so much for the comments. I've put the changes into the gist. @MiloVentimiglia, the cosh just comes from the Hessian of the binomial log-likelihood for logistic regression. (It's a little tricky, but all generalized linear models have a Fisher information matrix of the form X^T.D.X, where X is the n_samples x n_features data matrix and D is some intermediary, normally diagonal. In this case the diagonal entries are p(1-p) = 1/(2(1+cosh(z))), where z is the decision function and p is the predicted probability, and that is where our cosh function comes from.)
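
A quick numerical check of that identity may help. This is only a sketch on random stand-in data (X and beta are assumptions, not anything fitted), comparing the usual logistic weight p(1-p) with the cosh expression and the two resulting information matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))       # stand-in data matrix
beta = rng.normal(size=4)           # stand-in coefficients (no intercept)

z = X @ beta                        # decision function
p = 1.0 / (1.0 + np.exp(-z))        # predicted probabilities

w_prob = p * (1.0 - p)                      # usual GLM weight for the logit link
w_cosh = 1.0 / (2.0 * (1.0 + np.cosh(z)))   # weight used in the gist

# Fisher information as X^T D X with D = diag(p * (1 - p))
F_glm = X.T @ np.diag(w_prob) @ X
# Same matrix computed the way the gist does it, dividing X by denom
F_gist = np.dot((X / (2.0 * (1.0 + np.cosh(z)))[:, None]).T, X)

weights_match = np.allclose(w_prob, w_cosh)
fisher_match = np.allclose(F_glm, F_gist)
```

The algebra behind it: p(1-p) = e^z / (1+e^z)^2 = 1 / (2 + e^z + e^-z) = 1 / (2(1+cosh(z))).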

Mikeal001 · 2019-08-16T12:18:49Z

I have tried to use your code, but I get an error. The full code and the error are shown below.

inputs_train.shape
(373028, 104)

loan_data_targets_train.shape
(373028, 1)

Now fitting it with p-values

import numpy as np
import scipy.stats as stat
from sklearn import linear_model
from sklearn import metrics

class LogisticRegression_with_p_values:  # wrapper class that adds p-values

    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)
        #### Get p-values for the fitted model ####
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom, (X.shape[1], 1)).T
        F_ij = np.dot((X / denom).T, X)  ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij)  ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0] / sigma_estimates  # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]  ### two-tailed test for p-values
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        self.p_values = p_values  # p-values are stored in the attribute p_values

reg = LogisticRegression_with_p_values()
reg.fit(inputs_train, loan_data_targets_train)

RESULTS

LinAlgError Traceback (most recent call last)
in
----> 1 reg.fit(inputs_train, loan_data_targets_train)
2 # Estimates the coefficients of the object from the 'LogisticRegression' class
3 # with inputs (independent variables) contained in the first dataframe
4 # and targets (dependent variables) contained in the second dataframe.

in fit(self, X, y)
20 denom = np.tile(denom,(X.shape[1],1)).T
21 F_ij = np.dot((X / denom).T,X) ## Fisher Information Matrix
---> 22 Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
23 sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
24 z_scores = self.model.coef_[0] / sigma_estimates # z-score for eaach model coefficient

~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in inv(a)
549 signature = 'D->D' if isComplexType(t) else 'd->d'
550 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
552 return wrap(ainv.astype(result_t, copy=False))
553

~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_singular(err, flag)
95
96 def _raise_linalgerror_singular(err, flag):
---> 97 raise LinAlgError("Singular matrix")
98
99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix
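
A singular Fisher matrix usually points to linearly dependent columns in X, for example a duplicated feature or a complete set of dummy variables. Whether that is the cause for this particular dataset is an assumption. Since the weights 1/denom are strictly positive, F_ij has the same rank as X, so checking the rank of X is a cheap first diagnostic. A sketch on toy data with a deliberately duplicated column:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
# Hypothetical redundancy: a fourth column that is an exact copy of column 0.
X = np.column_stack([X, X[:, 0]])

rank = np.linalg.matrix_rank(X)               # 3, although X has 4 columns
is_rank_deficient = rank < X.shape[1]         # True -> F_ij cannot be inverted

scores = rng.normal(size=200)                 # stand-in for decision_function(X)
denom = 2.0 * (1.0 + np.cosh(scores))

# Dropping the redundant column restores an invertible matrix.
X_reduced = X[:, :3]
F_reduced = (X_reduced / denom[:, None]).T @ X_reduced
Cramer_Rao = np.linalg.inv(F_reduced)
sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
```

With 104 columns, comparing `np.linalg.matrix_rank(inputs_train)` against `inputs_train.shape[1]` would show whether any columns need to be dropped before computing p-values.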

anaclaramatos · 2019-08-26T15:07:14Z

When fit_intercept = True, shouldn't a column of ones be added to x?

def fit(self, x, y):
        self.model.fit(x, y)

        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(x))))

        if self._fit_intercept:
            x = np.hstack([np.ones((x.shape[0], 1)), x])

        denom = np.tile(denom, (x.shape[1], 1)).T

        f_ij = np.dot((x / denom).T, x)  ## Fisher Information Matrix
        cramer_rao = np.linalg.inv(f_ij)  ## Inverse Information Matrix

        if self._fit_intercept:
            self.coef = np.column_stack((self.model.intercept_, self.model.coef_))
        else:
            self.coef = self.model.coef_

        self.sigma = np.sqrt(np.diagonal(cramer_rao))
        self.z = (self.coef / self.sigma)[0]
        self.p = (np.round([stat.norm.sf(abs(x)) * 2 for x in self.z], 3))
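
To illustrate why the column of ones matters, here is a standalone sketch using NumPy only. The coefficients and intercept are made-up values rather than a real fit, `fisher_sigma` is a hypothetical helper, and `erfc` from the standard library stands in for `2 * stat.norm.sf(abs(z))`:

```python
import numpy as np
from math import erfc, sqrt

def fisher_sigma(X, scores, add_intercept):
    """Standard errors from the inverse Fisher information matrix."""
    X = np.asarray(X, dtype=float)
    if add_intercept:
        # The intercept gets its own row and column in the information matrix.
        X = np.hstack([np.ones((X.shape[0], 1)), X])
    weights = 1.0 / (2.0 * (1.0 + np.cosh(scores)))   # equals p * (1 - p)
    fisher = (X * weights[:, None]).T @ X
    return np.sqrt(np.diagonal(np.linalg.inv(fisher)))

rng = np.random.default_rng(3)
X = rng.normal(loc=2.0, size=(300, 2))        # non-centred toy features
intercept, coef = 1.0, np.array([0.5, -0.3])  # hypothetical fitted values
scores = intercept + X @ coef                 # what decision_function would return

sigma_full = fisher_sigma(X, scores, add_intercept=True)    # [intercept, x0, x1]
sigma_naive = fisher_sigma(X, scores, add_intercept=False)  # [x0, x1]

params = np.concatenate([[intercept], coef])
z_scores = params / sigma_full
# erfc(|z| / sqrt(2)) is the two-tailed normal p-value
p_values = np.array([erfc(abs(z) / sqrt(2.0)) for z in z_scores])
```

For the same scores, leaving the intercept column out can only shrink the standard errors of the other coefficients (never enlarge them), so the p-values computed without it tend to look more significant than they should.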

PescheHelfer · 2019-11-03T20:00:59Z

It's quite possible I am doing something wrong, but I can't do predictions using the wrapper method because it does not know the method predict(). I tried to get around this problem by implementing inheritance, but failed miserably. Does anybody have an idea, what I am doing wrong?

If I use "self.model.fit(X,y)", I get the error

"This LogisticRegressionExtended instance is not fitted yet".

If I try to invoke the fit method in the base class by using "super().fit(X, y)", I get

`C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\py36GPU\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1198 Returns self.
1199 """
-> 1200 if not isinstance(self.C, numbers.Number) or self.C < 0:
1201 raise ValueError("Penalty term must be positive; got (C=%r)"
1202 % self.C)

AttributeError: 'LogisticRegressionExtended' object has no attribute 'C'`

When trying "super().model.fit(X, y)", I get

"'super' object has no attribute 'model'".

...
from sklearn import linear_model
import scipy.stats as stat
...
class LogisticRegressionExtended(linear_model.LogisticRegression):

    def __init__(self,*args,**kwargs):#,**kwargs):
        #self.model = LogisticRegression(*args,**kwargs)#,**args)
        super().__init__(*args, **kwargs)

    def fit(self,X,y):
        #self.model.fit(X,y)
        #super().fit(X, y)
        super().model.fit(X, y)
 ...

Thanks for any help :)
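
One possible way to get predict() is to subclass without overriding __init__ at all and to extend fit only. The three errors above look consistent with that reading: self.model.fit fits the inner estimator rather than the instance itself, self.C is missing if the parent __init__ never ran, and model is an instance attribute, so super() cannot see it. This is a sketch under those assumptions, for the binary case and without an intercept column, as in the original gist. The trailing-underscore attribute names are my own choice, and sample_weight is passed to the fit but ignored in the p-value step:

```python
import numpy as np
import scipy.stats as stat
from sklearn.linear_model import LogisticRegression

class LogisticRegressionExtended(LogisticRegression):
    # No __init__ override: the parent signature (C, penalty, ...) is inherited,
    # so attributes such as self.C exist and get_params()/clone() keep working.

    def fit(self, X, y, sample_weight=None):
        super().fit(X, y, sample_weight=sample_weight)
        X = np.asarray(X, dtype=float)
        # Same calculation as the gist (binary case, intercept column not included).
        denom = 2.0 * (1.0 + np.cosh(self.decision_function(X)))
        F_ij = (X / denom[:, None]).T @ X
        self.sigma_estimates_ = np.sqrt(np.diagonal(np.linalg.inv(F_ij)))
        self.z_scores_ = self.coef_[0] / self.sigma_estimates_
        self.p_values_ = 2.0 * stat.norm.sf(np.abs(self.z_scores_))
        return self

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # toy binary target

clf = LogisticRegressionExtended().fit(X, y)
predictions = clf.predict(X)      # inherited from LogisticRegression
```

Because the instance itself is the fitted estimator, predict(), predict_proba() and score() are all available, and constructor arguments such as C can be passed as usual.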

wkangong · 2020-04-17T17:32:47Z

Did anyone ever solve this problem? I am having the same issue.

Akanksha594 · 2020-04-25T02:39:40Z

I am also having the same problem as @Mikeal001, with the same dataset. Could you please suggest a solution for this error?
Thanks

rspeare · 2020-04-27T04:28:34Z

Hey @Akanksha594 and @wkangong and @Mikeal001 :

One thing to try is adding a tiny amount to the diagonal of the matrix before inversion, e.g:

eps=1e-4
F_ij = np.dot((X / denom).T,X) + np.eye(F_ij.shape[0])*eps ## Fisher Information Matrix

Akanksha594 · 2020-04-27T06:01:18Z

Thank you so much for your effort, but I am running exactly the same code and now I get an UnboundLocalError: local variable 'F_ij' referenced before assignment. Please help me with a solution. Thanks for this.
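
The UnboundLocalError comes from reading F_ij.shape on the right-hand side before F_ij has been assigned. Splitting the line in two avoids it. A sketch on toy data, where the collinear column and the random scores are assumptions used only to make the matrix singular:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # toy collinear column -> singular F_ij
scores = rng.normal(size=100)                 # stand-in for decision_function(X)
denom = 2.0 * (1.0 + np.cosh(scores))

eps = 1e-4
F_ij = np.dot((X / denom[:, None]).T, X)      # 1) build the matrix first
F_ij = F_ij + np.eye(F_ij.shape[0]) * eps     # 2) then add eps to its diagonal

Cramer_Rao = np.linalg.inv(F_ij)              # now invertible
sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
```

Using `np.eye(X.shape[1])` in a single line would work too. Keep in mind that the standard errors of the collinear features then depend on the choice of eps, so they should be read with caution.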


dncortez · 2020-05-11T15:56:40Z

Hi, Rspeare! Thank you so much for your code. I used it in comparative genomics research and am now writing the paper for publication. I'd like to know whether you're OK with me citing this code directly with you as author, or whether you'd prefer that I cite the original method on logistic regression and the binomial likelihood.
Cheers, Diego.

rspeare · 2020-05-11T16:09:40Z

Hey @dncortez! really excited to hear you used this for research in genomics! No need to cite this gist, perhaps an older source / paper on p-values for generalized linear models would be a good reference

-Rob

dini437 · 2020-07-12T05:34:18Z

I have tried to use your code, but I get an error. The full code and the error are shown below.

inputs_train.shape
(373028, 104)

loan_data_targets_train.shape
(373028, 1)

Now fitting it with p-values

import numpy as np
import scipy.stats as stat
from sklearn import linear_model
from sklearn import metrics

class LogisticRegression_with_p_values:  # wrapper class that adds p-values

    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)
        #### Get p-values for the fitted model ####
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom, (X.shape[1], 1)).T
        F_ij = np.dot((X / denom).T, X)  ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij)  ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0] / sigma_estimates  # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]  ### two-tailed test for p-values
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        self.p_values = p_values  # p-values are stored in the attribute p_values

reg = LogisticRegression_with_p_values()
reg.fit(inputs_train, loan_data_targets_train)

RESULTS

LinAlgError Traceback (most recent call last)
in
----> 1 reg.fit(inputs_train, loan_data_targets_train)
2 # Estimates the coefficients of the object from the 'LogisticRegression' class
3 # with inputs (independent variables) contained in the first dataframe
4 # and targets (dependent variables) contained in the second dataframe.

in fit(self, X, y)
20 denom = np.tile(denom,(X.shape[1],1)).T
21 F_ij = np.dot((X / denom).T,X) ## Fisher Information Matrix
---> 22 Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
23 sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
24 z_scores = self.model.coef_[0] / sigma_estimates # z-score for eaach model coefficient

~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in inv(a)
549 signature = 'D->D' if isComplexType(t) else 'd->d'
550 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
552 return wrap(ainv.astype(result_t, copy=False))
553

~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_singular(err, flag)
95
96 def _raise_linalgerror_singular(err, flag):
---> 97 raise LinAlgError("Singular matrix")
98
99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

dini437 · 2020-07-12T05:59:20Z

Hi @rspeare, I tried the following code as per your guidance.

eps=1e-4
F_ij = np.dot((X / denom).T,X) + np.eye(F_ij.shape[0])*eps ## Fisher Information Matrix

But now I get an UnboundLocalError: local variable 'F_ij' referenced before assignment.
Can you please help me with this? Thanks

biohouston · 2020-10-13T13:29:24Z

I edited code a bit so it now can work with multinomial logistic regression. The extended class should calculate statistics for each pair of classes of the dependent variable.
I also added a function to output the results of calculation properly.

class MNLogisticReg(linear_model.LogisticRegression):
    
    def __init__(self, *args,**kwargs):#,**kwargs):
        self.model = linear_model.LogisticRegression(*args,**kwargs)#,**args)
        if 'fit_intercept' in kwargs.keys():           
            self._fit_intercept = kwargs['fit_intercept']

    def fit(self,X,y):
        self.model.fit(X,y)
        #### Get p-values for the fitted model ####
        denom = (2.0*(1.0+np.cosh(self.model.decision_function(X))))
        p_values = []
        z_scores = []
        self.columns = list(X.columns)

        if self._fit_intercept:
            X = np.hstack([np.ones((X.shape[0], 1)), X])
           
        for i in range(denom.shape[1]):
            d = denom[:,i]        
            
            if self._fit_intercept:
                self.coef = np.column_stack((self.model.intercept_, self.model.coef_))
            else:
                self.coef = self.model.coef_
            
            d = np.tile(d,(X.shape[1],1)).T
            F_ij = np.dot((X/d).T,X) ## Fisher Information Matrix
            Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix  
            sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
            z_score = (self.coef[i]/sigma_estimates) # z-score for each model coefficient
            z_scores.append(z_score)
            p_vals = [stat.norm.sf(abs(i))*2 for i in z_score] ### two tailed test for p-values
            p_values.append(p_vals)
            
        self.z_scores = np.array(z_scores)
        self.p_values = np.array(p_values)
        self.sigma_estimates = sigma_estimates
        self.F_ij = F_ij

    # A function to create an output in form of pandas dataframe, with regressors and intercept in
    # rows and coefficients in columns. Coefficients, p-values and z-scores are calculated for each
    # pair of classes in the dependent variable
    
    def printstats(self):      
        data = None
        for i in range(self.coef.shape[0]):
            if data is None:
                data = np.vstack(( self.coef[i,:], self.p_values[i,:], self.z_scores[i,:])).T
            else:
                d0 = np.vstack(( self.coef[i,:], self.p_values[i,:], self.z_scores[i,:])).T
                data = np.hstack((data,d0))
        # data is reshaped in the correct order
        regr = []
        for item in list(itertools.combinations(list(dep_acute.unique()), 2)):
            regr.append('{} vs {}'.format(item[0], item[1]))
            
        functions = ['coef', 'P-value', 'Z-score']
        column_names = [([i] + [j]) for i in regr for j in functions] 
        index = pd.MultiIndex.from_tuples(column_names)
        predictors = self.columns
        if self._fit_intercept:
            ind = ['intercept'] + predictors
        else:
            ind = predictors
        self.stats = pd.DataFrame(data, columns = index, index = ind)
        return self.stats

Here's an example of output:

Green-Guo · 2021-03-23T13:48:30Z

@biohouston Thanks a whole lot this was super helpful. Also I think this is calling a special implementation of multinomial? I did linear_model.LogisticRegression(multi_class='multinomial', max_iter=500, solver='lbfgs') and only got n series of coefs/p-values (n = groups of y classes) but your awesome code seems to generate C(n, 2) series? :)


JackCollins1991 · 2021-05-29T20:44:29Z

Hi @biohouston, your code was very helpful, but I am getting an error saying 'dep_acute' is not defined. Can you help me understand where this variable comes from? Cheers
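
`dep_acute` is not defined anywhere in the posted class. From how it is used (`dep_acute.unique()`), it looks like the pandas Series holding the dependent variable in the author's own session, but that is a guess. One way to avoid the global is to build the column labels from the class labels directly (a fitted scikit-learn classifier exposes them as `classes_`). The sketch below uses a hypothetical helper `stat_columns` and made-up class names. It also shows the difference raised above: with three or more classes, `coef_` has one row per class, whereas pairwise labels give C(n, 2) groups.

```python
from itertools import combinations

def stat_columns(classes, per="class"):
    """Build (label, statistic) column tuples for a results table.

    `classes` is a hypothetical stand-in for model.classes_ (or for
    dep_acute.unique() in the code above).
    """
    functions = ["coef", "P-value", "Z-score"]
    if per == "pair":
        labels = ["{} vs {}".format(a, b) for a, b in combinations(classes, 2)]
    else:
        labels = [str(c) for c in classes]
    return [(label, f) for label in labels for f in functions]

classes = ["low", "mid", "high", "severe"]        # made-up class labels
per_class = stat_columns(classes)                 # 4 labels x 3 statistics
per_pair = stat_columns(classes, per="pair")      # C(4, 2) = 6 labels x 3 statistics
```

The resulting tuples can be passed to `pd.MultiIndex.from_tuples` as in printstats. With exactly three classes the two counts happen to coincide (3 classes, 3 pairs), which may be why the pairwise labels appeared to fit.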

DavidBrown-RRT · 2022-06-30T18:21:12Z

It boggles my mind that it is so freaking complicated to get a p-value in sklearn. My God, what is the freaking point of doing logistic regression if you can't assess whether your p-values are significant or not? Why is this not a basic function argument?

rspeare/p_values_for_logreg.py

from sklearn import linear_model
import numpy as np
import scipy.stats as stat

class LogisticReg:
    """
    Wrapper class for logistic regression which has the usual sklearn instance
    in an attribute self.model, and p-values, z-scores and estimated
    errors for each coefficient in

    self.z_scores
    self.p_values
    self.sigma_estimates

    as well as the negative Hessian of the log-likelihood (Fisher information)

    self.F_ij
    """

    def __init__(self, *args, **kwargs):
        self.model = linear_model.LogisticRegression(*args, **kwargs)

    def fit(self, X, y):
        self.model.fit(X, y)
        #### Get p-values for the fitted model ####
        denom = (2.0*(1.0+np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom,(X.shape[1],1)).T
        F_ij = np.dot((X/denom).T,X) ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0]/sigma_estimates # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x))*2 for x in z_scores] ### two-tailed test for p-values

        self.z_scores = z_scores
        self.p_values = p_values
        self.sigma_estimates = sigma_estimates
        self.F_ij = F_ij
