Module poldracklab.ml
Functions for cross-validation
This module provides utilities for cross-validation, including a splitter class that divides data into training and testing sets.
The module contains the following class:
BalancedKFold
: Split data into training and testing sets using ANOVA across CV folds
Classes
class BalancedKFold (nfolds: int = 5, pthresh: float = 0.8, verbose: bool = False)
# module-level imports required by this class (shown here so the snippet
# is self-contained)
from typing import Iterator, Tuple

import numpy as np
from sklearn.model_selection import KFold
from statsmodels.regression.linear_model import OLS


class BalancedKFold:
    """
    This class uses ANOVA across CV folds to find a set of folds that are
    balanced in their distributions of the Y values - see Kohavi, 1995.
    We don't actually need X, but we take it for consistency with other
    splitters.

    Args:
        nfolds (int): the number of folds to use
        pthresh (float): the p-value threshold for a good split
        verbose (bool): whether to print verbose output
    """

    def __init__(self, nfolds: int = 5, pthresh: float = 0.8, verbose: bool = False):
        self.nfolds = nfolds
        self.pthresh = pthresh
        self.verbose = verbose

    def split(
        self, X: np.ndarray, Y: np.ndarray, max_splits: int = 1000
    ) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        """
        Split the data into training and testing sets

        Args:
            X (np.ndarray): the input data
            Y (np.ndarray): the target data
            max_splits (int): the maximum number of splits to try

        Returns:
            Iterator[Tuple[np.ndarray, np.ndarray]]: the training and
            testing indices for each fold
        """
        nsubs = len(Y)

        # cycle through until we find a split that is good enough
        runctr = 0
        best_pval = 0.0
        best_folds = []
        while True:
            runctr += 1
            cv = KFold(n_splits=self.nfolds, shuffle=True)

            # build a fold-indicator design matrix while collecting folds;
            # KFold only uses the length of its argument, so passing Y works
            idx = np.zeros((nsubs, self.nfolds))
            folds = []
            for ctr, (train, test) in enumerate(cv.split(Y)):
                idx[test, ctr] = 1
                folds.append([train, test])

            # regress the centered target on the fold indicators; a high
            # F-test p-value means the fold means of Y are similar
            lm_y = OLS(Y - np.mean(Y), idx).fit()
            # use >= so best_folds is always set on the first iteration
            if lm_y.f_pvalue >= best_pval:
                best_pval = lm_y.f_pvalue
                best_folds = folds
            if lm_y.f_pvalue > self.pthresh:
                if self.verbose:
                    print(lm_y.summary())
                return iter(folds)
            if runctr > max_splits:
                print("no sufficient split found, returning best (p=%f)" % best_pval)
                return iter(best_folds)
This class uses ANOVA across CV folds to find a set of folds that are balanced in their distributions of the Y values (see Kohavi, 1995). We don't actually need X, but we take it for consistency with other splitters. A minimal sketch of the balance check follows the argument list below.
Args
nfolds : int
    the number of folds to use
pthresh : float
    the p-value threshold for a good split
verbose : bool
    whether to print verbose output
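The following is a minimal sketch (not part of the module) of the balance check that split applies to each candidate set of folds: the centered target is regressed on a fold-indicator design matrix, and the overall F-test p-value measures how similar the fold means of Y are. The helper name check_balance is hypothetical; the sketch assumes numpy, scikit-learn, and statsmodels are available.

# Hypothetical helper illustrating the ANOVA balance criterion used by
# BalancedKFold.split; not part of poldracklab.ml.
import numpy as np
from sklearn.model_selection import KFold
from statsmodels.regression.linear_model import OLS

def check_balance(Y, folds, nfolds):
    """Return the F-test p-value for fold-mean differences in Y."""
    idx = np.zeros((len(Y), nfolds))  # fold-indicator design matrix
    for i, (_, test) in enumerate(folds):
        idx[test, i] = 1
    # regress centered Y on the indicators; a high p-value means the
    # fold means of Y are close to each other, i.e. the folds are balanced
    return OLS(Y - np.mean(Y), idx).fit().f_pvalue

rng = np.random.default_rng(0)
Y = rng.normal(size=100)
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(Y))
print(check_balance(Y, folds, nfolds=5))

A split whose p-value exceeds pthresh is accepted; otherwise split reshuffles and tries again, up to max_splits times.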
Methods
def split(self, X: numpy.ndarray, Y: numpy.ndarray, max_splits: int = 1000) -> Iterator[Tuple[numpy.ndarray, numpy.ndarray]]
def split(
    self, X: np.ndarray, Y: np.ndarray, max_splits: int = 1000
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """
    Split the data into training and testing sets

    Args:
        X (np.ndarray): the input data
        Y (np.ndarray): the target data
        max_splits (int): the maximum number of splits to try

    Returns:
        Iterator[Tuple[np.ndarray, np.ndarray]]: the training and
        testing indices for each fold
    """
    nsubs = len(Y)

    # cycle through until we find a split that is good enough
    runctr = 0
    best_pval = 0.0
    best_folds = []
    while True:
        runctr += 1
        cv = KFold(n_splits=self.nfolds, shuffle=True)

        # build a fold-indicator design matrix while collecting folds;
        # KFold only uses the length of its argument, so passing Y works
        idx = np.zeros((nsubs, self.nfolds))
        folds = []
        for ctr, (train, test) in enumerate(cv.split(Y)):
            idx[test, ctr] = 1
            folds.append([train, test])

        # regress the centered target on the fold indicators; a high
        # F-test p-value means the fold means of Y are similar
        lm_y = OLS(Y - np.mean(Y), idx).fit()
        # use >= so best_folds is always set on the first iteration
        if lm_y.f_pvalue >= best_pval:
            best_pval = lm_y.f_pvalue
            best_folds = folds
        if lm_y.f_pvalue > self.pthresh:
            if self.verbose:
                print(lm_y.summary())
            return iter(folds)
        if runctr > max_splits:
            print("no sufficient split found, returning best (p=%f)" % best_pval)
            return iter(best_folds)
Split the data into training and testing sets, retrying random K-fold splits until the ANOVA p-value for fold-mean differences in Y exceeds pthresh or max_splits attempts have been made.
Args
X : np.ndarray
    the input data (unused; accepted for consistency with other splitters)
Y : np.ndarray
    the target data
max_splits : int
    the maximum number of splits to try
Returns
Iterator[Tuple[np.ndarray, np.ndarray]]
    the training and testing indices for each fold
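A usage sketch on synthetic data follows; it assumes the class is importable as poldracklab.ml.BalancedKFold (per the module name above) and that Y is a continuous target.

# Usage sketch with synthetic data; the import path is assumed from the
# module name, and the data here are purely illustrative.
import numpy as np
from poldracklab.ml import BalancedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = rng.normal(size=100)

bkf = BalancedKFold(nfolds=5, pthresh=0.8)
for train, test in bkf.split(X, Y):
    # train/test are index arrays; the fold means of Y[test] should be
    # similar across folds because of the ANOVA-based selection
    print(len(train), len(test), Y[test].mean())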