Efficient Probit Regression API

This page documents most of the package's functions and classes.

Probit Model

class efficient_probit_regression.probit_model.PGeneralizedProbitSGD(p, initial_learning_rate=0.1, power_t=0.5)

Stochastic gradient descent (SGD) for probit regression. The learning rate is adapted in each iteration using inverse scaling.

Parameters
  • p (int) – The order of the probit model.

  • initial_learning_rate (float) – The initial learning rate.

  • power_t (float) – Inverse scaling is used to adapt the learning rate in each iteration. The update formula is learning_rate = initial_learning_rate / power(cur_iteration, power_t)
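The update formula above can be sketched as a small helper. The function name is hypothetical and not part of the package, and the sketch assumes that the iteration counter starts at 1 (a counter starting at 0 would divide by zero).

```python
def inverse_scaling_learning_rate(initial_learning_rate, cur_iteration, power_t):
    """Learning rate for a 1-based iteration counter (assumption)."""
    return initial_learning_rate / (cur_iteration ** power_t)


# With the default arguments, the rate shrinks with the square root of
# the iteration number.
rates = [inverse_scaling_learning_rate(0.1, t, 0.5) for t in (1, 4, 100)]
```

With power_t=0.5 the rate after 100 iterations is one tenth of the initial rate; larger values of power_t make it decay faster.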

get_params()

Returns the estimated parameters.

new_sample(x, y)

Performs one step of SGD on a new sample x, y.

Parameters
  • x (numpy.ndarray) –

  • y (int) –
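To illustrate what a single SGD step can look like, here is a minimal sketch for the standard probit model (p = 2). It is not the package's implementation: the helper names are hypothetical, the labels are assumed to be encoded as -1 and +1, and the loss is assumed to be the negative log-likelihood -log Φ(y · ⟨x, beta⟩).

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sgd_step(beta, x, y, learning_rate):
    """One SGD step on -log Phi(y * <x, beta>), with y in {-1, +1} (assumption)."""
    z = y * sum(b * xi for b, xi in zip(beta, x))
    # Floor the CDF to avoid a division by zero for very negative z.
    ratio = std_normal_pdf(z) / max(std_normal_cdf(z), 1e-300)
    # The gradient of the loss is -y * ratio * x, so the step adds y * ratio * x.
    return [b + learning_rate * y * ratio * xi for b, xi in zip(beta, x)]


# One pass over a tiny, linearly separable stream of (x, y) samples,
# using the inverse scaling schedule with the default arguments.
beta = [0.0]
stream = [([1.0], 1), ([-1.0], -1)] * 50
for t, (x, y) in enumerate(stream, start=1):
    beta = sgd_step(beta, x, y, 0.1 / t ** 0.5)
```

Because every sample in this toy stream is classified correctly by a positive coefficient, each step moves beta[0] further in the positive direction.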

class efficient_probit_regression.probit_model.ProbitSGD(initial_learning_rate=0.1, power_t=0.5)
Parameters
  • initial_learning_rate (float) –

  • power_t (float) –

efficient_probit_regression.probit_model.p_gen_norm_cdf(x, p)

Returns the cumulative density up to a point x for p >= 1.

efficient_probit_regression.probit_model.p_gen_norm_pdf(x, p)

Returns the density at a point x for p >= 1.
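As an illustration of these two quantities, the sketch below assumes the common parameterization of the p-generalized normal distribution with density p^(1 - 1/p) / (2 Γ(1/p)) · exp(-|x|^p / p), which reduces to the standard normal for p = 2. Whether the package uses exactly this normalization is an assumption, and the function names are hypothetical. The distribution function is approximated by numerical integration to keep the example free of dependencies.

```python
import math


def p_gen_norm_pdf_sketch(x, p):
    """Density p**(1 - 1/p) / (2 * Gamma(1/p)) * exp(-|x|**p / p)."""
    norm_const = p ** (1.0 - 1.0 / p) / (2.0 * math.gamma(1.0 / p))
    return norm_const * math.exp(-abs(x) ** p / p)


def p_gen_norm_cdf_sketch(x, p, n_intervals=2000):
    """CDF via composite Simpson integration of the density over [0, |x|]."""
    upper = abs(x)
    h = upper / n_intervals  # n_intervals must be even for Simpson's rule
    total = p_gen_norm_pdf_sketch(0.0, p) + p_gen_norm_pdf_sketch(upper, p)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * p_gen_norm_pdf_sketch(i * h, p)
    half_mass = total * h / 3.0
    # The density is symmetric around 0, so half of the mass lies below 0.
    return 0.5 + math.copysign(half_mass, x)
```

For p = 1 the same formula gives the Laplace distribution, which has heavier tails than the standard normal.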

Sampling

class efficient_probit_regression.sampling.ReservoirSampler(sample_size, d)

Implementation of a reservoir sampler as described in “A general purpose unequal probability sampling plan” by M. T. Chao, adapted here for row sampling of datasets consisting of a data matrix X and a label vector y.

Parameters
  • sample_size (int) – Number of rows in the resulting sample.

  • d (int) – Number of columns of the sample. The whole sample has shape (sample_size, d).

get_sample()

Returns the sample of X and the sample of y.

insert_record(row, label, weight)

Insert a data record consisting of a row and a label. The record will be sampled with a probability that is proportional to the given weight.

Parameters
  • row (numpy.ndarray) –

  • label (float) –

  • weight (float) –
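The idea behind weighted reservoir sampling can be sketched as follows. This is a simplified version of Chao's procedure and not the package's implementation: the class name is hypothetical, records are kept in a plain list instead of a preallocated array, and the special handling of records whose weight is so large that the acceptance probability would exceed 1 is omitted.

```python
import random


class SimpleChaoReservoir:
    """Simplified weighted reservoir sampler (illustrative sketch)."""

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size
        self.total_weight = 0.0
        self.records = []
        self.rng = random.Random(seed)

    def insert_record(self, row, label, weight):
        self.total_weight += weight
        # Fill the reservoir with the first sample_size records.
        if len(self.records) < self.sample_size:
            self.records.append((row, label))
            return
        # Accept the new record with probability proportional to its weight
        # and let it replace a uniformly chosen record in the reservoir.
        accept_prob = self.sample_size * weight / self.total_weight
        if self.rng.random() < accept_prob:
            slot = self.rng.randrange(self.sample_size)
            self.records[slot] = (row, label)

    def get_sample(self):
        rows = [row for row, _ in self.records]
        labels = [label for _, label in self.records]
        return rows, labels


# Stream 100 records through a reservoir of size 5.
sampler = SimpleChaoReservoir(sample_size=5, seed=0)
for i in range(100):
    sampler.insert_record(row=[float(i), 1.0], label=i % 2, weight=1.0 + i)
rows, labels = sampler.get_sample()
```

The sampler needs only one pass over the data and stores sample_size records at any time, which is what makes it suitable for streams.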

efficient_probit_regression.sampling.compute_leverage_scores(X, p=2, fast_approx=False)

Computes leverage scores.

Parameters

X (numpy.ndarray) –
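For p = 2, leverage scores are commonly defined as the squared row norms of an orthonormal basis Q of the column space of X. The sketch below shows this textbook computation with NumPy. The function name is hypothetical, and the sketch makes no claim about how the package handles other values of p or the fast approximation.

```python
import numpy as np


def l2_leverage_scores(X):
    """Squared row norms of Q from a reduced QR decomposition of X."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
scores = l2_leverage_scores(X_demo)
```

Each score lies between 0 and 1, and for a matrix with full column rank the scores sum to the number of columns.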

efficient_probit_regression.sampling.fast_QR(X, p=2)

Returns Q of a fast QR decomposition of X.

efficient_probit_regression.sampling.leverage_score_sampling(X, y, sample_size, augmented=False, online=False, round_up=False, precomputed_scores=None, p=2, fast_approx=False)

Draw a leverage score weighted sample of X and y without replacement.

Parameters
  • X (numpy.ndarray) – Data matrix.

  • y (numpy.ndarray) – Label vector.

  • sample_size (int) – Sample size.

  • augmented (bool) – Whether to add the additive 1/W term, where W is the sum of all weights.

  • online (bool) – Compute online leverage scores in one pass over the data.

  • round_up (bool) – Round the leverage scores up to the nearest power of two.

  • precomputed_scores (Optional[numpy.ndarray]) – To avoid recomputing the leverage scores every time, pass the precomputed scores here.

  • p (int) – The order of the p-generalized probit model.

  • fast_approx (bool) – Whether to use the fast leverage score approximation algorithm.

Returns

X_reduced

The reduced data matrix.

y_reduced

The reduced label vector.

w

The corresponding sample weights.
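A score-weighted sample with matching weights could be drawn roughly as follows. This is a sketch under assumptions, not the package's code: the function name is hypothetical, the scores are assumed to be strictly positive, and the weights are assumed to be the inverse of the sampling probability times the sample size, which is a common choice for such reductions.

```python
import numpy as np


def score_weighted_sample(X, y, scores, sample_size, round_up=False, seed=None):
    """Sample rows without replacement, proportional to the given scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    if round_up:
        # Round every score up to the nearest power of two.
        scores = 2.0 ** np.ceil(np.log2(scores))
    probs = scores / scores.sum()
    idx = rng.choice(len(y), size=sample_size, replace=False, p=probs)
    # Assumed weighting: rows that are unlikely to be drawn get large weights.
    w = 1.0 / (probs[idx] * sample_size)
    return X[idx], y[idx], w


X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20) % 2
demo_scores = np.linspace(0.1, 1.0, 20)
X_red, y_red, w = score_weighted_sample(
    X_demo, y_demo, demo_scores, sample_size=5, seed=0
)
```

Passing precomputed scores into such a function mirrors the purpose of the precomputed_scores argument: the scores are computed once and reused for several sample sizes.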

efficient_probit_regression.sampling.logit_sampling(X, y, sample_size)

Logit sampling as described in the 2018 paper “On Coresets for Logistic Regression”.

Returns X_reduced, y_reduced, weights

Parameters
  • X (numpy.ndarray) –

  • y (numpy.ndarray) –

  • sample_size (int) –

efficient_probit_regression.sampling.online_ridge_leverage_score_sampling(X, y, sample_size, augmentation_constant=None, lambda_ridge=1e-06)

Sample X and y proportional to the online ridge leverage scores.

Parameters
  • X (numpy.ndarray) –

  • y (numpy.ndarray) –

  • sample_size (int) –

  • augmentation_constant (Optional[float]) –

  • lambda_ridge (float) –

efficient_probit_regression.sampling.truncated_normal(a, b, mean, std, size, random_state=None)

Use rejection sampling if the interval [a, b] covers at least a probability mass of 5%. Otherwise use the implementation given in scipy.stats.truncnorm.

The parameters a and b specify the actual interval where the probability mass is located, mean and std specify the original normal distribution.

Parameters
  • a (numpy.ndarray) –

  • b (numpy.ndarray) –

  • mean (numpy.ndarray) –

  • std (numpy.ndarray) –
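The rejection sampling branch can be illustrated with a scalar sketch: draw from the original normal distribution and keep only the draws that fall into [a, b]. The function below is hypothetical and works on scalars only, whereas the documented function accepts arrays. It also does not include the fallback to scipy.stats.truncnorm for intervals with little probability mass, where rejection sampling would discard most draws.

```python
import random


def truncated_normal_rejection(a, b, mean, std, size, seed=None):
    """Draw `size` samples from N(mean, std**2) restricted to [a, b]."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < size:
        draw = rng.gauss(mean, std)
        if a <= draw <= b:  # reject draws outside the interval
            samples.append(draw)
    return samples


draws = truncated_normal_rejection(-1.0, 2.0, 0.0, 1.0, size=100, seed=0)
```

The expected number of draws per accepted sample is the inverse of the probability mass of [a, b], which is why a threshold such as the 5% mentioned above is a sensible point to switch methods.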

efficient_probit_regression.sampling.uniform_sampling(X, y, sample_size)

Draw a uniform sample of X and y without replacement.

Parameters
  • X (numpy.ndarray) – A matrix of size (n, d).

  • y (numpy.ndarray) – A vector of size (n,).

  • sample_size (int) – The size of the sample that should be drawn.

Returns

The reduced sample (X_reduced, y_reduced).
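Uniform sampling without replacement amounts to choosing sample_size distinct row indices and applying them to both X and y. A minimal NumPy sketch, with a hypothetical function name and an added seed argument that the documented function does not have:

```python
import numpy as np


def uniform_sample(X, y, sample_size, seed=None):
    """Return sample_size distinct rows of X together with their labels."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    return X[idx], y[idx]


X_all = np.arange(30, dtype=float).reshape(10, 3)
y_all = X_all[:, 0]  # labels equal to the first column, to check alignment
X_small, y_small = uniform_sample(X_all, y_all, sample_size=4, seed=0)
```

Using the same index array for X and y keeps every sampled row paired with its own label.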

Datasets

class efficient_probit_regression.datasets.BaseDataset(add_intercept=True, use_caching=True, cache_dir=None)
get_X()

Returns the data matrix from the data object.

get_beta_opt(p)

Returns the optimized/estimated parameters beta.

Parameters

p (int) – The order of the p-generalized probit model.

Returns

The estimated parameters.

get_d()

Returns the dimension of the data. d can also be regarded as the number of features.

get_n()

Returns the number of rows of the data.

get_y()

Returns the target data y.

class efficient_probit_regression.datasets.Covertype(use_caching=True)

Dataset Homepage: https://archive.ics.uci.edu/ml/datasets/Covertype

get_name()

Returns the name of the data set.

class efficient_probit_regression.datasets.Example2D
get_name()

Returns the name of the data set.

load_X_y()

Loads data set and returns X and y.

class efficient_probit_regression.datasets.Iris(use_caching=True)
get_name()

Returns the name of the data set.

load_X_y()

Loads data and returns X and y.

class efficient_probit_regression.datasets.KDDCup(use_caching=True)

Dataset Homepage: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

get_name()

Returns the name of the data set.

class efficient_probit_regression.datasets.Webspam(drop_sparse_columns=True, use_caching=True)

Dataset Source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#webspam

get_name()

Returns the name of the data set.

load_X_y()

Loads data and returns X and y.

efficient_probit_regression.datasets.add_intercept(X)

Adds an intercept to the data matrix X.

Experiments

class efficient_probit_regression.experiments.BaseExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename)
get_config_grid()

Returns a list of configurations that are used to run the experiments.

abstract get_reduced_X_y_weights(config)

Abstract method that each experiment overrides to return the reduced matrix X, label vector y and weights that correspond to an experimental config.

optimize(X, y, w)

Optimize the Probit regression problem given by X, y and w.

Parameters
  • X – Data matrix.

  • y – Label vector.

  • w – Weights.

run(parallel=False, n_jobs=4)

Run the experiment.

Parameters
  • parallel (bool) – Defaults to False.

  • n_jobs (int) – Defaults to 4.

class efficient_probit_regression.experiments.BaseExperimentBayes(dataset, num_runs, min_size, max_size, step_size, prior_mean, prior_cov, samples_per_chain, num_chains, burn_in)
Parameters
  • dataset (efficient_probit_regression.datasets.BaseDataset) –

  • num_runs (int) –

  • min_size (int) –

  • max_size (int) –

  • step_size (int) –

  • prior_mean (numpy.ndarray) –

  • prior_cov (numpy.ndarray) –

  • samples_per_chain (int) –

  • num_chains (int) –

  • burn_in (int) –

abstract get_method_name()

Returns the name of the method, like “uniform”, “leverage”, or “leverage_online”.

class efficient_probit_regression.experiments.LeverageScoreSamplingExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename, only_compute_once=True, online=False, round_up=True, fast_approx=False)
Parameters

dataset (efficient_probit_regression.datasets.BaseDataset) –

get_reduced_X_y_weights(config)

Returns reduced X, y and weights.

run(**kwargs)

Run the experiment.

Parameters
  • parallel (bool) – Defaults to False.

  • n_jobs (int) – Defaults to 4.

class efficient_probit_regression.experiments.LeverageScoreSamplingExperimentBayes(dataset, num_runs, min_size, max_size, step_size, prior_mean, prior_cov, samples_per_chain, num_chains, burn_in)
Parameters
  • dataset (efficient_probit_regression.datasets.BaseDataset) –

  • num_runs (int) –

  • min_size (int) –

  • max_size (int) –

  • step_size (int) –

  • prior_mean (numpy.ndarray) –

  • prior_cov (numpy.ndarray) –

  • samples_per_chain (int) –

  • num_chains (int) –

  • burn_in (int) –

get_method_name()

Returns the name of the method, like “uniform”, “leverage”, or “leverage_online”.

class efficient_probit_regression.experiments.LewisSamplingExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename, fast_approx=True)
Parameters

dataset (efficient_probit_regression.datasets.BaseDataset) –

get_reduced_X_y_weights(config)

Returns reduced X, y and weights.

class efficient_probit_regression.experiments.LogitSamplingExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename)

Returns reduced X, y and weights.

Parameters
get_reduced_X_y_weights(config)

Abstract method that each experiment overrides to return the reduced matrix X, label vector y and weights that correspond to an experimental config.

class efficient_probit_regression.experiments.OnlineLeverageScoreSamplingExperimentBayes(dataset, num_runs, min_size, max_size, step_size, prior_mean, prior_cov, samples_per_chain, num_chains, burn_in)
Parameters
  • dataset (efficient_probit_regression.datasets.BaseDataset) –

  • num_runs (int) –

  • min_size (int) –

  • max_size (int) –

  • step_size (int) –

  • prior_mean (numpy.ndarray) –

  • prior_cov (numpy.ndarray) –

  • samples_per_chain (int) –

  • num_chains (int) –

  • burn_in (int) –

get_method_name()

Returns the name of the method, like “uniform”, “leverage”, or “leverage_online”.

class efficient_probit_regression.experiments.OnlineRidgeLeverageScoreSamplingExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename)
Parameters

dataset (efficient_probit_regression.datasets.BaseDataset) –

get_reduced_X_y_weights(config)

Returns reduced X, y and weights.

class efficient_probit_regression.experiments.SGDExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename)
Parameters

dataset (efficient_probit_regression.datasets.BaseDataset) –

get_reduced_X_y_weights(config)

In SGD, no reduction is performed.

optimize(X, y, w)

Applies SGD in one pass over the data.

class efficient_probit_regression.experiments.UniformSamplingExperiment(p, num_runs, min_size, max_size, step_size, dataset, results_filename)
Parameters

dataset (efficient_probit_regression.datasets.BaseDataset) –

get_reduced_X_y_weights(config)

Reduces the sample weights and returns them.

class efficient_probit_regression.experiments.UniformSamplingExperimentBayes(dataset, num_runs, min_size, max_size, step_size, prior_mean, prior_cov, samples_per_chain, num_chains, burn_in)
Parameters
  • dataset (efficient_probit_regression.datasets.BaseDataset) –

  • num_runs (int) –

  • min_size (int) –

  • max_size (int) –

  • step_size (int) –

  • prior_mean (numpy.ndarray) –

  • prior_cov (numpy.ndarray) –

  • samples_per_chain (int) –

  • num_chains (int) –

  • burn_in (int) –

get_method_name()

Returns the name of the method, like “uniform”, “leverage”, or “leverage_online”.

Settings