Statistics¶
Collection of tools for calculating statistics.
CDF¶
Cumulative Distribution Function¶
- pygeostat.statistics.cdf.cdf(var, weights=None, lower=None, upper=None, bins=None)¶ Calculates an empirical CDF using the standard method of the midpoint of a histogram bin. Assumes that the data is already trimmed or that iwt=0 for variables which are not valid.
If ‘bins’ is provided, a binned histogram approach is used; otherwise, the CDF is constructed using all points (as done in GSLIB).
Note: ‘lower’ and ‘upper’ limits for the CDF may be supplied and are reflected in the returned values.
Parameters: - var (array) – Array passed to the cdf function
- weights (str) – Column where the weights are stored
- lower (float) – Lower limit
- upper (float) – Upper limit
- bins (int) – Number of bins to use
Returns: - midpoints (np.ndarray) – Array of bin midpoints
- cdf (np.ndarray) – CDF values for each midpoint
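A minimal sketch of building an empirical CDF from a synthetic sample. The data are invented, and the (midpoints, cdf) unpacking follows the documented returns above rather than a verified run.

    import numpy as np
    from pygeostat.statistics.cdf import cdf

    # Synthetic lognormal "grades" (hypothetical data, no declustering weights)
    np.random.seed(73073)
    var = np.random.lognormal(mean=0.0, sigma=0.5, size=1000)

    # All-points CDF (GSLIB-style); assumed to return (midpoints, cdf) as documented
    midpoints, cdfvals = cdf(var)

    # Binned alternative with explicit limits
    mid_b, cdf_b = cdf(var, lower=0.0, upper=float(var.max()), bins=25)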
Percentile from CDF¶
- pygeostat.statistics.cdf.percentile_from_cdf(cdf_x, cdf, percentile)¶ Given ‘x’ values of a CDF and the corresponding CDF values, find a given percentile. The percentile may be a single value or an array-like and must be in [0, 100] or the CDF bounds.
z_percentile¶
- pygeostat.statistics.cdf.z_percentile(z, cdf_x, cdf)¶ Given ‘cdf_x’ values of a CDF and the corresponding ‘cdf’ values, find the percentile of a given value ‘z’. The percentile may be a single value or an array-like and must be in [0, 100] or the CDF bounds.
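A hedged sketch tying the two lookups together. It assumes cdf() returns (midpoints, cdf) and that percentile_from_cdf returns the data value at the requested percentile, as suggested by the documentation above.

    import numpy as np
    from pygeostat.statistics.cdf import cdf, percentile_from_cdf, z_percentile

    np.random.seed(73073)
    var = np.random.lognormal(mean=0.0, sigma=0.5, size=1000)
    midpoints, cdfvals = cdf(var)

    # Data value at the 50th percentile (assumed return convention)
    p50 = percentile_from_cdf(midpoints, cdfvals, 50)

    # Percentile of that value; expected to be near 50 if the two are inverses
    pctl = z_percentile(p50, midpoints, cdfvals)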
Build Indicator CDF¶
- pygeostat.statistics.cdf.build_indicator_cdf(prob_ik, zvals)¶ Build the X-Y data required to plot a categorical CDF.
Parameters: - prob_ik (np.ndarray) – The p-values corresponding to the cutoffs
- zvals (np.ndarray) – The corresponding z-values specifying the cutoffs
Returns: - points (np.ndarray) – The x and y coordinates of the cutoffs
- midpoints (np.ndarray) – The x and y coordinates of the midpoints for each cutoff
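A hedged sketch of preparing step-plot coordinates from hypothetical indicator kriging output; the cutoffs and probabilities are invented for illustration.

    import numpy as np
    from pygeostat.statistics.cdf import build_indicator_cdf

    # Hypothetical indicator kriging result: probabilities at three cutoffs
    zvals = np.array([0.5, 1.0, 2.5])       # cutoff z-values
    prob_ik = np.array([0.25, 0.60, 0.90])  # non-decreasing p-values at each cutoff

    # Assumed to return the cutoff and midpoint coordinates, as documented
    points, midpoints = build_indicator_cdf(prob_ik, zvals)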
Kernel Density Estimation Functions¶
Univariate KDE with StatsModels¶
- pygeostat.statistics.kde.kde_statsmodels_u(x, x_grid, bandwidth=0.2, **kwargs)¶ Univariate Kernel Density Estimation with Statsmodels
Multivariate KDE with StatsModels¶
- pygeostat.statistics.kde.kde_statsmodels_m(x, x_grid, bandwidth=0.2, **kwargs)¶ Multivariate Kernel Density Estimation with Statsmodels
KDE with Scikit-learn¶
- pygeostat.statistics.kde.kde_sklearn(x, x_grid, bandwidth=0.2, **kwargs)¶ Kernel Density Estimation with Scikit-learn
KDE with Scipy¶
- pygeostat.statistics.kde.kde_scipy(x, x_grid, bandwidth=0.2, **kwargs)¶ Kernel Density Estimation with Scipy. Note that scipy weights its bandwidth by the covariance of the input data; to make the results comparable to the other methods, the bandwidth is divided by the sample standard deviation here.
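Since the estimators share a call signature, a single hedged comparison sketch suffices. The sample and grid are synthetic, and each function is assumed to return the estimated density evaluated on x_grid.

    import numpy as np
    from pygeostat.statistics.kde import kde_statsmodels_u, kde_sklearn, kde_scipy

    # Synthetic univariate sample and evaluation grid
    np.random.seed(541)
    x = np.random.normal(loc=0.0, scale=1.0, size=500)
    x_grid = np.linspace(-4.0, 4.0, 200)

    # Same bandwidth for all; assumed to return the density on x_grid
    pdf_sm = kde_statsmodels_u(x, x_grid, bandwidth=0.2)
    pdf_sk = kde_sklearn(x, x_grid, bandwidth=0.2)
    pdf_sp = kde_scipy(x, x_grid, bandwidth=0.2)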
Weighted Statistics¶
Weighted Variance¶
- pygeostat.statistics.utils.weighted_variance(var, wts)¶ Calculates the weighted variance
Weighted Skewness¶
- pygeostat.statistics.utils.weighted_skew(var, wts)¶ Calculates the weighted skewness
Weighted Kurtosis¶
- pygeostat.statistics.utils.weighted_kurtosis(var, wts)¶ Calculates the weighted kurtosis
Weighted Correlation¶
- pygeostat.statistics.utils.weighted_correlation(x, y, wt)¶ Calculates the weighted correlation
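A hedged sketch exercising the four weighted statistics on synthetic, declustering-weighted data; the arrays below are invented for illustration.

    import numpy as np
    from pygeostat.statistics.utils import (weighted_variance, weighted_skew,
                                            weighted_kurtosis, weighted_correlation)

    # Synthetic correlated variables with uneven (declustering-style) weights
    np.random.seed(12)
    x = np.random.normal(size=300)
    y = 0.7 * x + np.random.normal(scale=0.5, size=300)
    wts = np.random.uniform(0.5, 1.5, size=300)

    wvar = weighted_variance(x, wts)
    wskew = weighted_skew(x, wts)
    wkurt = weighted_kurtosis(x, wts)
    wcorr = weighted_correlation(x, y, wts)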
Assorted Stats Functions¶
Nearest Positive Definite Correlation Matrix¶
- pygeostat.statistics.utils.near_positive_definite(input_matrix)¶ This function uses R to calculate the nearest positive definite matrix within Python. An installation of R with the library “Matrix” is required, and the module rpy2 is also needed.
The only requirement is an input matrix, which can be either a pandas dataframe or a numpy array.
Parameters: input_matrix – Input numpy array or pandas dataframe (not a numpy matrix)
Returns: Nearest positive definite matrix
Return type: np.array
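A hedged usage sketch: the matrix below is invented and deliberately not positive definite, and running it requires R with the “Matrix” package plus rpy2, as stated above.

    import numpy as np
    from pygeostat.statistics.utils import near_positive_definite

    # A symmetric "correlation" matrix that is not positive definite
    # (it has one negative eigenvalue)
    corr = np.array([[1.00, 0.90, 0.70],
                     [0.90, 1.00, 0.95],
                     [0.70, 0.95, 1.00]])

    # Requires an R installation with the "Matrix" library and the rpy2 module
    corr_pd = near_positive_definite(corr)

    # The repaired matrix is expected to admit a Cholesky factorization
    np.linalg.cholesky(corr_pd)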
Accuracy Plot Statistics - Simulation¶
- pygeostat.statistics.utils.accsim(truth, reals, pinc=0.05)¶ Calculates the proportion of locations where the true value falls within symmetric p-PI intervals when completing a jackknife study. A portion of the data is excluded from the conditioning dataset, and the simulated values at the excluded sample locations are then checked.
See also
Pyrcz, M. J., & Deutsch, C. V. (2014). Geostatistical Reservoir Modeling (2nd ed.). New York, NY: Oxford University Press, p. 350-351.
Parameters: - truth – Tidy (long-form) 1D data where a single column contains the true values. A pandas dataframe/series or numpy array can be passed
- reals – Tidy (long-form) 2D data where a single column contains values from a single realization and each row contains the simulated values for a single truth location. A pandas dataframe or numpy matrix can be passed
Keyword Arguments: pinc (float) – Increments between the probability intervals to calculate within (0, 1)
Returns: - propavg (pd.DataFrame) – Dataframe with the calculated probability intervals and the fraction within each interval
- sumstats (dict) – Dictionary containing the average variance (U), mean squared error (MSE), accuracy measure (acc), precision measure (pre), and a goodness measure (goo)
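A hedged sketch of a jackknife-style check on synthetic data; the (propavg, sumstats) unpacking follows the documented returns but is not a verified run.

    import numpy as np
    import pandas as pd
    from pygeostat.statistics.utils import accsim

    # Synthetic jackknife setup: 200 held-out truth values, 50 realizations each
    np.random.seed(515)
    truth = pd.Series(np.random.normal(size=200))
    reals = pd.DataFrame(truth.values[:, None] +
                         np.random.normal(scale=0.5, size=(200, 50)))

    propavg, sumstats = accsim(truth, reals, pinc=0.05)
    print(propavg.head())
    print(sumstats)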
Accuracy Plot Statistics - CDF thresholds¶
- pygeostat.statistics.utils.accmik(truth, thresholds, mikprobs, pinc=0.05)¶ Similar to accsim but accepting MIK distributions instead. Mostly pulled from accsim.
Parameters: - truth (np.ndarray) – Tidy (long-form) 1D data where a single column contains the true values. A pandas dataframe/series or numpy array can be passed
- thresholds (np.ndarray) – 1D array of thresholds; each location’s CDF is defined by these thresholds and the probabilities given in the mikprobs array
- mikprobs (np.ndarray) – Tidy (long-form) 2D data where a single column contains values from a single MIK cutoff and each row contains the simulated values for the corresponding truth location. A pandas dataframe or numpy matrix can be passed
- pinc (float) – Increments between the probability intervals to calculate within (0, 1)
Returns: - propavg (pd.DataFrame) – Dataframe with the calculated probability intervals and the fraction within the interval
- sumstats (dict) – Dictionary containing the average variance (U), mean squared error (MSE), accuracy measure (acc), precision measure (pre), and a goodness measure (goo)
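A hedged companion sketch to the accsim example; the thresholds and MIK probabilities are synthetic, with each row sorted so the per-location CDF is non-decreasing.

    import numpy as np
    from pygeostat.statistics.utils import accmik

    # Synthetic MIK output: 100 truth locations, CDFs defined at 5 thresholds
    np.random.seed(616)
    truth = np.random.lognormal(size=100)
    thresholds = np.percentile(truth, [10, 30, 50, 70, 90])
    mikprobs = np.sort(np.random.uniform(size=(100, 5)), axis=1)

    propavg, sumstats = accmik(truth, thresholds, mikprobs, pinc=0.05)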
PostSim¶
- pygeostat.statistics.postsim.postsim_multfiles(file_base_or_list, output_name, Nr=None, file_ending=None, fltype=None, output_fltype=None, zero_padding=0, variables=None, var_min=None)¶ The multiple-file postsim function uses recursive statistics for memory management and coolness factor; see http://people.revoledu.com/kardi/tutorial/RecursiveStatistic/. This function takes multiple realizations and post-processes the results into a mean and variance for each variable. You can either pass it a list of files to iterate through, or a file base name and the number of realizations.
Parameters: - file_base_or_list (list or str) – List of files, or path + base name of sequentially named files
- output_name (str) – Path (or name) of the file to write output to
- Nr (int) – Number of realizations. Needed if a file base name is passed.
- file_ending (str) – File ending (e.g. “out”). Used if a file base name is passed. The period is not included.
- fltype (str) – Type of data file: either csv, gslib, hdf5, or gsb. Used if a file base name is passed and file_ending is not used.
- output_fltype (str) – Type of output data file: either csv, gslib, hdf5, or gsb
- zero_padding (int) – Number of zeros to pad the number in sequentially named files with. Default is 0.
- variables (str) – List of variables to process
- var_min (list or float) – Minimum trimming limit to use. If one value is passed, the trimming limit is applied to all variables; alternatively, a list with a trimming limit for each variable can be passed.
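A hedged call sketch: the file base name, variable name, and trimming limit are hypothetical, and the files sgsim_001.out through sgsim_100.out are assumed to exist.

    from pygeostat.statistics.postsim import postsim_multfiles

    # Post-process 100 sequentially named realizations (sgsim_001.out ... sgsim_100.out)
    # into per-variable mean and variance written to postsim.out
    postsim_multfiles('sgsim_', 'postsim.out',
                      Nr=100, file_ending='out', zero_padding=3,
                      variables=['Au'], var_min=0.0)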