pareidolia package¶
Subpackages¶
Submodules¶
pareidolia.detection module¶
Functions used for change detection. cmdoret, 20200403
- pareidolia.detection.get_sse_mat(mats)[source]¶
Given multiple matrices, return the matrix of position-wise sum of squared errors to median. All input matrices must have the same shape, nonzero coordinates and format. Output matrix will have the same format as inputs.
- Parameters
mats (Iterable of scipy.sparse.spmatrix) – The list of data matrices from different samples.
- Returns
A sparse matrix where each nonzero pixel is the sum of squared difference of all samples to their median background, at the corresponding nonzero pixel.
- Return type
scipy.sparse.spmatrix
- pareidolia.detection.get_win_density(mat, win_size=3, sym_upper=False)[source]¶
Compute local pixel density in sparse matrices using convolution. The convolution is performed in ‘full’ mode: Computations are performed all the way to the edges by trimming the kernel.
- Parameters
mat (scipy.sparse.csr_matrix) – The input sparse matrix to convolve.
win_size (int) – The size of the area in which to compute the proportion of nonzero pixels. This will be the kernel size used in convolution.
sym_upper (bool) – Whether the input matrix is symmetric upper, in which case only the upper triangle is returned.
- Returns
The result of the convolution of the uniform kernel (win_size x win_size) on the binarized input matrix. Each value represents the proportion of nonzero pixels in the neighbourhood.
- Return type
scipy.sparse.csr_matrix
- pareidolia.detection.median_bg(mats)[source]¶
Given multiple sparse matrices with the same shape, format and nonzero coordinates, generate a background matrix made up of the median signal
- Parameters
mats (Iterable of scipy.sparse.spmatrix:) – A list of data matrices from different samples.
- Returns
The median background matrix, where each pixel is the median of the corresponding pixel from all samples.
- Return type
scipy.sparse.spmatrix
- pareidolia.detection.reps_bg_diff(mats)[source]¶
Given multiple sample matrices, return a 1D array of all differences between each sample matrix’s pixels and their corresponding median background value. All input matrices must have the same shape, nonzero coordinates and format.
- Parameters
mats (Iterable of scipy.sparse.spmatrix) – The list of data matrices from different samples.
- Returns
The distribution of pixel differences to the median background. If there are S matrices of P nonzero pixels, this 1D array will contain P*S elements.
- Return type
numpy.ndarray of floats
pareidolia.hic_utils module¶
Functions used to apply Hi-C specific transformations and interact with cooler objects. cmdoret, 20200404
- pareidolia.hic_utils.change_detection_pipeline(cool_files, conditions, kernel='loops', bed2d_file=None, region=None, max_dist=None, min_dist=None, subsample=True, pearson_thresh=None, density_thresh=0.1, cnr_thresh=1.0, n_cpus=4)[source]¶
Run end to end pattern change detection pipeline on input cool files. A list of conditions of the same lengths as the sample list must be provided. The first condition in the list is used as the reference (control) state.
Changes for a specific pattern are computed. A valid chromosight pattern name can be supplied (e.g. loops, borders, hairpins, …) or a kernel matrix can be supplied directly instead. maximum scanning distance can be specified directly (in basepairs) to override the kernel default value.
Positive diff_scores mean the pattern intensity was increased relative to control (first condition).
Positions with significant changes will be reported in a pandas dataframe. In addition to the score, a contrast-to-noise ratio between 0 and 10 is given to give an estimation of the signal quality. Optionally, a 2D bed file with positions of interest can be specified, in which case change value at these positions will be reported instead. When using a bed2d file.
- Parameters
cool_files (Iterable[str]) – The list of paths to cool files for the input samples.
conditions (Iterable[str]) – The list of conditions matching the samples.
kernel (Union[numpy.ndarray, str]) – Either the kernel to use as pattern as a numpy array, or the name of a valid chromosight pattern.
bed2d_file (Optional[str]) – Path to a bed2D file containing a list of 2D positions. If this is provided, pattern changes at these coordinates will be quantified. Otherwise, they will be detected based on a threshold.
region (Optional[Union[Iterable[str], str]]) – Either a single UCSC format region string, or a list of multiple regions. The analysis will be restricted to those regions.
max_dist (Optional[int]) – Maximum interaction distance (in basepairs) to consider in the analysis. If this is not specified and a chromosight kernel was specified, the default max_dist for that kernel is used. If the case of a custom kernel, the whole matrix will be scanned if no max_dist is specified.
subsample (bool) – Whether all input matrices should be subsampled to the same number of contacts as the least covered sample.
pearson_thresh (Optional[float]) – The pearson correlation threshold to use when detecting patterns. If None, the default value for the kernel is used.
density_thresh (Optional[float]) – The pixel density threshold to require. Low coverage windows with a proportion of nonzero pixels below this value are discarded.
n_cpus (int) – Number of CPU cores to allocate for parallel operations.
- Returns
The list of reported 2D coordinates and their change intensities.
- Return type
pd.DataFrame
- pareidolia.hic_utils.coords_to_bins(clr, coords)[source]¶
Converts genomic coordinates to a list of bin ids based on the whole genome contact map.
- Parameters
coords (pandas.DataFrame) – Table of genomic coordinates, with columns chrom, pos.
clr (cooler.api.Cooler) –
- Returns
Indices in the whole genome matrix contact map.
- Return type
numpy.array of ints
- pareidolia.hic_utils.detection_matrix(samples, kernel, region=None, subsample=None, max_dist=None, pearson_thresh=None, density_thresh=None, cnr_thresh=0.3, n_cpus=4)[source]¶
Run the detection process for a single chromosome or region. This is abstracted from all notions of chromosomes and genomic coordinates.
- pareidolia.hic_utils.get_min_contacts(coolers, region=None)[source]¶
Given several cooler objects, returns the number of contacts in the least covered matrix. Optionally, a region can be given in UCSC format, in which case coverage will be computed only in the matrix from that region.
- pareidolia.hic_utils.make_density_filter(mats, density_thresh=0.1, win_size=3, sym_upper=False)[source]¶
Given a list of sparse matrices, generate a “density filter”. This new sparse matrix is a boolean mask where values indicate whether the proportion of nonzero pixels in the neighbourhood of diameter win_size is above the input threshold in all input matrices.
- Parameters
mats (Iterable of scipy.sparse.csr_matrix) – The matrices to be combined into a filter.
density_thresh (float) – The required proportion of nonzero pixels in the neighbourhood pass the filter.
win_size (int) – The diameter of the neighbourhood in which to compute the proportion of nonzero pixels.
sym_upper (bool) – Whether the matrix is symmetric upper. In this case, computations are performed in the upper triangle.
- Returns
The sparse boolean mask representing the density filter. Values are True where all input matrices passed the threshold.
- Return type
scipy.sparse.csr_matrix of bools
- pareidolia.hic_utils.preprocess_hic(clr, min_contacts=None, region=None)[source]¶
Given an input cooler object, returns the preprocessed Hi-C matrix. Preprocessing involves (in that order): subsetting region, subsampling contacts, normalisation, detrending (obs / exp). Balancing weights must be pre-computer in the referenced cool file. Region must be in UCSC format.
pareidolia.io module¶
Functions used to load and save files. cmdoret, 20200406
pareidolia.preprocess module¶
Functions used to clean and prepare sparse matrices for detection. cmdoret, 20200403
- pareidolia.preprocess.fill_nnz(mat, all_nnz, fill_value=1e-09)[source]¶
Given an input sparse matrix and a superset of nonzero coordinates, fill the matrix to ensure all values in the set are stored explicitely with the value of fill_value.
- Parameters
mat (sp.csr_matrix) – The sparse matrix of a single sample.
all_nnz (np.ndarray of ints) – A 2D array of shape Nx2, where N is the number of nonzero elements to fill. The columns represent the row and column coordinates of those elements.
fill_value (float) – The value to use when filling the nonzero elements in the matrix. Has to be the same datatype as the input matrix.
- Returns
The filled sparse matrix, where all coordinates in all_nnz have been filled with fill_value.
- Return type
sp.csr_matrix
- pareidolia.preprocess.get_common_valid_bins(mats, n_mads=5)[source]¶
Generates an array of valid bins indices, using the intersection of valid bins from all input sparse matrices. All input matrices must be square and have the same shape. Valid bins are defined based on their proportion of nonzero pixels.
- Parameters
mats (Iterable of sp.csr_matrix) – A list sparse matrices representing Hi-C contacts, each matrix represents a sample.
n_mads (float) – A bin is considered missing if its proportion of nonzero pixels is lower than n_mads median absolute deviations below the median of the bin distribution for the whole matrix.
- Returns
A 1D array containing the indices of valid (non-missing) bins.
- Return type
np.ndarray of ints
- pareidolia.preprocess.get_nnz_union(mats)[source]¶
Given a list of sparse matrices, return the union of their nonzero coordinates, in the form of a 2D numpy array with 1 coordinate per row, with 2 columns representing coordinates rows and columns.
- Parameters
mats (Iterable of sp.csr_matrix) – List containing the sparse matrices from each sample.
- Returns
A 2D numpy array containing the union of nonzero coordinates in the input sparse matrices. The array has shape Nx2 where N is the number of coordinates. The columns represent row and column coordinates.
- Return type
np.ndarray of int
pareidolia.stats module¶
Functions used to evaluate statistical significance of changes. cmdoret, 20200403
- pareidolia.stats.pval_to_tval(pval, n_samples)[source]¶
Conversion of a single p-value to corresponding t Given a desired p-value and sample size, return the corresponding t-value. The absolute value for the 2 sided hypothesis is returned.
- pareidolia.stats.tvals_to_pvals(tvals, n_samples)[source]¶
Compute p-values from t-values Given an array of test statistics from multiple t-tests and the (fixed) number of samples used for each test, compute p-values for all tests
- pareidolia.stats.vals_to_percentiles(vals, dist)[source]¶
Return the percentiles corresponding to values in a distribution.
- Parameters
vals (np.ndarray of floats) – The values for which percentiles should be returned.
dist (np.ndarray of floats) – The distribution to use when computing the percentiles.
- Returns
The array of percentiles corresponding to input values
- Return type
np.ndarray of floats
- pareidolia.stats.vec_ttest(arr1, arr2)[source]¶
Given 2 numpy arrays containing data for 2 different conditions, where each array has shape shape rxd where r is the number of replicates and N is the number of independent tests to perform. N must be equal in both array, however the number of replicates can differ.
Module contents¶
pareidolia - Multi-sample change detection in Hi-C patterns