API#
Core class#
- class MATTE.AlignPipe(stats_type='mean', target='cluster', preprocess=True)#
Bases: object
Basic class in MATTE. Stores a list of functions or transformers. The pipeline can be inspected through __str__().
The functions in funcs are called in order, and the results of each function are passed to the next one. Every function should return a dict (using MATTE.utils.kw_decorator() is recommended). Then cluster_func is called to cluster the results.
Note
Use MATTE.AlignPipe.add_step() to add a function to funcs; the order can be changed and functions can be deleted as with a list.
- add_param(**kwargs)#
Add a function that returns parameters to funcs. Multiple parameters can be added at the same time.
- add_step(func, *setting, **kwsetting)#
Add a function to funcs.
- Parameters
func (function) – function to add (should be decorated by kw_decorator())
- add_transformer(transformer)#
Add a transformer to funcs.
- Parameters
transformer (object that has fit and transform methods) – transformer to add
- Raises
TypeError – if transformer is not a transformer
- calculate(df_exp, df_pheno, verbose=True)#
Calculate the data using the pipeline.
- Parameters
df_exp (pandas.DataFrame) – expression data whose index are genes and columns are samples
df_pheno (pandas.Series) – phenotype data whose index are samples
verbose (bool, optional) – defaults to True
- Returns
clustering results
- Return type
- calculate_from_temp(tmpt_result, verbose=True)#
Calculate the data using the pipeline, starting from a saved temporary result.
- Parameters
tmpt_result (dict) – temp result saved by MATTE.AlignPipe.fit_transform()
verbose (bool, optional) – defaults to True
- Returns
clustering results
- Return type
- find_best_KernelTrans_params(df_exp, df_pheno, n_downsample=None, n_iters=None, inplace=True, verbose=True)#
Find the best parameters for Kernel_Transform according to the clustering error.
- Parameters
df_exp (pandas.DataFrame) – expression data whose index are genes and columns are samples
df_pheno (pandas.Series) – phenotype data whose index are samples
n_downsample (int, optional) – number of down sample, defaults to None
n_iters (int, optional) – number of iterations, defaults to None
inplace (bool, optional) – inplace pipeline, defaults to True
verbose (bool, optional) – defaults to True
- Returns
best parameters or AlignPipe
- Return type
AlignPipe or dict
- get_attribute_from_transformer(attribute)#
Get the attribute from the transformer.
- Parameters
attribute (str) – attribute name
- Returns
attribute value
- Return type
any
- init_pipeline(stats_type, target, preprocess: bool)#
Initialize the pipeline.
- Parameters
stats_type (str) – type of statistics
target (str) – target name
preprocess (bool) – whether to preprocess the data
- set_cluster_method(func, *setting, **kwsetting)#
Add a function to cluster_func.
- Parameters
func (function) – function to add (should be decorated by kw_decorator())
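The dict-passing behaviour described for AlignPipe can be illustrated with a minimal sketch. MiniPipe and the three toy steps below are hypothetical names written for this example; they do not reproduce MATTE's implementation, only the idea that each step returns a dict that feeds the next step before a final clustering function runs.

```python
class MiniPipe:
    """Hypothetical stand-in for a pipeline; not MATTE's implementation."""

    def __init__(self):
        self.funcs = []           # ordered steps, editable like a list
        self.cluster_func = None  # final step that produces the clustering

    def add_step(self, func):
        self.funcs.append(func)

    def calculate(self, **state):
        # Each step receives the accumulated state and returns a dict
        # that is merged back into the state before the next step runs.
        for func in self.funcs:
            state.update(func(**state))
        return self.cluster_func(**state)


def center(values, **_):
    mean = sum(values) / len(values)
    return {"values": [v - mean for v in values]}


def scale(values, **_):
    peak = max(abs(v) for v in values) or 1.0
    return {"values": [v / peak for v in values]}


def sign_cluster(values, **_):
    # Toy "clustering": label each value by its sign.
    return {"labels": [int(v > 0) for v in values]}


pipe = MiniPipe()
pipe.add_step(center)
pipe.add_step(scale)
pipe.cluster_func = sign_cluster
result = pipe.calculate(values=[1.0, 2.0, 6.0])
print(result)
```

Because funcs is a plain list in this sketch, steps can be reordered or removed with ordinary list operations before calculate is called.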
- class MATTE.GeneRanker(view, pipeline=None)#
Bases: object
MATTE GeneRanker.
There are several types of GeneRanker:
1. module and gene: genes are clustered according to their expression, and the module SNR is used to rank genes. In gene mode, the SNR is corrected by the correlation between gene expression and the module eigenvector.
2. ‘dist’ and ‘cross-dist’: in dist mode, the distance between each pair of genes is calculated, and genes are ranked by the sum of their distances to all other genes. In cross-dist mode, the distances of differential expression and differential co-expression are merged.
Note
The inputs of GeneRanker are not the same as those of the pipeline: rows of the expression data are samples and columns are genes.
- gene_rank(X, y, verbose=True, **kwargs)#
Rank genes.
- Parameters
X (pd.DataFrame) – Expression data whose index are samples and columns are genes.
y (pd.Series) – phenotypes whose index are samples.
verbose (bool, optional) – defaults to True
- Returns
gene ranking score
- Return type
pd.Series
- gene_rank_dist(X, y, verbose=True, **kwargs)#
Rank genes by distance in each phenotype pair. Running this method sets GeneRanker.dist_mat.
- Parameters
X (pd.DataFrame) – Expression data whose index are samples and columns are genes.
y (pd.Series) – phenotypes whose index are samples.
verbose (bool, optional) – defaults to True
- Returns
gene ranking score
- Return type
pd.Series
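The dist-mode idea of ranking genes by the sum of their distances to all other genes can be sketched as follows. The function name, the toy profiles, the use of Euclidean distance, and the choice to rank larger sums first are all assumptions made for illustration; the source does not state which distance or ordering MATTE uses.

```python
import math


def rank_by_distance_sum(profiles):
    """profiles: dict mapping gene name -> list of numeric features."""
    scores = {}
    for gene, vec in profiles.items():
        # Sum of Euclidean distances from this gene to every other gene.
        scores[gene] = sum(
            math.dist(vec, other)
            for name, other in profiles.items()
            if name != gene
        )
    # Assumed ordering: largest total distance first.
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [3.0, 4.0]}
ranking, scores = rank_by_distance_sum(profiles)
print(ranking)
```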
- gene_rank_module(X: DataFrame, y, verbose=True, **kwargs)#
Rank genes by their MATTE.analysis.ClusterResult.ModuleSNR() in each phenotype pair. The gene scores within one phenotype sum to 1000. Meanwhile, GeneRanker.gene_ranking_sep is set to hold the score for each phenotype.
- Parameters
X (pd.DataFrame) – Expression data whose index are samples and columns are genes.
y (pd.Series) – phenotypes whose index are samples.
verbose (bool, optional) – defaults to True
- Returns
gene ranking score
- Return type
pd.Series
- pipeline_clustering(X, y, verbose=True)#
Clustering the data using pipeline for each pair of phenotypes. A progress bar is set to inform the user about the progress.
- Parameters
X (pandas.DataFrame) – expression data whose index are samples and columns are genes
y (pandas.Series) – phenotype data whose index are samples
verbose (bool, optional) – defaults to True
- save(save_path)#
Save the embedder using dill.
- Parameters
save_path (str) – path to save the embedder
- class MATTE.PipeFunc(func, name=None, *args, **kwargs)#
Bases: object
PipeFunc wraps a function and stores its args and kwargs without running it until the PipeFunc itself is called. It is used in AlignPipe as a step.
- add_params(*args, **kwargs)#
Add params to PipeFunc and refresh its str text.
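A deferred-call wrapper of the kind described for PipeFunc might look like the sketch below. DeferredCall is a hypothetical class written for this example; how MATTE.PipeFunc actually merges stored and call-time arguments, or formats its str text, is not stated in the source.

```python
class DeferredCall:
    """Hypothetical deferred-call wrapper; not MATTE.PipeFunc itself."""

    def __init__(self, func, name=None, *args, **kwargs):
        self.func = func
        self.name = name or func.__name__
        self.args = list(args)
        self.kwargs = dict(kwargs)

    def add_params(self, *args, **kwargs):
        # Stored parameters grow; nothing is executed yet.
        self.args.extend(args)
        self.kwargs.update(kwargs)

    def __call__(self, *args, **kwargs):
        # Call-time arguments are combined with the stored ones.
        return self.func(*self.args, *args, **{**self.kwargs, **kwargs})

    def __str__(self):
        params = [repr(a) for a in self.args]
        params += [f"{k}={v!r}" for k, v in self.kwargs.items()]
        return f"{self.name}({', '.join(params)})"


def power(base, exponent=2):
    return base ** exponent


step = DeferredCall(power, exponent=3)
step.add_params(2)
print(step)    # text form reflects the stored parameters
print(step())  # the wrapped function only runs here
```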
- MATTE.merged_pipeline_clustering(df_exp: DataFrame, df_pheno: Series, pipelines: list, verbose=True)#
Cluster genes using multiple pipelines.
- Parameters
df_exp (pd.DataFrame) – Expression data whose index are genes and columns are samples.
df_pheno (pd.Series) – phenotypes whose index are samples.
pipelines (list) – list of MATTE.AlignPipe
verbose (bool, optional) – defaults to True
- Returns
Cluster Result
- Return type
Submodules#
MATTE.analysis#
- class MATTE.analysis.ClusterResult(cluster_res, before_cluster_df, df_exp, df_pheno, cluster_properties, order_rule='input')#
Bases: object
A class to store the result of clustering. Contains inputs and results.
- res#
The result of clustering, a pandas.DataFrame. Including each gene’s label in each pheno.
- df_exp#
The input of clustering, a pandas.DataFrame
- df_pheno#
The input of clustering, a pandas.Series
- cluster_properties#
The property of clustering, a dict containing the score, loss, and some clustering parameters.
- label#
All labels, a numpy.array
- n_cluster#
The number of cluster, an int
- JM#
The J-matrix of clustering, a numpy.array; can be calculated from res
- module_genes#
The genes of each module, a dict
Note
Only modules that are not “matched” are contained in this dict.
- GeneSNR(sample_feature, pheno=None)#
Use ClusterResult.ModuleSNR() to calculate the corrected SNR of each gene.
\(\mathrm{GeneSNR} = \mathrm{ModuleSNR} \times \mathrm{PearsonCorrelation}(\mathrm{GeneExpression}, \mathrm{ModuleEigenvector})\)
- Parameters
sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()
pheno (array-like, optional) – the label of samples, defaults to None (set to be ClusterResult.pheno)
- Returns
SNR, a pd.Series whose index is Module ID, and value is the SNR.
- Return type
_type_
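The GeneSNR formula above multiplies a module's SNR by the Pearson correlation between a gene's expression and the module eigenvector. A small worked sketch, with hypothetical function names and made-up numbers, shows how the correction changes sign and magnitude:

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def gene_snr(module_snr, gene_expression, module_eigenvector):
    # GeneSNR = ModuleSNR * PearsonCorrelation(GeneExpression, ModuleEigenvector)
    return module_snr * pearson(gene_expression, module_eigenvector)


eigen = [1.0, 2.0, 3.0, 4.0]
same = gene_snr(2.5, [2.0, 4.0, 6.0, 8.0], eigen)     # perfectly correlated gene
flipped = gene_snr(2.5, [8.0, 6.0, 4.0, 2.0], eigen)  # perfectly anti-correlated gene
print(same, flipped)
```

A gene that tracks the module eigenvector closely keeps almost the full module SNR, while a weakly correlated gene is scaled toward zero.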
- MCCorrFeature(model=None)#
Calculate group features, using PCA because it preserves more information than other methods.
- Parameters
model (object with fit and fit_transform, optional) – model used for decomposition, defaults to None
- Returns
group_feature: dict, keys : module id | values : eigenvector
group_weight: dict, keys : module id | values : lambda of model
- Return type
tuple
- MCFeature(model=None)#
Calculate group features, using PCA because it preserves more information than other methods.
Note
If n_components >= 2, more than one eigenvector can be kept for an MC so that the explained ratio is >= 80%.
- Parameters
df_exp (pd.DataFrame, optional) – defaults to None (ClusterResult.df_exp)
module_genes (dict, optional) – the mapping of modules to genes, defaults to None (the result of the function MATTE.analysis.ClusterResult._module_genes())
model (object, optional) – the model used to fit and transform the data; needs a fit_transform method, defaults to None (sklearn.decomposition.PCA(n_components=1))
- Returns
group_feature: dict, keys : module id | values : eigenvector
group_weight: dict, keys : module id | values : lambda of model
- Return type
tuple
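The note about keeping enough eigenvectors to reach an explained ratio of 80% can be illustrated with a small helper. The function name and the example ratios are hypothetical; the sketch only shows the cumulative-threshold rule, not how MATTE selects components internally.

```python
def components_needed(explained_ratios, threshold=0.8):
    """Return how many leading components reach the cumulative threshold."""
    total = 0.0
    for count, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= threshold:
            return count
    # Threshold never reached: keep every available component.
    return len(explained_ratios)


print(components_needed([0.55, 0.30, 0.10, 0.05]))
print(components_needed([0.90, 0.10]))
```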
- MCkwtest(sample_feature, pheno_series)#
Perform the Kruskal-Wallis test on each MC (tests whether each module’s eigenvector is different).
- Parameters
sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()
pheno_series (pd.Series) – specify samples’ phenotype
- Returns
kws: a pd.Series whose index is Module ID, and value is the p-value.
- Return type
pd.Series
- ModuleSNR(sample_feature, pheno=None)#
Calculate the SNR of each module.
- Parameters
sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()
pheno (array-like, optional) – the label of samples, defaults to None (set to be ClusterResult.pheno)
- Returns
SNR, a pd.Series whose index is Module ID, and value is the SNR.
- Return type
pd.Series
- PhenoMCCorr(sample_feature, pheno_series=None, pheno=None)#
Calculate the correlation between the phenotype and each MC’s eigenvector, test the correlation, and return the p-value.
- Parameters
sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function ClusterResult.SampleFeature()
pheno_series (pd.Series) – specify samples’ phenotype
pheno (any, optional) – choose which pheno to calculate, defaults to None (the first one of pheno_series)
- Returns
MC_corr: a pd.Series whose index is Module ID, and value is the Pearson correlation.
corr_p: p-value of the correlation (Student’s t-test)
- Return type
tuple
- SampleFeature(corr=False, **kwargs)#
Calculate samples’ module features (average by default).
- Parameters
corr (bool, optional) – if True, calculate correlation between samples, defaults to False
- Returns
depends on the argument return_df
- Return type
samples_feature is a WeightedDataFrame object, whose index is samples id and columns is module id. Module Weights are calculated by function MCFeature.
- Vis_Jmat()#
Plot the J matrix as a heatmap.
- Returns
figure
- Return type
matplotlib.figure.Figure
- summary(fig=True)#
Summary of cluster results: print ClusterResult.cluster_properties and plot the J matrix.
- Parameters
fig (bool, optional) – whether to print figures or not, defaults to True
- Returns
summary of cluster results
- Return type
figures or None
- MATTE.analysis.Fig_Fuction(df_enrich, color_columns, cmap=<matplotlib.colors.ListedColormap object>, width_height_ratio=2, **figargs)#
Plot the function enrichment result.
Note
Filter the result first!
- Parameters
df_enrich (pd.DataFrame) – the result of FunctionEnrich()
color_columns (str) – column shown as color, ‘fdr’ or ‘p_value’
cmap (object, optional) – chosen from matplotlib.cm, defaults to cm.Set1
width_height_ratio (int, optional) – defaults to 2
- Returns
the figure
- Return type
plt.figure
- MATTE.analysis.Fig_SampleFeature(sample_feature, labels, color=None, model=None, weighted_distcance=False, metric='euclidean', **fig_args)#
Plot samples represented by MCs.
- Parameters
sample_feature (WeightedDataFrame or pd.DataFrame) – index : samples’ id | columns : modules’ id. The result of ClusterResult.SampleFeature(), or a table that uses some genes to explain each sample.
labels (pd.Series) – index : samples’ id | columns : phenotype
color (array like, optional) – the second label, shown as color; the first label is shown as marker shape. Defaults to None
model (object, optional) – the model used to present samples in two dimensions, defaults to None (sklearn.decomposition.PCA(n_components=2))
weighted_distcance (bool, optional) – whether to use weighted distance. If sample_feature is a pd.DataFrame, it should be set to False. Defaults to False
metric (str, optional) – used to calculate the distance, defaults to ‘euclidean’
- Returns
the figure
- Return type
plt.figure
- MATTE.analysis.FunctionEnrich(annote_file, gene_set, category_seperate_cal=True)#
perform GO enrichment
- Parameters
annote_file (str or pd.DataFrame) –
Function annotation file path or dataframe
Note
Tab-separated file with columns: [“Term_ID”, “GeneID”, “Term”, “Category”]. The file can be downloaded from https://ftp.ncbi.nih.gov/gene/DATA/
gene_set (array-like) – gene set to analyze
category_seperate_cal (bool, optional) – whether to calculate FDR within each category separately or with all categories mixed, defaults to True
- Returns
items, category, enriched item number, background item number, p value, fdr
- Return type
pd.DataFrame
- class MATTE.analysis.WeightedDataFrame(weight=None, data=None, index=None, columns=None, dtype=None, copy=None)#
Bases: DataFrame
A weighted dataframe that inherits from pandas.DataFrame and adds a new attribute, weight, to store the weight of each column.
- weight_distance(metric='c', **kwargs)#
Calculate weighted distance.
- Parameters
metric (str, optional) – The distance metric to use., defaults to ‘c’
Note
The same metric can give very different results depending on which module it is called from, because the definitions are not the same.
The distance function can be:
— calling from scipy.spatial.distance.pdist —
‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.
— calling from Bio.Cluster.distancematrix —
‘e’: Euclidean; ‘b’: City-block; ‘c’: Pearson correlation; ‘a’: Absolute of Pearson correlation; ‘u’: Uncentered Pearson correlation; ‘x’: Absolute of Uncentered Pearson correlation; ‘s’: Spearman’s correlation; ‘k’: Kendall’s τ.
- Returns
distance matrix
- Return type
numpy.array
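As an illustration of what a column-weighted distance can mean, the sketch below computes a weighted Euclidean distance matrix between rows, with one weight per column. The function names and data are hypothetical, and the formula sqrt(sum(w * (u - v)^2)) is an assumption chosen for the example; the source does not say how weight_distance combines weights with each metric.

```python
import math


def weighted_euclidean(u, v, weights):
    # Each column's squared difference is scaled by that column's weight.
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


def weighted_distance_matrix(rows, weights):
    n = len(rows)
    return [
        [weighted_euclidean(rows[i], rows[j], weights) for j in range(n)]
        for i in range(n)
    ]


rows = [[0.0, 0.0], [3.0, 4.0]]
unweighted = weighted_distance_matrix(rows, [1.0, 1.0])
first_only = weighted_distance_matrix(rows, [1.0, 0.0])  # second column ignored
print(unweighted[0][1], first_only[0][1])
```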
MATTE.cluster#
- class MATTE.cluster.CrossCluster(presetting='kmeans', use_affinity=False, verbose=True, **kwargs)#
Bases: object
Cross clustering of a mixed expression matrix. If no preset is given, CrossCluster.build_from_func() or CrossCluster.build_from_model() must be used to set the clustering function.
- build_from_func(func)#
Build this class from a function.
- Parameters
func (function) – function used to cluster
- build_from_model(model, model_attr='labels_', **calling_kwargs)#
build this class from model
- Parameters
model (object) – model to cluster, must have model_attr attribute and fit method
model_attr (str, optional) – defaults to ‘labels_’
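The requirement that a model has a fit method and a model_attr attribute can be sketched with a toy model. ThresholdModel and cluster_func_from_model are hypothetical names for this example, and forwarding calling_kwargs to fit is an assumption; the source does not describe how CrossCluster uses them.

```python
class ThresholdModel:
    """Toy model: fit() stores labels_ by thresholding each row's first value."""

    def __init__(self, cutoff=0.0):
        self.cutoff = cutoff

    def fit(self, data):
        self.labels_ = [int(row[0] > self.cutoff) for row in data]
        return self


def cluster_func_from_model(model, model_attr="labels_", **calling_kwargs):
    if not hasattr(model, "fit"):
        raise TypeError("model must provide a fit method")

    def cluster(data):
        # Assumption: extra keyword arguments are forwarded to fit().
        model.fit(data, **calling_kwargs)
        return getattr(model, model_attr)

    return cluster


cluster = cluster_func_from_model(ThresholdModel(cutoff=1.0))
labels = cluster([[0.5, 2.0], [1.5, 0.1], [3.0, 0.0]])
print(labels)
```

Any object following the same fit-then-read-attribute convention, such as a scikit-learn clusterer exposing labels_, fits this pattern.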
- preset()#
preset the function to cluster
- Raises
NotImplementedError – preset is not implemented
- preset_kmeans()#
preset kmeans, default kwargs: n_clusters=8, method=‘a’, npass=20
- preset_spectral_bicluster()#
preset spectral bicluster, default kwargs: n_clusters=8, use_aff=True, n_init=10, method=‘log’, n_component=m_cluster
- preset_spectrum()#
preset spectral clustering, default kwargs: n_clusters=8, use_aff=True, n_init=10
- MATTE.cluster.Cross_Distance(before_cluster_df, metric='euclidean', weights=None, kwargs={}, verbose=True)#
calculate the distance between each cluster and other cluster
- Parameters
before_cluster_df (pandas.DataFrame or np.array) – dataframe to cluster
metric (str, optional) – distance metric (implemented by the scipy package; more information can be found in scipy’s documentation), defaults to ‘euclidean’
weights (array-like, optional) – weights used when calculating distance, defaults to None
kwargs (dict, optional) – other keyword args passed to the cdist function, defaults to {}
verbose (bool, optional) – defaults to True
- Returns
distance matrix
- Return type
numpy.array
- MATTE.cluster.build_results(cluster_label, cluster_properties, df_exp, df_pheno, before_cluster_df, order_rule='input', verbose=True)#
Build the results of CrossCluster into a MATTE.analysis.ClusterResult. Decorated by MATTE.utils.kw_decorator().
- Parameters
cluster_label (array like) – label of clustering
mixed_genes (array like) – index of mixed genes(in the order of cluster_label)
cluster_properties (dict) – information get from clustering
df_exp (pandas.DataFrame) – inputs to all pipeline
df_pheno (pandas.Series) – inputs to all pipeline
before_cluster_df (pandas.DataFrame) – inputs to CrossCluster
order_rule (str, optional) – how to order modules: ‘input’ (mean exp), ‘size’ (module genes), or a function; defaults to “input”
verbose (bool, optional) – defaults to True
- Returns
results
- Return type
MATTE.preprocess#
- MATTE.preprocess.CorrKernel_Transform(df_exp, df_pheno, n_components=16)#
Correlation Kernel Transform. This function is decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
n_components (int, optional) – number of components used in spectral embedding, defaults to 16
- Returns
kernel matrix
- Return type
dict { key:’before_cluster_df’ ; value:numpy.array}
- MATTE.preprocess.LocKernel_Transform(df_exp, df_pheno, kernel_type, centering_kernel=True, outer_subtract_absolute=True, double_centering=True, verbose=True)#
Kernel transformation, an important preprocessing step for cross clustering. In this step, genes from different phenotypes are regarded as different genes, and the distance between them is computed. This function is decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
kernel_type (str or function) – Kernel type, one of ‘mean’ or ‘median’; functions are also allowed.
centering_kernel (bool, optional) – whether to center the kernel, defaults to True
outer_subtract_absolute (bool, optional) – whether to use the absolute value in the outer subtraction, defaults to True
double_centering (bool, optional) – whether to double-center the kernel matrix, defaults to True
verbose (bool, optional) – defaults to True
- Returns
kernel matrix and mixed genes
- Return type
dict
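For readers unfamiliar with the double_centering option, the standard double-centering operation subtracts row means and column means from a matrix and adds back the grand mean, so that every row and column sums to zero. The sketch below shows that textbook operation on a made-up matrix; whether MATTE applies exactly this formula is an assumption.

```python
def double_center(matrix):
    """Subtract row and column means, then add back the grand mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [
        sum(matrix[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    grand = sum(row_means) / n_rows
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand for j in range(n_cols)]
        for i in range(n_rows)
    ]


kernel = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
centered = double_center(kernel)
for row in centered:
    print(row)
```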
- MATTE.preprocess.RDC_Transform(df_exp, df_pheno, **kwargs)#
RDC transform.
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
- Returns
RDC transformed dataframe.
- Return type
numpy.array
- MATTE.preprocess.RDE_Transform(df_exp, df_pheno, kernel_type, absolute=True)#
RDE transform.
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
kernel_type (str) – kernel type; should be one of ‘mean’ and ‘median’.
absolute (bool, optional) – calculate outer subtract absolute or not, defaults to True
- Returns
RDE transformed dataframe.
- Return type
numpy.array
- MATTE.preprocess.RPKM2TPM(df_exp)#
Convert RPKM to TPM. Decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
- Returns
TPM dataframe.
- Return type
dict
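The usual RPKM-to-TPM conversion rescales each sample so that its values sum to one million: TPM = RPKM / sum(RPKM) * 1e6. A minimal sketch on plain dicts follows; the function name and data layout are hypothetical and chosen only to keep the example free of dependencies.

```python
def rpkm_to_tpm(rpkm_by_sample):
    """rpkm_by_sample: dict mapping sample -> {gene: RPKM value}."""
    tpm = {}
    for sample, values in rpkm_by_sample.items():
        total = sum(values.values())
        # Rescale so that each sample's values sum to one million.
        tpm[sample] = {gene: v / total * 1e6 for gene, v in values.items()}
    return tpm


rpkm = {
    "s1": {"geneA": 10.0, "geneB": 30.0},
    "s2": {"geneA": 5.0, "geneB": 5.0},
}
tpm = rpkm_to_tpm(rpkm)
print(tpm["s1"])
```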
- MATTE.preprocess.expr_filter(df_exp, df_pheno, gene_filter=None, filter_args: dict = {})#
Filter the genes by rule. Decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
gene_filter (str or function, optional) – gene filter rule. None deletes genes with extremely low expression; ‘f’ filters genes by ANOVA F value (p=0.05 by default); a function is also allowed. Defaults to None
filter_args (dict, optional) – other args sent to gene filter function, defaults to {}
- Returns
filtered dataframe.
- Return type
dict
- MATTE.preprocess.inputs_check(df_exp, df_pheno, **kwargs)#
Check whether the inputs are correct. Decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
- Raises
ValueError – If the input dataframe is not correct.
- Returns
checking status
- Return type
dict {mixed_genes:’OK’}
- MATTE.preprocess.log2transform(df_exp)#
Log2 transform the expression dataframe. Decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
- Returns
Log2 transformed dataframe.
- Return type
dict
- MATTE.preprocess.normalization(df_exp, norm='l1')#
Normalize the expression dataframe. Decorated by MATTE.utils.kw_decorator().
- Parameters
df_exp (pandas.DataFrame) – Expression dataframe.
norm (str, optional) – normalization type, should be one of ‘l1’, ‘l2’ and ‘standard’, defaults to ‘l1’
- Returns
Normalized dataframe.
- Return type
dict
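The three norm options can be illustrated on a single vector. This is a sketch under assumptions: ‘l1’ and ‘l2’ are taken to mean division by the L1 and L2 norms, and ‘standard’ is taken to mean z-scoring; the axis along which MATTE normalizes is not stated in the source, and the function name is hypothetical.

```python
import math


def normalize(values, norm="l1"):
    if norm == "l1":
        scale = sum(abs(v) for v in values)
        return [v / scale for v in values]
    if norm == "l2":
        scale = math.sqrt(sum(v * v for v in values))
        return [v / scale for v in values]
    if norm == "standard":
        # Assumed meaning: zero mean and unit (population) standard deviation.
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return [(v - mean) / std for v in values]
    raise ValueError(f"unknown norm: {norm!r}")


l1 = normalize([1.0, 1.0, 2.0], "l1")
l2 = normalize([3.0, 4.0], "l2")
z = normalize([1.0, 2.0, 3.0], "standard")
print(l1, l2, z)
```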
MATTE.utils#
- MATTE.utils.affinity_matrix(data, dist_type, type='distance', **kwargs)#
Calculate affinity matrix; implementations from both Bio.Cluster and scipy.spatial.distance are supported.
- Parameters
data (numpy.array or pandas.DataFrame) – data to calculate affinity matrix
dist_type (str) – string of distance type
type (str, optional) – distance or affinity, defaults to “distance”
- Returns
matrix
- Return type
np.array
- MATTE.utils.kw_decorator(kw=None)#
Important decorator for functions used in the pipeline. It allows a function to be called with more keyword arguments than it accepts itself, and makes the function return a dict whose keys are set by kw.
- Parameters
kw (str or list, optional) – keywords used as the keys of the function’s returned dict, defaults to None
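The behaviour described for kw_decorator (ignore surplus keyword arguments, wrap the return value in a dict keyed by kw) can be approximated with the standard library. kw_sketch below is a hypothetical re-creation for illustration only; how the real decorator handles kw=None, positional arguments, or list-valued kw is assumed here.

```python
import functools
import inspect


def kw_sketch(kw=None):
    """Hypothetical re-creation of the described behaviour; not MATTE's code."""

    def decorator(func):
        accepted = set(inspect.signature(func).parameters)

        @functools.wraps(func)
        def wrapper(**kwargs):
            # Drop keyword arguments the wrapped function does not declare.
            result = func(**{k: v for k, v in kwargs.items() if k in accepted})
            if kw is None:
                return result
            if isinstance(kw, str):
                return {kw: result}
            # A list of keys pairs up with a tuple or list of returned values.
            return dict(zip(kw, result))

        return wrapper

    return decorator


@kw_sketch("df_exp")
def halve(df_exp):
    return [v / 2 for v in df_exp]


@kw_sketch(["total", "count"])
def summarize(df_exp):
    return sum(df_exp), len(df_exp)


out = halve(df_exp=[2.0, 4.0], df_pheno=["a", "b"])  # df_pheno is ignored
stats = summarize(df_exp=[2.0, 4.0], unused=True)
print(out, stats)
```

Returning dicts with known keys is what lets consecutive pipeline steps pass results to each other by name.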
- MATTE.utils.printv(*text, show_time=True, verbose=True)#
Prints text with time and verbose.
- Parameters
show_time (bool, optional) – defaults to True
verbose (bool, optional) – defaults to True