API#

Core class#

class MATTE.AlignPipe(stats_type='mean', target='cluster', preprocess=True)#

Bases: object

The basic class in MATTE. It stores a list of functions or transformers. The pipeline can be inspected through __str__().

The functions in funcs are called in order, and the results of each are passed to the next function. Every function should return a dict (decorating it with MATTE.utils.kw_decorator() is recommended). Finally, cluster_func is called to cluster the results.

Note

Use MATTE.AlignPipe.add_step() to add a function to funcs. The order of the functions can also be changed, or functions deleted, as with a list.
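The calling convention described above can be sketched in plain Python. This is a minimal illustration, not MATTE's implementation: run_steps, add_one, double and cluster are hypothetical names, and the rule that each step receives only the keyword arguments it declares is an assumption.

```python
import inspect


def run_steps(funcs, cluster_func, **state):
    """Call each step in order, merging its returned dict into a shared state."""
    for func in funcs:
        accepted = inspect.signature(func).parameters
        # Assumption: a step only receives the state entries it declares.
        kwargs = {k: v for k, v in state.items() if k in accepted}
        result = func(**kwargs)
        if not isinstance(result, dict):
            raise TypeError(f"{func.__name__} must return a dict")
        state.update(result)
    accepted = inspect.signature(cluster_func).parameters
    return cluster_func(**{k: v for k, v in state.items() if k in accepted})


def add_one(values):
    return {"values": [v + 1 for v in values]}


def double(values):
    return {"values": [v * 2 for v in values]}


def cluster(values):
    # Toy "clustering": label each value by whether it exceeds the mean.
    mean = sum(values) / len(values)
    return [int(v > mean) for v in values]


steps = [add_one, double]
labels = run_steps(steps, cluster, values=[1, 2, 3, 10])

# The steps live in an ordinary list, so they can be reordered or removed.
steps.reverse()
labels_reordered = run_steps(steps, cluster, values=[1, 2, 3, 10])
```

Keeping the intermediate results in one dict means a step can be inserted or removed without changing the signatures of its neighbours.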

add_param(**kwargs)#: Add a function that returns parameters to funcs. Multiple parameters can be added at the same time.

add_step(func, *setting, **kwsetting)#

Add a function to funcs

Parameters: func (function) – function to add (should be decorated by kw_decorator())

add_transformer(transformer)#

Add a transformer to funcs

Parameters: transformer (object that has fit and transform methods) – transformer to add
Raises: TypeError – if transformer is not a transformer

calculate(df_exp, df_pheno, verbose=True)#

Calculate the data using the pipeline.

Parameters

df_exp (pandas.DataFrame) – expression data whose index are genes and columns are samples
df_pheno (pandas.Series) – phenotype data whose index are samples
verbose (bool, optional) – defaults to True

Returns

clustering results

Return type

MATTE.analysis.ClusterResult

calculate_from_temp(tmpt_result, verbose=True)#

Calculate using the pipeline, starting from a saved temporary result instead of the raw data.

Parameters

tmpt_result (dict) – temp result saved by MATTE.AlignPipe.fit_transform()
verbose (bool, optional) – defaults to True

Returns

clustering results

Return type

MATTE.analysis.ClusterResult

find_best_KernelTrans_params(df_exp, df_pheno, n_downsample=None, n_iters=None, inplace=True, verbose=True)#

Find the best parameters for Kernel_Transform according to the clustering error.

Parameters

df_exp (pandas.DataFrame) – expression data whose index are genes and columns are samples
df_pheno (pandas.Series) – phenotype data whose index are samples
n_downsample (int, optional) – number used for downsampling, defaults to None
n_iters (int, optional) – number of iterations, defaults to None
inplace (bool, optional) – whether to modify the pipeline in place, defaults to True
verbose (bool, optional) – defaults to True

Returns

best parameters or AlignPipe

Return type

AlignPipe or dict

get_attribute_from_transformer(attribute)#

Get the attribute from the transformer.

Parameters: attribute (str) – attribute name
Returns: attribute value
Return type: any

init_pipeline(stats_type, target, preprocess: bool)#

Initialize the pipeline.

Parameters

stats_type (str) – type of statistics
target (str) – target name
preprocess (bool) – whether to preprocess the data

set_cluster_method(func, *setting, **kwsetting)#

Add a function to cluster_func

Parameters: func (function) – function to add (should be decorated by kw_decorator())

class MATTE.GeneRanker(view, pipeline=None)#

Bases: object

MATTE GeneRanker.

There are several types of GeneRanker:

1. ‘module’ and ‘gene’: genes are clustered according to their expression, and the module SNR is used to rank genes. In ‘gene’ mode, the SNR is corrected by the correlation between gene expression and the module eigenvector.
2. ‘dist’ and ‘cross-dist’: in ‘dist’ mode, the distances between genes are calculated, and genes are ranked by the sum of their distances to all other genes. In ‘cross-dist’ mode, the distances of differential expression and differential co-expression are merged.

Note

The inputs of GeneRanker are not oriented the same way as those of the pipeline: rows of the expression data are samples and columns are genes.

gene_rank(X, y, verbose=True, **kwargs)#

Rank genes.

Parameters

X (pd.DataFrame) – Expression data whose index are samples and columns are genes.
y (pd.Series) – phenotypes whose index are samples.
verbose (bool, optional) – defaults to True

Returns

gene ranking score

Return type

pd.Series

gene_rank_dist(X, y, verbose=True, **kwargs)#

Rank genes by distance in each phenotype pair. Running this method sets GeneRanker.dist_mat.

Parameters

X (pd.DataFrame) – Expression data whose index are samples and columns are genes.
y (pd.Series) – phenotypes whose index are samples.
verbose (bool, optional) – defaults to True

Returns

gene ranking score

Return type

pd.Series

gene_rank_module(X: DataFrame, y, verbose=True, **kwargs)#

Rank genes by their MATTE.analysis.ClusterResult.ModuleSNR() in each phenotype pair. The gene scores within one phenotype sum to 1000. This method also sets GeneRanker.gene_ranking_sep, which holds the score calculated for each phenotype.

Parameters

X (pd.DataFrame) – Expression data whose index are samples and columns are genes.
y (pd.Series) – phenotypes whose index are samples.
verbose (bool, optional) – defaults to True

Returns

gene ranking score

Return type

pd.Series

pipeline_clustering(X, y, verbose=True)#

Cluster the data using the pipeline for each pair of phenotypes. A progress bar informs the user about the progress.

Parameters

X (pandas.DataFrame) – expression data whose index are samples and columns are genes
y (pandas.Series) – phenotype data whose index are samples
verbose (bool, optional) – defaults to True

save(save_path)#

Save the embedder using dill.

Parameters: save_path (str) – path to save the embedder

class MATTE.PipeFunc(func, name=None, *args, **kwargs)#

Bases: object

PipeFunc is a wrapper around a function that stores its positional and keyword arguments but does not run the function until the PipeFunc itself is called. It is used in AlignPipe as a step.

add_params(*args, **kwargs)#: Add parameters to the PipeFunc and refresh its string representation.
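The idea of a step that stores its arguments and runs only when called can be sketched as follows. DeferredCall is a hypothetical stand-in written for illustration; how PipeFunc merges stored and call-time arguments internally is an assumption here.

```python
class DeferredCall:
    """Store a function with its arguments and run it only when called."""

    def __init__(self, func, name=None, *args, **kwargs):
        self.func = func
        self.name = name or func.__name__
        self.args = list(args)
        self.kwargs = dict(kwargs)

    def add_params(self, *args, **kwargs):
        # Later parameters extend the stored ones; keywords overwrite duplicates.
        self.args.extend(args)
        self.kwargs.update(kwargs)

    def __call__(self, *call_args, **call_kwargs):
        # Assumption: call-time keywords take precedence over stored ones.
        merged = {**self.kwargs, **call_kwargs}
        return self.func(*call_args, *self.args, **merged)

    def __str__(self):
        params = [repr(a) for a in self.args]
        params += [f"{k}={v!r}" for k, v in self.kwargs.items()]
        return f"{self.name}({', '.join(params)})"


def scale(values, factor=1):
    return [v * factor for v in values]


step = DeferredCall(scale, "scale_step", factor=2)
text_before = str(step)    # built from the stored parameters
step.add_params(factor=3)  # nothing has been executed yet
text_after = str(step)
result = step([1, 2])      # the function runs only here
```

Because the text is built from the stored parameters, printing a step after add_params reflects the updated settings.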

MATTE.merged_pipeline_clustering(df_exp: DataFrame, df_pheno: Series, pipelines: list, verbose=True)#

Cluster genes using multiple pipelines.

Parameters

df_exp (pd.DataFrame) – Expression data whose index are genes and columns are samples.
df_pheno (pd.Series) – phenotypes whose index are samples.
pipelines (list) – list of MATTE.AlignPipe
verbose (bool, optional) – defaults to True

Returns

Cluster Result

Return type

MATTE.analysis.ClusterResult

Submodules#

MATTE.analysis#

class MATTE.analysis.ClusterResult(cluster_res, before_cluster_df, df_exp, df_pheno, cluster_properties, order_rule='input')#

Bases: object

A class that stores the result of clustering. It contains both the inputs and the results.

res#

The result of clustering, a pandas.DataFrame that includes each gene’s label in each phenotype.

df_exp#

The input of clustering, a pandas.DataFrame

df_pheno#

The input of clustering, a pandas.Series

cluster_properties#

The property of clustering, a dict containing the score, loss, and some clustering parameters.

label#

All labels, a numpy.array

n_cluster#

The number of clusters, an int

JM#

The J-matrix of clustering, a numpy.array that can be calculated from res

module_genes#

The genes of each module, a dict

Note

Only modules that are not “matched” are contained in this dict.

GeneSNR(sample_feature, pheno=None)#

Use ClusterResult.ModuleSNR() to calculate the corrected SNR of each gene.

\(\mathrm{GeneSNR} = \mathrm{ModuleSNR} \times \mathrm{PearsonCorrelation}(\mathrm{GeneExpression}, \mathrm{ModuleEigenvector})\)

Parameters

sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()
pheno (array-like, optional) – the label of samples, defaults to None (set to be ClusterResult.pheno)

Returns

SNR, a pd.Series whose index is Module ID, and value is the SNR.

Return type

pd.Series
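The GeneSNR formula above can be worked through with a small stand-alone sketch. The helper names (pearson, gene_snr) and all numbers are made up for illustration, and the module SNR is taken as a given value because its definition is not stated here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def gene_snr(module_snr, gene_expression, module_eigenvector):
    # GeneSNR = ModuleSNR * PearsonCorrelation(GeneExpression, ModuleEigenvector)
    return module_snr * pearson(gene_expression, module_eigenvector)


module_snr = 2.0                        # made-up module score
eigenvector = [0.1, 0.2, 0.3, 0.4]      # made-up module eigenvector over 4 samples
follows_module = [1.0, 2.0, 3.0, 4.0]   # gene perfectly correlated with the module
opposes_module = [4.0, 3.0, 2.0, 1.0]   # gene perfectly anti-correlated with it

snr_follow = gene_snr(module_snr, follows_module, eigenvector)
snr_oppose = gene_snr(module_snr, opposes_module, eigenvector)
```

A gene that tracks its module eigenvector keeps the full module SNR, while a weakly correlated gene is scaled towards zero.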

MCCorrFeature(model=None)#

Calculate group features. PCA is used by default because it preserves more information than other methods.

Parameters

model (object with fit and fit_transform, optional) – model used for decomposition, defaults to None

Returns

group_feature: dict,keys : module id | values : eigenvector

group_weight: dict,keys : module id | values : lambda of model

Return type

tuple

MCFeature(model=None)#

Calculate group features. PCA is used by default because it preserves more information than other methods.

Note

If n_components >= 2, more than one eigenvector can be retained per MC so that the explained ratio is >= 80%.

Parameters

df_exp (pd.DataFrame, optional) – defaults to None, ClusterResult.df_exp
module_genes (dict, optional) – the mapping of modules to genes, defaults to None (the result of the function MATTE.analysis.ClusterResult._module_genes())
model (object, optional) – the model used to fit and transform the data; it needs a fit_transform method, defaults to None (sklearn.decomposition.PCA(n_components=1))

Returns

group_feature: dict,keys : module id | values : eigenvector, group_weight: dict,keys : module id | values : lambda of model

Return type

tuple

MCkwtest(sample_feature, pheno_series)#

Run the Kruskal-Wallis test on each MC, testing whether each module’s eigenvector differs.

Parameters

sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()
pheno_series (pd.Series) – specify samples’ phenotype

Returns

kws: a pd.Series whose index is Module ID, and value is the p-value.

Return type

pd.Series

ModuleSNR(sample_feature, pheno=None)#

Calculate the SNR of each module.

Parameters

sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()
pheno (array-like, optional) – the label of samples, defaults to None (set to be ClusterResult.pheno)

Returns

SNR, a pd.Series whose index is Module ID, and value is the SNR.

Return type

pd.Series

PhenoMCCorr(sample_feature, pheno_series=None, pheno=None)#

Calculate the correlation between the phenotype and each MC’s eigenvector, test the correlation, and return the p-value.

Parameters

sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function ClusterResult.SampleFeature()
pheno_series (pd.Series) – specify samples’ phenotype
pheno (any, optional) – which phenotype to calculate for, defaults to None (the first one in pheno_series)

Returns

MC_corr: a pd.Series whose index is Module ID and whose value is the Pearson correlation. corr_p: p-value of the correlation (Student’s t-test)

Return type

tuple

SampleFeature(corr=False, **kwargs)#

Calculate each sample’s module features (average by default).

Parameters: corr (bool, optional) – if True, calculate correlation between samples, defaults to False
Returns: depends on the argument return_df
Return type: samples_feature is a WeightedDataFrame object whose index is sample IDs and whose columns are module IDs. Module weights are calculated by the function MCFeature.

Vis_Jmat()#

Plot the J matrix as a heatmap.

Returns: figure
Return type: matplotlib.figure.Figure

summary(fig=True)#

Summarize the cluster results: print ClusterResult.cluster_properties and plot the J matrix.

Parameters: fig (bool, optional) – whether to print figures, defaults to True
Returns: summary of cluster results
Return type: figures or None

MATTE.analysis.Fig_Fuction(df_enrich, color_columns, cmap=<matplotlib.colors.ListedColormap object>, width_height_ratio=2, **figargs)#

Plot the function enrichment result.

Note

Filter the result first!

Parameters

df_enrich (pd.DataFrame) – the result of FunctionEnrich()
color_columns (str) – the column shown as color, ‘fdr’ or ‘p_value’
cmap (object, optional) – chosen from matplotlib.cm, defaults to cm.Set1
width_height_ratio (int, optional) – defaults to 2

Returns

the figure

Return type

plt.figure

MATTE.analysis.Fig_SampleFeature(sample_feature, labels, color=None, model=None, weighted_distcance=False, metric='euclidean', **fig_args)#

Plot the samples as represented by MCs.

Parameters

sample_feature (WeightedDataFrame or pd.DataFrame) – index: sample IDs | columns: module IDs. Either the result of ClusterResult.SampleFeature() or a table that uses some genes to describe each sample.
labels (pd.Series) – index: sample IDs | values: phenotype
color (array like, optional) – a second label, shown as color; the first label is shown as marker shape. Defaults to None
model (object, optional) – the model used to project samples into two dimensions, defaults to None (sklearn.decomposition.PCA(n_components=2))
weighted_distcance (bool, optional) – whether to use weighted distance. If sample_feature is a pd.DataFrame, this should be set to False. Defaults to False
metric (str, optional) – metric used to calculate the distance, defaults to ‘euclidean’

Returns

the figure

Return type

plt.figure

MATTE.analysis.FunctionEnrich(annote_file, gene_set, category_seperate_cal=True)#

Perform GO enrichment.

Parameters

annote_file (str or pd.DataFrame) – path to the function annotation file, or a dataframe

Note

Tab-separated file with columns [“Term_ID”, “GeneID”, “Term”, “Category”]. The file can be downloaded from https://ftp.ncbi.nih.gov/gene/DATA/

gene_set (array-like) – gene set to analyze
category_seperate_cal (bool, optional) – whether to calculate the FDR within each category separately or across all categories mixed, defaults to True

Returns

items, category, enriched item number, background item number, p value, fdr

Return type

pd.DataFrame
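The p value and fdr columns can be illustrated with a common enrichment recipe: a one-sided hypergeometric test followed by Benjamini-Hochberg correction. This is a sketch under the assumption that the enrichment works this way; the source does not state which test or correction FunctionEnrich uses, and hypergeom_pvalue and benjamini_hochberg are hypothetical helper names.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up numbers: 10 background genes, 4 annotated with a term,
# and a gene set of 3 genes that all carry the term.
p_enriched = hypergeom_pvalue(k=3, n=3, K=4, N=10)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Correcting within each category, as category_seperate_cal=True suggests, would amount to calling the correction once per category instead of once over all terms.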

class MATTE.analysis.WeightedDataFrame(weight=None, data=None, index=None, columns=None, dtype=None, copy=None)#

Bases: DataFrame

A weighted dataframe that inherits from pandas.DataFrame and adds a new attribute, weight, to store the weight of each column.

weight_distance(metric='c', **kwargs)#

Calculate weighted distance.

Parameters: metric (str, optional) – The distance metric to use., defaults to ‘c’

Note

Results from the two modules can be very different, because their definitions of the metrics are not the same.

The distance function can be :

Metrics from scipy.spatial.distance.pdist:

‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

Metrics from Bio.Cluster.distancematrix:

‘e’: Euclidean; ‘b’: City-block; ‘c’: Pearson correlation; ‘a’: Absolute of Pearson correlation; ‘u’: Uncentered Pearson correlation; ‘x’: Absolute of Uncentered Pearson correlation; ‘s’: Spearman’s correlation; ‘k’: Kendall’s τ.

Returns: distance matrix
Return type: numpy.array
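What a column-weighted distance means can be sketched for the Euclidean case. This is an illustration only: the formula below is the usual weighted Euclidean distance, weighted_euclidean_matrix is a hypothetical helper, and the exact weighting used for each metric inside weight_distance is not stated in this reference.

```python
import math


def weighted_euclidean_matrix(rows, weights):
    """Pairwise distances sqrt(sum_j w_j * (x_j - y_j)**2) between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum(w * (a - b) ** 2
                              for w, a, b in zip(weights, rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d
    return dist


samples = [[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]]  # rows: samples, columns: modules

equal_weights = weighted_euclidean_matrix(samples, [1.0, 1.0])
first_column_only = weighted_euclidean_matrix(samples, [1.0, 0.0])
```

With a weight of zero on the second column, samples that differ only in that column end up at distance zero.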

MATTE.cluster#

class MATTE.cluster.CrossCluster(presetting='kmeans', use_affinity=False, verbose=True, **kwargs)#

Bases: object

Cross clustering of a mixed expression matrix. If no preset is used, CrossCluster.build_from_func() or CrossCluster.build_from_model() must be called to set the clustering function.

build_from_func(func)#

Build this class from a function.

Parameters: func (function) – function used to cluster

build_from_model(model, model_attr='labels_', **calling_kwargs)#

Build this class from a model.

Parameters

model (object) – model to cluster, must have model_attr attribute and fit method
model_attr (str, optional) – defaults to ‘labels_’

preset()#

Preset the function used to cluster.

Raises: NotImplementedError – preset is not implemented

preset_kmeans()#: preset kmeans, default kwargs: n_clusters=8, method=’a’, npass=20

preset_spectral_bicluster()#: preset spectral bicluster, default kwargs: n_clusters=8, use_aff=True, n_init=10, method=’log’, n_component=m_cluster

preset_spectrum()#: preset spectral clustering, default kwargs: n_clusters=8, use_aff=True, n_init=10

MATTE.cluster.Cross_Distance(before_cluster_df, metric='euclidean', weights=None, kwargs={}, verbose=True)#

Calculate the distance between each cluster and every other cluster.

Parameters

before_cluster_df (pandas.DataFrame or np.array) – dataframe to cluster
metric (str, optional) – distance metric (implemented by the scipy package; more information can be found in scipy’s documentation), defaults to ‘euclidean’
weights (array-like, optional) – weights used when calculating distance, defaults to None
kwargs (dict, optional) – other keyword arguments passed to the cdist function, defaults to {}
verbose (bool, optional) – defaults to True

Returns

distance matrix

Return type

numpy.array

MATTE.cluster.build_results(cluster_label, cluster_properties, df_exp, df_pheno, before_cluster_df, order_rule='input', verbose=True)#

Build the results of CrossCluster into a MATTE.analysis.ClusterResult. Decorated by MATTE.utils.kw_decorator().

Parameters

cluster_label (array like) – label of clustering
mixed_genes (array like) – index of mixed genes(in the order of cluster_label)
cluster_properties (dict) – information obtained from clustering
df_exp (pandas.DataFrame) – input to the whole pipeline
df_pheno (pandas.Series) – input to the whole pipeline
before_cluster_df (pandas.DataFrame) – input to CrossCluster
order_rule (str, optional) – how to order modules: ‘input’ (mean exp), ‘size’ (module genes), or a function; defaults to “input”
verbose (bool, optional) – defaults to True

Returns

results

Return type

MATTE.analysis.ClusterResult

MATTE.preprocess#

MATTE.preprocess.CorrKernel_Transform(df_exp, df_pheno, n_components=16)#

Correlation Kernel Transform. This function is decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
n_components (int, optional) – number of components used in spectral embedding, defaults to 16

Returns

kernel matrix

Return type

dict { key:’before_cluster_df’ ; value:numpy.array}

MATTE.preprocess.LocKernel_Transform(df_exp, df_pheno, kernel_type, centering_kernel=True, outer_subtract_absolute=True, double_centering=True, verbose=True)#

Kernel transformation, an important preprocessing step for cross clustering. In this step, genes from different phenotypes are regarded as different genes, and the distance between them is computed. This function is decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
kernel_type (str or function) – kernel type, one of ‘mean’ or ‘median’; a function is also allowed.
centering_kernel (bool, optional) – whether to center the kernel, defaults to True
outer_subtract_absolute (bool, optional) – whether to take the absolute value in the outer subtraction, defaults to True
double_centering (bool, optional) – whether to double-center the kernel matrix, defaults to True
verbose (bool, optional) – defaults to True

Returns

kernel matrix and mixed genes

Return type

dict
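Double centering, as named by the double_centering option, is sketched below in its textbook form: subtract each row mean and column mean, then add back the grand mean. Whether LocKernel_Transform applies exactly this formula is an assumption, and double_center is a hypothetical helper.

```python
def double_center(matrix):
    """Return B with B[i][j] = K[i][j] - row_mean[i] - col_mean[j] + grand_mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    grand_mean = sum(row_means) / n_rows
    return [[matrix[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n_cols)]
            for i in range(n_rows)]


kernel = [[1.0, 2.0],
          [3.0, 6.0]]  # made-up kernel values
centered = double_center(kernel)

# After double centering, every row and every column sums to zero.
row_sums = [sum(row) for row in centered]
col_sums = [sum(centered[i][j] for i in range(2)) for j in range(2)]
```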

MATTE.preprocess.RDC_Transform(df_exp, df_pheno, **kwargs)#

RDC transform.

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.

Returns

RDC transformed dataframe.

Return type

numpy.array

MATTE.preprocess.RDE_Transform(df_exp, df_pheno, kernel_type, absolute=True)#

RDE transform.

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
kernel_type (str) – kernel type; should be one of ‘mean’ and ‘median’.
absolute (bool, optional) – whether to take the absolute value in the outer subtraction, defaults to True

Returns

RDE transformed dataframe.

Return type

numpy.array

MATTE.preprocess.RPKM2TPM(df_exp)#

Convert RPKM to TPM. Decorated by MATTE.utils.kw_decorator().

Parameters: df_exp (pandas.DataFrame) – Expression dataframe.
Returns: TPM dataframe.
Return type: dict
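The RPKM to TPM conversion itself is a per-sample rescaling: each value is divided by the sample's RPKM total and multiplied by one million. The sketch below shows this on nested dicts rather than a DataFrame; rpkm_to_tpm and the sample and gene names are hypothetical.

```python
def rpkm_to_tpm(rpkm_by_sample):
    """Rescale each sample's RPKM values so that they sum to one million."""
    tpm = {}
    for sample, values in rpkm_by_sample.items():
        total = sum(values.values())
        tpm[sample] = {gene: v / total * 1e6 for gene, v in values.items()}
    return tpm


rpkm = {
    "s1": {"geneA": 10.0, "geneB": 30.0},
    "s2": {"geneA": 5.0, "geneB": 5.0},
}
tpm = rpkm_to_tpm(rpkm)
```

After the conversion every sample has the same total, which makes the values comparable across samples.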

MATTE.preprocess.expr_filter(df_exp, df_pheno, gene_filter=None, filter_args: dict = {})#

Filter the genes by a rule. Decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.
gene_filter (str or function, optional) – gene filter rule. None deletes genes with extremely low expression; ‘f’ filters genes by ANOVA F value (p=0.05 by default); a function is also allowed. Defaults to None
filter_args (dict, optional) – other args sent to gene filter function, defaults to {}

Returns

filtered dataframe.

Return type

dict

MATTE.preprocess.inputs_check(df_exp, df_pheno, **kwargs)#

Check whether the inputs are correct. Decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
df_pheno (pandas.Series) – Phenotype dataframe.

Raises

ValueError – If the input dataframe is not correct.

Returns

checking status

Return type

dict {mixed_genes:’OK’}

MATTE.preprocess.log2transform(df_exp)#

Log2 transform the expression dataframe. Decorated by MATTE.utils.kw_decorator().

Parameters: df_exp (pandas.DataFrame) – Expression dataframe.
Returns: Log2 transformed dataframe.
Return type: dict

MATTE.preprocess.normalization(df_exp, norm='l1')#

Normalize the expression dataframe. Decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.
norm (str, optional) – normalization type, should be one of ‘l1’, ‘l2’ and ‘standard’, defaults to ‘l1’

Returns

Normalized dataframe.

Return type

dict
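The three norm options can be illustrated on a single vector using their usual definitions. This is a sketch under assumptions: the source does not say along which axis the dataframe is normalized or how ‘standard’ is computed (population standard deviation is used here), and normalize is a hypothetical helper.

```python
import math


def normalize(values, norm="l1"):
    """Normalize one vector with the usual l1, l2 or standard (z-score) rule."""
    if norm == "l1":
        scale = sum(abs(v) for v in values)
        return [v / scale for v in values]
    if norm == "l2":
        scale = math.sqrt(sum(v * v for v in values))
        return [v / scale for v in values]
    if norm == "standard":
        mean = sum(values) / len(values)
        # Assumption: population standard deviation (divide by n).
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return [(v - mean) / std for v in values]
    raise ValueError(f"unknown norm: {norm!r}")


vector = [3.0, 4.0]
l1_scaled = normalize(vector, "l1")        # absolute values sum to 1
l2_scaled = normalize(vector, "l2")        # Euclidean length is 1
standardized = normalize(vector, "standard")  # zero mean, unit variance
```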

MATTE.utils#

MATTE.utils.affinity_matrix(data, dist_type, type='distance', **kwargs)#

Calculate an affinity matrix. Implementations from both Bio.Cluster and scipy.spatial.distance are supported.

Parameters

data (numpy.array or pandas.DataFrame) – data to calculate affinity matrix
dist_type (str) – string of distance type
type (str, optional) – distance or affinity, defaults to “distance”

Returns

matrix

Return type

np.array

MATTE.utils.kw_decorator(kw=None)#

Important decorator for functions used in the pipeline. It allows a function to be called with more keyword arguments than it accepts itself, and makes the function return a dict whose keys are set by kw.

Parameters: kw (str or list, optional) – keywords used as the keys of the function’s returned dict, defaults to None
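The behaviour described above can be sketched with a small decorator written from scratch. kw_sketch is a hypothetical stand-in, not the library's code; in particular, how the real decorator treats kw=None or a list of keys is an assumption here.

```python
import functools
import inspect


def kw_sketch(kw=None):
    """Hypothetical stand-in: ignore unknown keywords, return a dict keyed by kw."""
    def decorator(func):
        accepted = inspect.signature(func).parameters

        @functools.wraps(func)
        def wrapper(**kwargs):
            # Drop keyword arguments the wrapped function does not declare.
            result = func(**{k: v for k, v in kwargs.items() if k in accepted})
            if kw is None:
                return result          # assumption: leave the result untouched
            if isinstance(kw, str):
                return {kw: result}
            # Assumption: a list of keys pairs up with a tuple of return values.
            return dict(zip(kw, result))
        return wrapper
    return decorator


@kw_sketch("df_exp")
def halve(df_exp):
    return [v / 2 for v in df_exp]


@kw_sketch(["low", "high"])
def bounds(df_exp):
    return min(df_exp), max(df_exp)


out_single = halve(df_exp=[2, 4], df_pheno="ignored")
out_pair = bounds(df_exp=[2, 4], unused=True)
```

Filtering keywords this way lets every step be called with the whole pipeline state, while each function declares only the inputs it needs.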

MATTE.utils.printv(*text, show_time=True, verbose=True)#

Print text, optionally with the time, when verbose is True.

Parameters

show_time (bool, optional) – defaults to True
verbose (bool, optional) – defaults to True