API#

Core class#

class MATTE.AlignPipe(stats_type='mean', target='cluster', preprocess=True)#

Bases: object

The basic class in MATTE. It stores a list of functions or transformers. The pipeline can be inspected through __str__(), for example by printing the object.

The functions in funcs are called in order, and the results of each are passed to the next function. Every function should return a dict (using MATTE.utils.kw_decorator() is recommended). Finally, cluster_func is called to cluster the results.

Note

Use MATTE.AlignPipe.add_step() to add a function to funcs. Functions in funcs can also be reordered or deleted, as in a list.
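The calling convention described here can be illustrated with a minimal, library-independent sketch. MiniPipe and the step functions are hypothetical names, not part of MATTE. The sketch only assumes that each step returns a dict that updates the arguments seen by later steps, and that a final clustering function receives the accumulated results.

```python
class MiniPipe:
    """Hypothetical stand-in for an ordered, dict-passing pipeline."""

    def __init__(self):
        self.funcs = []           # ordered steps, editable like any list
        self.cluster_func = None  # final step that consumes the results

    def add_step(self, func):
        self.funcs.append(func)

    def run(self, **state):
        # Each step sees the accumulated state and returns a dict
        # that updates it for the steps that follow.
        for func in self.funcs:
            state.update(func(**state))
        return self.cluster_func(**state)


def double(values, **_):
    return {"values": [2 * v for v in values]}


def shift(values, offset=1, **_):
    return {"values": [v + offset for v in values]}


def split_by_sign(values, **_):
    # Toy "clustering": label each value by its sign.
    return [0 if v < 0 else 1 for v in values]


pipe = MiniPipe()
pipe.add_step(double)
pipe.add_step(shift)
pipe.cluster_func = split_by_sign
labels = pipe.run(values=[-1, 0, 2])

# Steps live in a plain list, so they can be reordered or removed.
pipe.funcs.reverse()
labels_reversed = pipe.run(values=[-1, 0, 2])
```

Reversing the step order changes the intermediate values and therefore the labels, which is why the order of funcs matters.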

add_param(**kwargs)#

Add a function that returns parameters to funcs. Multiple parameters can be added at the same time.

add_step(func, *setting, **kwsetting)#

Add a function to funcs

Parameters

func (function) – function (should be decorated by kw_decorator()) to add

add_transformer(transformer)#

Add a transformer to funcs

Parameters

transformer (object that has fit and transform methods) – transformer to add

Raises

TypeError – if transformer is not a transformer
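A duck-typing check of this kind can be sketched as follows. check_transformer and Centerer are hypothetical names used for illustration; how MATTE performs the check internally is not stated here.

```python
def check_transformer(obj):
    """Raise TypeError unless obj exposes callable fit and transform."""
    for name in ("fit", "transform"):
        if not callable(getattr(obj, name, None)):
            raise TypeError(f"{type(obj).__name__} has no callable {name}()")
    return obj


class Centerer:
    """Toy transformer: subtracts the mean learned in fit()."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


centered = check_transformer(Centerer()).fit([1, 2, 3]).transform([1, 2, 3])

try:
    check_transformer(object())  # a plain object has neither method
    rejected = False
except TypeError:
    rejected = True
```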

calculate(df_exp, df_pheno, verbose=True)#

Calculate the data using the pipeline.

Parameters
  • df_exp (pandas.DataFrame) – expression data whose index are genes and columns are samples

  • df_pheno (pandas.Series) – phenotype data whose index are samples

  • verbose (bool, optional) – defaults to True

Returns

clustering results

Return type

MATTE.analysis.ClusterResult

calculate_from_temp(tmpt_result, verbose=True)#

Calculate the data using the pipeline, starting from a saved temporary result.

Parameters
  • tmpt_result (dict) – temp result saved by MATTE.AlignPipe.fit_transform()

  • verbose (bool, optional) – defaults to True

Returns

clustering results

Return type

MATTE.analysis.ClusterResult

find_best_KernelTrans_params(df_exp, df_pheno, n_downsample=None, n_iters=None, inplace=True, verbose=True)#

Find the best parameters for Kernel_Transform according to the clustering error.

Parameters
  • df_exp (pandas.DataFrame) – expression data whose index are genes and columns are samples

  • df_pheno (pandas.Series) – phenotype data whose index are samples

  • n_downsample (int, optional) – number of down sample, defaults to None

  • n_iters (int, optional) – number of iterations, defaults to None

  • inplace (bool, optional) – inplace pipeline, defaults to True

  • verbose (bool, optional) – defaults to True

Returns

best parameters or AlignPipe

Return type

AlignPipe or dict

get_attribute_from_transformer(attribute)#

Get the attribute from the transformer.

Parameters

attribute (str) – attribute name

Returns

attribute value

Return type

any

init_pipeline(stats_type, target, preprocess: bool)#

Initialize the pipeline.

Parameters
  • stats_type (str) – type of statistics

  • target (str) – target name

  • preprocess (bool) – whether to preprocess the data

set_cluster_method(func, *setting, **kwsetting)#

Add a function to cluster_func

Parameters

func (function) – function (should be decorated by kw_decorator()) to add

class MATTE.GeneRanker(view, pipeline=None)#

Bases: object

MATTE GeneRanker.

There are several types of GeneRanker:

1. ‘module’ and ‘gene’: genes are clustered according to their expression, and the module SNR is used to rank genes. In gene mode, the SNR is corrected by the correlation between gene expression and the module eigenvector.

2. ‘dist’ and ‘cross-dist’: in dist mode, the distance between each pair of genes is calculated, and genes are ranked according to the sum of their distances to all other genes. In cross-dist mode, the distances of differential expression and differential co-expression are merged.

Note

The inputs of GeneRanker are not oriented the same way as those of the pipeline: rows of the expression data are samples and columns are genes.

gene_rank(X, y, verbose=True, **kwargs)#

Rank genes.

Parameters
  • X (pd.DataFrame) – Expression data whose index are samples and columns are genes.

  • y (pd.Series) – phenotypes whose index are samples.

  • verbose (bool, optional) – defaults to True

Returns

gene ranking score

Return type

pd.Series

gene_rank_dist(X, y, verbose=True, **kwargs)#

Rank genes by distance in each pair of phenotypes. Running this method sets GeneRanker.dist_mat.

Parameters
  • X (pd.DataFrame) – Expression data whose index are samples and columns are genes.

  • y (pd.Series) – phenotypes whose index are samples.

  • verbose (bool, optional) – defaults to True

Returns

gene ranking score

Return type

pd.Series
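According to the class description, dist mode ranks genes by the sum of their distances to all other genes. A minimal sketch of that ranking rule on a made-up distance matrix (the gene names and values are illustrative, and MATTE's own distance computation is not reproduced):

```python
genes = ["g1", "g2", "g3"]

# Symmetric pairwise distance matrix with a zero diagonal (made-up values).
dist_mat = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]

# Score of a gene = sum of its distances to every other gene.
scores = {gene: sum(row) for gene, row in zip(genes, dist_mat)}

# Genes ordered from the highest score to the lowest.
ranking = sorted(scores, key=scores.get, reverse=True)
```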

gene_rank_module(X: DataFrame, y, verbose=True, **kwargs)#

Rank genes by their MATTE.analysis.ClusterResult.ModuleSNR() in each pair of phenotypes. The gene scores within one phenotype sum to 1000. Running this method also sets GeneRanker.gene_ranking_sep, used to calculate the score of each phenotype.

Parameters
  • X (pd.DataFrame) – Expression data whose index are samples and columns are genes.

  • y (pd.Series) – phenotypes whose index are samples.

  • verbose (bool, optional) – defaults to True

Returns

gene ranking score

Return type

pd.Series
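One reading of the "sum to 1000" statement is a plain rescaling of the scores. The sketch below shows that rescaling only; scale_to_total is a hypothetical helper, and whether MATTE applies exactly this step is an assumption.

```python
def scale_to_total(scores, total=1000.0):
    """Rescale scores so that they add up to `total`."""
    current = sum(scores.values())
    if current == 0:
        raise ValueError("scores sum to zero; cannot rescale")
    return {gene: total * value / current for gene, value in scores.items()}


scaled = scale_to_total({"g1": 1.0, "g2": 3.0})
```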

pipeline_clustering(X, y, verbose=True)#

Cluster the data using the pipeline for each pair of phenotypes. A progress bar informs the user about the progress.

Parameters
  • X (pandas.DataFrame) – expression data whose index are samples and columns are genes

  • y (pandas.Series) – phenotype data whose index are samples

  • verbose (bool, optional) – defaults to True

save(save_path)#

Save the embedder using dill.

Parameters

save_path (str) – path to save the embedder

class MATTE.PipeFunc(func, name=None, *args, **kwargs)#

Bases: object

PipeFunc wraps a function and stores its positional and keyword arguments, but does not run the function until the PipeFunc itself is called. It is used in AlignPipe as a step.

add_params(*args, **kwargs)#

Add parameters to the PipeFunc and refresh its string representation.
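The idea of storing a call and running it later can be sketched without MATTE. DeferredCall is a hypothetical class that mirrors the constructor signature shown for PipeFunc. How stored and call-time arguments are combined is an assumption of this sketch.

```python
class DeferredCall:
    """Hypothetical wrapper: stores a function and its arguments, runs later."""

    def __init__(self, func, name=None, *args, **kwargs):
        self.func = func
        self.name = name or func.__name__
        self.args = args
        self.kwargs = dict(kwargs)

    def add_params(self, *args, **kwargs):
        # Extend the stored arguments; __str__ reflects them automatically.
        self.args += args
        self.kwargs.update(kwargs)

    def __str__(self):
        parts = [repr(a) for a in self.args]
        parts += [f"{k}={v!r}" for k, v in self.kwargs.items()]
        return f"{self.name}({', '.join(parts)})"

    def __call__(self, *args, **kwargs):
        # Stored arguments are combined with call-time arguments.
        return self.func(*self.args, *args, **{**self.kwargs, **kwargs})


def power(base, exponent=2):
    return base ** exponent


step = DeferredCall(power)
step.add_params(exponent=3)
text = str(step)   # nothing has been executed yet
result = step(2)   # the wrapped function runs only now
```

The standard library's functools.partial offers the same deferred-call behaviour when no custom string representation is needed.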

MATTE.merged_pipeline_clustering(df_exp: DataFrame, df_pheno: Series, pipelines: list, verbose=True)#

Cluster genes using multiple pipelines.

Parameters
  • df_exp (pd.DataFrame) – Expression data whose index are genes and columns are samples.

  • df_pheno (pd.Series) – phenotypes whose index are samples.

  • pipelines (list) – list of MATTE.AlignPipe

  • verbose (bool, optional) – defaults to True

Returns

Cluster Result

Return type

MATTE.analysis.ClusterResult

Submodules#

MATTE.analysis#

class MATTE.analysis.ClusterResult(cluster_res, before_cluster_df, df_exp, df_pheno, cluster_properties, order_rule='input')#

Bases: object

A class that stores the result of clustering. It contains both the inputs and the results.

res#

The result of clustering, a pandas.DataFrame that includes each gene’s label in each phenotype.

df_exp#

The input of clustering, a pandas.DataFrame

df_pheno#

The input of clustering, a pandas.Series

cluster_properties#

The property of clustering, a dict containing the score, loss, and some clustering parameters.

label#

All labels, a numpy.array

n_cluster#

The number of cluster, an int

JM#

The J-matrix of the clustering, a numpy.array that can be calculated from res

module_genes#

The genes of each module, a dict

Note

Only modules that are not “matched” are contained in this dict.

GeneSNR(sample_feature, pheno=None)#

Use ClusterResult.ModuleSNR() to calculate the corrected SNR of each gene.

\(\text{GeneSNR} = \text{ModuleSNR} \times \text{PearsonCorrelation}(\text{GeneExpression}, \text{ModuleEigenvector})\)

Parameters
  • sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()

  • pheno (array-like, optional) – the label of samples, defaults to None (set to be ClusterResult.pheno)

Returns

SNR, a pd.Series whose index is Module ID, and value is the SNR.

Return type

pd.Series
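The GeneSNR formula can be worked through by hand on toy numbers. The values below are made up, and pearson is a small helper written for this sketch. How MATTE obtains the module SNR and eigenvector is not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


module_snr = 2.0                       # SNR of the module the genes belong to
module_eigen = [1.0, 2.0, 3.0, 4.0]    # module eigenvector across four samples
gene_up = [2.0, 4.0, 6.0, 8.0]         # perfectly correlated with the module
gene_down = [8.0, 6.0, 4.0, 2.0]       # perfectly anti-correlated

gene_snr_up = module_snr * pearson(gene_up, module_eigen)
gene_snr_down = module_snr * pearson(gene_down, module_eigen)
```

A gene that tracks the module eigenvector keeps the full module SNR, while an anti-correlated gene receives the same magnitude with the opposite sign.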

MCCorrFeature(model=None)#

Calculate group features, using PCA because it preserves more information than other methods.

Parameters

model (object with fit and fit_transform, optional) – model used for decomposition, defaults to None

Returns

group_feature: dict,keys : module id | values : eigenvector

group_weight: dict,keys : module id | values : lambda of model

Return type

tuple

MCFeature(model=None)#

Calculate group features, using PCA because it preserves more information than other methods.

Note

If n_components >= 2, more than one eigenvector can be kept for an MC so that the explained ratio is >= 80%.

Parameters
  • df_exp (pd.DataFrame, optional) – defaults to None, ClusterResult.df_exp

  • module_genes (dict, optional) – the mapping of modules to genes, defaults to None (the result of MATTE.analysis.ClusterResult._module_genes())

  • model (object, optional) – the model used to fit and transform the data; it needs a fit_transform method. Defaults to None (sklearn.decomposition.PCA(n_components=1))

Returns

group_feature: dict,keys : module id | values : eigenvector, group_weight: dict,keys : module id | values : lambda of model

Return type

tuple

MCkwtest(sample_feature, pheno_series)#

Perform the Kruskal-Wallis test on each MC (to test whether each module’s eigenvector is different).

Parameters
Returns

kws: a pd.Series whose index is Module ID, and value is the p-value.

Return type

pd.Series

ModuleSNR(sample_feature, pheno=None)#

Calculate the SNR of each module.

Parameters
  • sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function MATTE.analysis.ClusterResult.SampleFeature()

  • pheno (array-like, optional) – the label of samples, defaults to None (set to be ClusterResult.pheno)

Returns

SNR, a pd.Series whose index is Module ID, and value is the SNR.

Return type

pd.Series

PhenoMCCorr(sample_feature, pheno_series=None, pheno=None)#

Calculate the correlation between the phenotype and each MC’s eigenvector, test the correlation, and return the p-value.

Parameters
  • sample_feature (MATTE.WeightedDataFrame) – can be calculated by the function ClusterResult.SampleFeature()

  • pheno_series (pd.Series) – specify samples’ phenotype

  • pheno (any, optional) – which phenotype to calculate for, defaults to None (the first one in pheno_series)

Returns

MC_corr: a pd.Series whose index is Module ID and whose value is the Pearson correlation. corr_p: p-value of the correlation (Student’s t-test)

Return type

tuple

SampleFeature(corr=False, **kwargs)#

Calculate the samples’ module features (average by default).

Parameters

corr (bool, optional) – if True, calculate correlation between samples, defaults to False

Returns

depends on the argument return_df

Return type

samples_feature is a WeightedDataFrame object whose index is sample IDs and whose columns are module IDs. Module weights are calculated by the function MCFeature.

Vis_Jmat()#

Plot the J matrix as a heatmap.

Returns

figure

Return type

matplotlib.figure.Figure

summary(fig=True)#

Summarize the cluster results: print ClusterResult.cluster_properties and plot the J matrix.

Parameters

fig (bool, optional) – whether to plot figures or not, defaults to True

Returns

summary of cluster results

Return type

figures or None

MATTE.analysis.Fig_Fuction(df_enrich, color_columns, cmap=<matplotlib.colors.ListedColormap object>, width_height_ratio=2, **figargs)#

Plot the function enrichment result.

Note

Filter the result first!

Parameters
  • df_enrich (pd.DataFrame) – the result of FunctionEnrich()

  • color_columns (str) – column shown as color, ‘fdr’ or ‘p_value’

  • cmap (object, optional) – chosen from matplotlib.cm, defaults to cm.Set1

  • width_height_ratio (int, optional) – defaults to 2

Returns

the figure

Return type

plt.figure

MATTE.analysis.Fig_SampleFeature(sample_feature, labels, color=None, model=None, weighted_distcance=False, metric='euclidean', **fig_args)#

Plot the samples as represented by MCs.

Parameters
  • sample_feature (WeightedDataFrame or pd.DataFrame) – index: sample IDs | columns: module IDs. Either the result of ClusterResult.SampleFeature() or a table that uses some genes to describe each sample.

  • labels (pd.Series) – index: sample IDs | values: phenotype

  • color (array like, optional) – a second label, shown as color; the first label is shown as marker shape. Defaults to None

  • model (object, optional) – the model used to project samples into two dimensions, defaults to None (sklearn.decomposition.PCA(n_components=2))

  • weighted_distcance (bool, optional) – whether to use weighted distance. If sample_feature is a pd.DataFrame, it should be set to False. Defaults to False

  • metric (str, optional) – metric used to calculate the distance, defaults to ‘euclidean’

Returns

the figure

Return type

plt.figure

MATTE.analysis.FunctionEnrich(annote_file, gene_set, category_seperate_cal=True)#

Perform GO enrichment.

Parameters
  • annote_file (str or pd.DataFrame) –

    Path to the function annotation file, or a dataframe

    Note

    Tab-separated file with columns: [“Term_ID”, “GeneID”, “Term”, “Category”]. The file can be downloaded from https://ftp.ncbi.nih.gov/gene/DATA/

  • gene_set (array-like) – gene set to analyze

  • category_seperate_cal (bool, optional) – whether to calculate the FDR within each category or over all categories mixed, defaults to True

Returns

items, category, enriched item number, background item number, p value, fdr

Return type

pd.DataFrame
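The statistical test behind FunctionEnrich is not stated in this reference. A common choice for enrichment analysis is a one-sided hypergeometric test followed by Benjamini-Hochberg FDR correction, and the sketch below assumes exactly that. The helper names and the numbers are illustrative only.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, annotated, background):
    """P(X >= overlap) when drawing set_size genes from the background."""
    upper = min(set_size, annotated)
    total = comb(background, set_size)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 5 of the 5 genes in the set carry a term that annotates 5 of 10 genes.
p = hypergeom_pvalue(overlap=5, set_size=5, annotated=5, background=10)

# Correct a list of per-term p-values (made-up values).
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Calling benjamini_hochberg once per category, or once over all terms, corresponds to the two options that category_seperate_cal describes.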

class MATTE.analysis.WeightedDataFrame(weight=None, data=None, index=None, columns=None, dtype=None, copy=None)#

Bases: DataFrame

A weighted dataframe that inherits from pandas.DataFrame and adds a new attribute, weight, to store the weight of each column.

weight_distance(metric='c', **kwargs)#

Calculate weighted distance.

Parameters

metric (str, optional) – The distance metric to use., defaults to ‘c’

Note

Results from the two modules can differ considerably, because their definitions of a metric are not the same.

The distance function can be:

Called from scipy.spatial.distance.pdist:

‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

Called from Bio.Cluster.distancematrix:

‘e’: Euclidean; ‘b’: City-block; ‘c’: Pearson correlation; ‘a’: Absolute of Pearson correlation; ‘u’: Uncentered Pearson correlation; ‘x’: Absolute of Uncentered Pearson correlation; ‘s’: Spearman’s correlation; ‘k’: Kendall’s τ.

Returns

distance matrix

Return type

numpy.array

MATTE.cluster#

class MATTE.cluster.CrossCluster(presetting='kmeans', use_affinity=False, verbose=True, **kwargs)#

Bases: object

Cross clustering of a mixed expression matrix. If no preset is given, CrossCluster.build_from_func() or CrossCluster.build_from_model() must be used to set the clustering function.

build_from_func(func)#

Build this class from a function.

Parameters

func (function) – function used to cluster

build_from_model(model, model_attr='labels_', **calling_kwargs)#

Build this class from a model.

Parameters
  • model (object) – model used to cluster; it must have a fit method and the attribute named by model_attr

  • model_attr (str, optional) – defaults to ‘labels_’
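The protocol implied by these parameters (fit the model, then read the labels from a named attribute) can be sketched as follows. labels_from_model and ThresholdModel are hypothetical, and forwarding calling_kwargs to fit is an assumption of this sketch rather than documented behaviour.

```python
def labels_from_model(model, data, model_attr="labels_", **calling_kwargs):
    """Fit `model` on `data`, then read the labels from `model_attr`."""
    model.fit(data, **calling_kwargs)
    return getattr(model, model_attr)


class ThresholdModel:
    """Toy one-dimensional 'clustering': split values at a threshold."""

    def fit(self, data, threshold=0.0):
        self.labels_ = [int(x > threshold) for x in data]
        return self


model_labels = labels_from_model(ThresholdModel(), [-1.0, 0.5, 2.0], threshold=1.0)
```

Estimators that follow the scikit-learn convention expose their cluster assignments as labels_ after fit, which matches the default value of model_attr.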

preset()#

Preset the clustering function.

Raises

NotImplementedError – if the preset is not implemented

preset_kmeans()#

Preset k-means. Default kwargs: n_clusters=8, method=‘a’, npass=20

preset_spectral_bicluster()#

Preset spectral biclustering. Default kwargs: n_clusters=8, use_aff=True, n_init=10, method=‘log’, n_component=m_cluster

preset_spectrum()#

Preset spectral clustering. Default kwargs: n_clusters=8, use_aff=True, n_init=10

MATTE.cluster.Cross_Distance(before_cluster_df, metric='euclidean', weights=None, kwargs={}, verbose=True)#

Calculate the distance between each cluster and every other cluster.

Parameters
  • before_cluster_df (pandas.DataFrame or np.array) – dataframe to cluster

  • metric (str, optional) – distance metric (implemented by the scipy package; more information can be found in scipy’s documentation), defaults to ‘euclidean’

  • weights (array-like, optional) – weights used when calculating the distance, defaults to None

  • kwargs (dict, optional) – other keyword arguments passed to the cdist function, defaults to {}

  • verbose (bool, optional) – defaults to True

Returns

distance matrix

Return type

numpy.array

MATTE.cluster.build_results(cluster_label, cluster_properties, df_exp, df_pheno, before_cluster_df, order_rule='input', verbose=True)#

Build a MATTE.analysis.ClusterResult from the results of CrossCluster. Decorated by MATTE.utils.kw_decorator().

Parameters
  • cluster_label (array like) – label of clustering

  • mixed_genes (array like) – index of mixed genes(in the order of cluster_label)

  • cluster_properties (dict) – information obtained from clustering

  • df_exp (pandas.DataFrame) – input to the whole pipeline

  • df_pheno (pandas.Series) – input to the whole pipeline

  • before_cluster_df (pandas.DataFrame) – input to CrossCluster

  • order_rule (str, optional) – how to order modules: ‘input’ (mean expression), ‘size’ (number of module genes) or a function; defaults to ‘input’

  • verbose (bool, optional) – defaults to True

Returns

results

Return type

MATTE.analysis.ClusterResult

MATTE.preprocess#

MATTE.preprocess.CorrKernel_Transform(df_exp, df_pheno, n_components=16)#

Correlation Kernel Transform. This function is decorated by MATTE.utils.kw_decorator().

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • df_pheno (pandas.Series) – Phenotype dataframe.

  • n_components (int, optional) – number of components used in spectral embedding, defaults to 16

Returns

kernel matrix

Return type

dict { key:’before_cluster_df’ ; value:numpy.array}

MATTE.preprocess.LocKernel_Transform(df_exp, df_pheno, kernel_type, centering_kernel=True, outer_subtract_absolute=True, double_centering=True, verbose=True)#

Kernel transformation, an important preprocessing step for cross clustering. In this step, genes from different phenotypes are regarded as different genes, and the distance between them is computed. This function is decorated by MATTE.utils.kw_decorator().

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • df_pheno (pandas.Series) – Phenotype dataframe.

  • kernel_type (str or function) – Kernel type, one of the following: ‘mean’,’median’,functions are also allowed.

  • centering_kernel (bool, optional) – whether to center the kernel, defaults to True

  • outer_subtract_absolute (bool, optional) – whether to take the absolute value in the outer subtraction, defaults to True

  • double_centering (bool, optional) – whether to double-center the kernel matrix, defaults to True

  • verbose (bool, optional) – defaults to True

Returns

kernel matrix and mixed genes

Return type

dict
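Two of the operations named in these parameters, outer subtraction and double centering, have standard definitions that can be shown on a toy vector. The sketch below illustrates those generic operations only. It does not reproduce MATTE's kernel, and the input values are made up.

```python
def outer_subtract(values, absolute=True):
    """Matrix of pairwise differences values[i] - values[j]."""
    diff = [[a - b for b in values] for a in values]
    if absolute:
        diff = [[abs(d) for d in row] for row in diff]
    return diff


def double_center(matrix):
    """Subtract row and column means and add back the grand mean."""
    n = len(matrix)  # assumes a square matrix
    row_means = [sum(row) / n for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_means) / n
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand for j in range(n)]
        for i in range(n)
    ]


gene_means = [1.0, 2.0, 4.0]  # made-up per-gene summary statistic
pairwise = outer_subtract(gene_means)
kernel = double_center(pairwise)
```

After double centering, every row and every column of the matrix sums to zero.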

MATTE.preprocess.RDC_Transform(df_exp, df_pheno, **kwargs)#

RDC transform.

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • df_pheno (pandas.Series) – Phenotype dataframe.

Returns

RDC transformed dataframe.

Return type

numpy.array

MATTE.preprocess.RDE_Transform(df_exp, df_pheno, kernel_type, absolute=True)#

RDE transform.

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • df_pheno (pandas.Series) – Phenotype dataframe.

  • kernel_type (str) – kernel type, should be either ‘mean’ or ‘median’.

  • absolute (bool, optional) – whether to take the absolute value in the outer subtraction, defaults to True

Returns

RDE transformed dataframe.

Return type

numpy.array

MATTE.preprocess.RPKM2TPM(df_exp)#

Convert RPKM to TPM. Decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.

Returns

TPM dataframe.

Return type

dict
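The usual RPKM-to-TPM conversion divides each gene's RPKM by the sum of RPKM values in the same sample and multiplies by one million. The sketch below applies that formula to a single made-up sample. It assumes, without confirmation from this reference, that MATTE uses the same definition.

```python
def rpkm_to_tpm(rpkm):
    """Convert one sample's RPKM values (gene -> value) to TPM."""
    total = sum(rpkm.values())
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


sample_rpkm = {"g1": 10.0, "g2": 30.0, "g3": 60.0}
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

By construction, the TPM values of a sample always sum to one million, which makes samples directly comparable.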

MATTE.preprocess.expr_filter(df_exp, df_pheno, gene_filter=None, filter_args: dict = {})#

Filter the genes by a rule. Decorated by MATTE.utils.kw_decorator().

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • df_pheno (pandas.Series) – Phenotype dataframe.

  • gene_filter (str or function, optional) – gene filter rule. None deletes genes with extremely low expression; f filters genes by ANOVA F value (p=0.05 by default); a function is also allowed. Defaults to None

  • filter_args (dict, optional) – other args sent to gene filter function, defaults to {}

Returns

filtered dataframe.

Return type

dict

MATTE.preprocess.inputs_check(df_exp, df_pheno, **kwargs)#

Check whether the inputs are correct. Decorated by MATTE.utils.kw_decorator().

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • df_pheno (pandas.Series) – Phenotype dataframe.

Raises

ValueError – If the input dataframe is not correct.

Returns

checking status

Return type

dict {mixed_genes:’OK’}

MATTE.preprocess.log2transform(df_exp)#

Log2-transform the expression dataframe. Decorated by MATTE.utils.kw_decorator().

Parameters

df_exp (pandas.DataFrame) – Expression dataframe.

Returns

Log2 transformed dataframe.

Return type

dict

MATTE.preprocess.normalization(df_exp, norm='l1')#

Normalize the expression dataframe. Decorated by MATTE.utils.kw_decorator().

Parameters
  • df_exp (pandas.DataFrame) – Expression dataframe.

  • norm (str, optional) – normalization type, should be one of l1, l2 and standard, defaults to ‘l1’

Returns

Normalized dataframe.

Return type

dict
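The three normalization types can be illustrated on a single vector. This is a generic sketch: which axis MATTE normalizes along (per gene or per sample) and whether standard uses the population or the sample standard deviation are assumptions here (the population form is used).

```python
import math


def normalize(values, norm="l1"):
    """Normalize a vector with the l1, l2 or standard (z-score) rule."""
    if norm == "l1":
        scale = sum(abs(v) for v in values)
        return [v / scale for v in values]
    if norm == "l2":
        scale = math.sqrt(sum(v * v for v in values))
        return [v / scale for v in values]
    if norm == "standard":
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        return [(v - mean) / std for v in values]
    raise ValueError(f"unknown norm: {norm!r}")


l1_result = normalize([1.0, 1.0, 2.0], norm="l1")
l2_result = normalize([3.0, 4.0], norm="l2")
z_result = normalize([1.0, 3.0], norm="standard")
```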

MATTE.utils#

MATTE.utils.affinity_matrix(data, dist_type, type='distance', **kwargs)#

Calculate the affinity matrix. Implementations from both Bio.Cluster and scipy.spatial.distance are supported.

Parameters
  • data (numpy.array or pandas.DataFrame) – data to calculate affinity matrix

  • dist_type (str) – string of distance type

  • type (str, optional) – distance or affinity, defaults to “distance”

Returns

matrix

Return type

np.array

MATTE.utils.kw_decorator(kw=None)#

Important decorator for functions used in a pipeline. It allows a function to be called with more keyword arguments than it accepts itself, and makes the function return a dict whose keys are set by kw.

Parameters

kw (str or list, optional) – keywords used as the keys of the function’s returned dict, defaults to None
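The behaviour described for this decorator can be approximated in a few lines. kw_sketch is a hypothetical re-implementation for illustration, not MATTE's code. It assumes that surplus keyword arguments are silently dropped and that a single key wraps the whole return value.

```python
import functools
import inspect


def kw_sketch(kw=None):
    """Hypothetical decorator illustrating the described behaviour."""
    keys = [kw] if isinstance(kw, str) else kw

    def decorator(func):
        accepted = inspect.signature(func).parameters

        @functools.wraps(func)
        def wrapper(**kwargs):
            # Drop keyword arguments the wrapped function does not declare.
            usable = {k: v for k, v in kwargs.items() if k in accepted}
            result = func(**usable)
            if keys is None:
                return result
            if len(keys) == 1:
                return {keys[0]: result}
            # Several keys: pair them with the elements of the return value.
            return dict(zip(keys, result))

        return wrapper

    return decorator


@kw_sketch("df_exp")
def add_one(df_exp):
    return [v + 1 for v in df_exp]


@kw_sketch(["low", "high"])
def bounds(df_exp):
    return min(df_exp), max(df_exp)


out = add_one(df_exp=[1, 2], df_pheno=["a", "b"])  # df_pheno is ignored
out_pair = bounds(df_exp=[1, 2], verbose=True)     # verbose is ignored
```

Returning dicts keyed by argument name is what lets each step of a pipeline update the shared arguments of the steps after it.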

MATTE.utils.printv(*text, show_time=True, verbose=True)#

Print text, optionally with the time, when verbose is True.

Parameters
  • show_time (bool, optional) – defaults to True

  • verbose (bool, optional) – defaults to True