cia package

cia.external module

cia.external.celltypist_majority_vote(data, classification_obs, groups_obs=None, min_prop=0, unassigned_label='Unassigned')

A function that wraps Celltypist majority voting (DOI: 10.1126/science.abl5197). Assigns cell group labels based on the majority voting of cell type predictions within each group.

If no reference cell groups are provided, an over-clustering step is performed using the Leiden algorithm.

Parameters:
  • data (anndata.AnnData) – An AnnData object containing the cell data and, optionally, previous clustering results.

  • classification_obs (str or list of str) – The AnnData.obs column(s) where the cell type predictions (labels) are stored.

  • groups_obs (str, optional) – The AnnData.obs column where the reference group labels are stored. If None, an over-clustering with the Leiden algorithm is performed based on the dataset size.

  • min_prop (float, optional) – The minimum proportion of cells required to assign a majority vote label to a group. If the most frequent cell type in a group does not reach this proportion, the group is labeled with unassigned_label. Default is 0.

  • unassigned_label (str, optional) – The label to assign to cell groups where no cell type reaches the minimum proportion. Default is ‘Unassigned’.

Notes

The function automatically adjusts the resolution for the Leiden algorithm based on the number of observations in the data. Results of majority voting are stored back in the AnnData.obs, adding a column for each classification considered.
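
The voting rule described above can be sketched in plain Python. This is a minimal illustration, not the wrapped CellTypist code: the helper name majority_vote is hypothetical, and treating a group whose top label exactly equals min_prop as assigned is an assumption.

```python
from collections import Counter, defaultdict


def majority_vote(predictions, groups, min_prop=0.0, unassigned_label="Unassigned"):
    """Return one label per cell: the most common prediction in the cell's group."""
    # Collect the per-cell predictions of every group.
    by_group = defaultdict(list)
    for label, group in zip(predictions, groups):
        by_group[group].append(label)

    group_label = {}
    for group, labels in by_group.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        # Keep the winner only if it covers at least min_prop of the group
        # (assumption: reaching the proportion exactly is enough).
        if top_count / len(labels) >= min_prop:
            group_label[group] = top_label
        else:
            group_label[group] = unassigned_label

    # Broadcast the group-level label back to every cell.
    return [group_label[group] for group in groups]


predictions = ["T", "T", "B", "B", "NK", "T"]
groups = ["0", "0", "0", "1", "1", "1"]
voted = majority_vote(predictions, groups, min_prop=0.5)
print(voted)
```

In this toy input, group ‘0’ is labeled ‘T’ (2 of 3 cells), while no label reaches 50% in group ‘1’, so its cells receive the unassigned label.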

cia.investigate module

cia.investigate.CIA_classify(data, signatures_input, n_cpus=None, similarity_threshold=0, label_column='CIA_prediction', score_mode='scaled', unassigned_label='Unassigned')

Classify cells in data based on gene signature scores.

This function computes scaled signature scores for the provided data against a set of gene signatures. It then classifies each cell based on the highest score unless the top two scores are too similar, in which case it assigns an ‘Unassigned’ label.

Parameters:
  • data (AnnData) – An AnnData object containing the dataset to compute scores for, expected to have a raw attribute containing a matrix (X) and var_names.

  • signatures_input (str or dict) – Path to a file or a dictionary containing gene signatures. If a string is provided, it should be the file path or URL to the signature file.

  • n_cpus (int, optional) – Number of CPU cores to use for parallel computation. If None, all available cores are used.

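  • similarity_threshold (float, optional) – The threshold below which the difference between the top two scores is considered too small, resulting in an ‘Unassigned’ label. Defaults to 0; for example, a value of 0.1 flags differences < 10%.

  • label_column (str, optional) – The column name in data.obs where the classification labels will be stored. Defaults to ‘CIA_prediction’.

  • score_mode (str, optional) – The mode of score calculation, as in score_all_signatures. Defaults to ‘scaled’.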

  • unassigned_label (str, optional) – The label to assign when the top two scores are too similar. Defaults to ‘Unassigned’.

Returns:

The function directly modifies the data object by adding classification labels to data.obs.

Return type:

None

Notes

The function calculates signature scores using the score_all_signatures function. The highest score is used for classification unless it is within the similarity_threshold of the second-highest score.
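
The decision rule for a single cell can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper name classify_by_top_score is hypothetical, and "too similar" is assumed to mean an absolute difference between the two highest scores that is strictly below similarity_threshold.

```python
def classify_by_top_score(scores, similarity_threshold=0.0, unassigned_label="Unassigned"):
    """Pick the best-scoring signature for one cell, or unassigned_label on a near tie.

    scores maps signature names to that cell's score and needs at least two entries.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_name, best_score), (_, second_score) = ranked[0], ranked[1]
    # Assumption: the comparison is on the absolute difference of the top two scores.
    if best_score - second_score < similarity_threshold:
        return unassigned_label
    return best_name


cell_a = {"T cell": 0.9, "B cell": 0.2, "NK cell": 0.1}
cell_b = {"T cell": 0.55, "NK cell": 0.5, "B cell": 0.1}
print(classify_by_top_score(cell_a, similarity_threshold=0.1))
print(classify_by_top_score(cell_b, similarity_threshold=0.1))
```

Under these assumptions the first cell is labeled ‘T cell’, and the second is left unassigned because its two best scores differ by only 0.05.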

Examples

>>> data = sc.read_h5ad('path/to/your/data.h5ad')  # Assume sc is Scanpy and data is loaded
>>> signatures_input = 'path/to/signatures.txt'
>>> CIA_classify(data, signatures_input, similarity_threshold=0.1)
>>> data.obs['CIA_prediction']

cia.investigate.load_signatures(signatures_input, description_field_available=True)

Load gene signatures from a given source.

This function loads gene signatures from either a local file path, a URL, or directly from a dictionary. If a file path or URL is provided, the file should be in tab-separated format with the first column as keys and subsequent columns as values.

Parameters:
  • signatures_input (str or dict) – The source of the gene signatures. This can be a path to a tab-separated file, a URL pointing to such a file, or a dictionary where keys are signature names and values are lists of gene names.

  • description_field_available (bool) – Whether the file contains a description field. Set to False to accommodate “custom” gmt file formats in which the description field is not provided. Defaults to True.

Returns:

A dictionary where each key is a signature name and each value is a list of gene names associated with that signature.

Return type:

dict

Raises:

TypeError – If signatures_input is neither a string (for file paths or URLs) nor a dictionary.
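
As an illustration of the expected file layout, the sketch below parses tab-separated signature lines from an in-memory buffer. It is not the library's parser: the helper name parse_signatures is hypothetical, and it assumes the usual GMT convention in which the description, when present, occupies the second column.

```python
import io


def parse_signatures(handle, description_field_available=True):
    """Parse tab-separated lines into {signature_name: [gene, ...]}."""
    # Assumption: with a description field, genes start at the third column.
    first_gene_column = 2 if description_field_available else 1
    signatures = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        signatures[fields[0]] = [gene for gene in fields[first_gene_column:] if gene]
    return signatures


gmt_text = "Tcell\tT cell markers\tCD3D\tCD3E\nBcell\tB cell markers\tCD79A\tMS4A1\n"
signatures = parse_signatures(io.StringIO(gmt_text))
print(signatures["Tcell"])
```

With description_field_available=False, the same helper would treat the second column as the first gene.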

Examples

>>> signatures = load_signatures('signatures.tsv')
>>> print(signatures['signature1'])
['gene1', 'gene2', 'gene3']
>>> signatures_dict = {'signature1': ['gene1', 'gene2'], 'signature2': ['gene3', 'gene4']}
>>> signatures = load_signatures(signatures_dict)
>>> print(signatures['signature1'])
['gene1', 'gene2']

cia.investigate.score_all_signatures(data, signatures_input, score_mode='raw', return_df=False, n_cpus=None)

Compute signature scores for a given dataset and a set of gene signatures.

This function checks which genes from the signatures are present in the dataset, computes the signature scores for each cell in the dataset, and can return the scores as a DataFrame or add them to the data.obs.

Parameters:
  • data (AnnData) – An AnnData object containing the dataset to compute scores for, expected to have a raw attribute containing a matrix (X) and var_names.

  • signatures_input (str or dict) – Path to a file or a dictionary containing gene signatures. If a string is provided, it should be the file path or URL to the signature file.

  • score_mode (str, optional) – The mode of score calculation. Options are ‘raw’, ‘scaled’, ‘log’, ‘log2’, ‘log10’. Defaults to ‘raw’.

  • return_df (bool, optional) – If True, the function returns a DataFrame with signature scores. Otherwise, it adds the scores to data.obs. Defaults to False.

  • n_cpus (int, optional) – Number of CPU cores to use for parallel processing. If None, uses all available cores.

Returns:

A DataFrame containing the signature scores if return_df is True. Otherwise, the function adds the scores to data.obs and returns None.

Return type:

pandas.DataFrame or None

Notes

The function parallelizes the computation of signature scores across the specified number of CPU cores. It prints the number of genes found in both the signatures and the dataset for each signature.

Examples

>>> data = sc.read_h5ad('path/to/your/data.h5ad')  # Assume sc is Scanpy and data is loaded
>>> signatures_input = 'path/to/signatures.txt'
>>> signature_scores = score_all_signatures(data, signatures_input, score_mode='scaled', return_df=True)
>>> signature_scores.head()

cia.investigate.score_signature(data, geneset)

Compute signature scores (from https://doi.org/10.1038/s41467-021-22544-y) for a single gene set against the provided dataset.

This function calculates the signature scores based on the presence (count) and expression (exp) of genes in the specified gene set within the dataset. The score is the product of the count of genes expressed in a given cell and the sum of their expression levels, normalized by the total expression detected in the cell.

Parameters:
  • data (AnnData) – An AnnData object containing the dataset to compute scores for. It is expected to have an attribute raw containing an X matrix (observations x variables) and var_names (gene names).

  • geneset (array_like) – A list or array of gene names for which to compute the signature scores.

Returns:

An array of signature scores, one per observation (cell) in data.

Return type:

numpy.ndarray

Notes

The function first intersects the provided gene set with the gene names available in data.raw.var_names to ensure only relevant genes are considered. If no genes from the gene set are found in the data, the function returns an array of zeros.
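
The score described above can be sketched on a small dense matrix. This is a literal reading of the description, not the library code: the helper name signature_score_sketch is hypothetical, plain lists stand in for data.raw.X, and using the raw number of detected genes (rather than, say, a fraction of the gene set) as the count term is an assumption.

```python
def signature_score_sketch(expression, gene_names, geneset):
    """Score each cell (one row of expression) for a single gene set."""
    # Keep only the gene-set members that are present in the dataset.
    wanted = set(geneset)
    columns = [i for i, gene in enumerate(gene_names) if gene in wanted]

    scores = []
    for cell in expression:
        total = sum(cell)  # total expression detected in the cell
        if not columns or total == 0:
            scores.append(0.0)  # no usable genes, or an empty cell
            continue
        values = [cell[i] for i in columns]
        count = sum(1 for value in values if value > 0)  # gene-set genes detected
        scores.append(count * sum(values) / total)
    return scores


gene_names = ["CD3D", "CD3E", "MS4A1", "ACTB"]
expression = [
    [2, 1, 0, 7],  # expresses both T cell genes present in the data
    [0, 0, 4, 4],  # expresses none of them
]
scores = signature_score_sketch(expression, gene_names, ["CD3D", "CD3E", "CD4"])
print(scores)
```

Here ‘CD4’ is dropped because it is absent from gene_names, and a gene set with no overlap at all yields a score of zero for every cell, matching the behavior noted above.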

Examples

>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> geneset = ['CD3D', 'CD3E', 'CD3G', 'CD4', 'CD8A', 'CD8B']
>>> scores = score_signature(adata, geneset)
>>> print(scores.shape)
(700,)

Raises:

AttributeError – If data does not have the required raw attribute or if raw does not have the X and var_names attributes.

cia.report module

cia.report.compute_classification_metrics(data, classification_obs, ref_obs, unassigned_label='')

Computes the main metrics of classification by comparing labels of cells classified with given methods (methods of interest) to labels assigned with a different one (reference method). Cells labeled as unassigned_label in any method of interest are excluded from the metrics calculation. Additionally, if present, the percentage of unassigned cells for each classification method is calculated and reported.

Parameters:
  • data (anndata.AnnData) – An AnnData object containing the cell data.

  • classification_obs (list of str) – A list of strings specifying the AnnData.obs columns where the labels assigned by the methods of interest are stored.

  • ref_obs (str) – A string specifying the AnnData.obs column where the labels assigned by the reference method are stored.

  • unassigned_label (str, optional) – The label used to mark unassigned cells in the classification columns. Cells with this label will be excluded from the metrics calculation. Default is an empty string, which means no cells are excluded based on their label.

Returns:

report – A pandas.DataFrame containing the overall sensitivity (SE), specificity (SP), precision (PR), accuracy (ACC), F1-score (F1), and, if specified, the percentage of unassigned cells (%UN) for each classification method compared to the reference method.

Return type:

pandas.DataFrame
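
For reference, the reported metrics can be derived from true/false positive and negative counts as sketched below. The aggregation scheme is an assumption: this sketch sums one-vs-rest counts over all labels before taking the ratios, which may differ from what the function does internally. The helper names safe_ratio and overall_metrics are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def overall_metrics(predicted, reference, unassigned_label=""):
    """Compute SE, SP, PR, ACC, F1 and %UN for one classification method."""
    # Cells carrying the unassigned label are excluded from the metrics.
    pairs = [(p, r) for p, r in zip(predicted, reference) if p != unassigned_label]
    classes = {label for pair in pairs for label in pair}

    tp = fp = fn = tn = 0
    for cls in classes:  # one-vs-rest counts, summed over all labels
        for p, r in pairs:
            if p == cls and r == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif r == cls:
                fn += 1
            else:
                tn += 1

    se = safe_ratio(tp, tp + fn)
    sp = safe_ratio(tn, tn + fp)
    pr = safe_ratio(tp, tp + fp)
    return {
        "SE": se,
        "SP": sp,
        "PR": pr,
        "ACC": safe_ratio(tp + tn, tp + tn + fp + fn),
        "F1": safe_ratio(2 * pr * se, pr + se),
        "%UN": 100 * safe_ratio(len(predicted) - len(pairs), len(predicted)),
    }


predicted = ["T", "T", "B", "Unassigned", "B", "T"]
reference = ["T", "T", "B", "B", "T", "T"]
metrics = overall_metrics(predicted, reference, unassigned_label="Unassigned")
print(metrics)
```

In this toy input one of six cells is unassigned, and the remaining five cells contain a single misclassification.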

Example

>>> import scanpy as sc
>>> adata = sc.read_h5ad('your_data_file.h5ad')  # Load your AnnData file
>>> adata.obs['method1'] = ['label1', 'label2', 'label1', 'label2']  # Example classification
>>> adata.obs['method2'] = ['label1', 'label1', 'label2', 'label2']  # Another example classification
>>> adata.obs['reference'] = ['label1', 'label1', 'label2', 'label2']  # Reference classification
>>> compute_classification_metrics(adata, ['method1', 'method2'], 'reference', unassigned_label='Unassigned')

cia.report.group_composition(data, classification_obs, ref_obs, columns_order=None, cmap='Reds', save=None)

Plots a heatmap showing the percentages of cells classified with a given method (method of interest) in cell groups defined with a different one (reference method).

Parameters:
  • data (anndata.AnnData) – An AnnData object containing the cell classification data.

  • classification_obs (str) – A string specifying the AnnData.obs column where the labels assigned by the method of interest are stored.

  • ref_obs (str) – A string specifying the AnnData.obs column where the labels assigned by the reference method are stored.

  • columns_order (list of str, optional) – A list of strings specifying the order of columns in the heatmap.

  • cmap (str or matplotlib.colors.Colormap, optional) – The colormap for the heatmap. Defaults to ‘Reds’.

  • save (str, optional) – A filename to save the heatmap. If provided, the heatmap is saved in the ‘figures’ directory with ‘CIA_’ prefix.

Returns:

A heatmap AxesSubplot object is returned if save is None. Otherwise, the plot is saved to a file, and None is returned.

Return type:

matplotlib.axes.Axes or None

Examples

>>> group_composition(adata, 'method_labels', 'reference_labels')

cia.report.grouped_classification_metrics(data, classification_obs, ref_obs, unassigned_label='')

Computes the main metrics of classification for each group defined by the reference method, comparing the labels from the method of interest with the reference labels. Additionally, if specified, computes the percentage of unlabelled cells for each group.

Parameters:
  • data (anndata.AnnData) – An AnnData object containing the cell data.

  • classification_obs (str) – The AnnData.obs column where the labels assigned by the method of interest are stored.

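  • ref_obs (str) – The AnnData.obs column where the labels assigned by the reference method are stored.

  • unassigned_label (str, optional) – The label used to mark unassigned cells in the classification column. Default is an empty string.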

Returns:

report – A DataFrame containing the per-group sensitivity (SE), specificity (SP), precision (PR), accuracy (ACC), F1-score (F1), and if present, the percentage of unassigned cells (%UN) for the selected classification method.

Return type:

pandas.DataFrame

Example

>>> import scanpy as sc
>>> adata = sc.read_h5ad('your_data_file.h5ad')  # Load your AnnData file
>>> classification_obs = 'predicted_labels'
>>> ref_obs = 'actual_labels'
>>> metrics_report = grouped_classification_metrics(adata, classification_obs, ref_obs)

cia.report.grouped_distributions(data, columns_obs, ref_obs, cmap='Reds', scale_medians=None, save=None)

Plots a heatmap of median values for selected columns in AnnData.obs across cell groups and performs statistical tests to evaluate the differences in distributions. The Wilcoxon test checks whether, within each group, the corresponding signature score is significantly higher than the other signature scores. The Mann-Whitney U test checks whether each signature has its highest score values in the corresponding group compared to all other groups.

Parameters:
  • data (anndata.AnnData) – An AnnData object containing the cell data.

  • columns_obs (list of str) – Column names in AnnData.obs where the values of interest are stored.

  • ref_obs (str) – Column name in AnnData.obs where the cell group labels are stored.

  • cmap (str or matplotlib.colors.Colormap, optional) – Colormap for the heatmap. Defaults to ‘Reds’.

  • scale_medians (str, optional) – How to scale the median values in the heatmap. Options: ‘row-wise’, ‘column-wise’, or None.

  • save (str, optional) – Filename to save the heatmap. If provided, saves the heatmap in ‘figures’ directory with ‘CIA_’ prefix.

Returns:

If save is provided, the heatmap is saved and None is returned. Otherwise, returns the AxesSubplot object.

Return type:

matplotlib.axes.Axes or None

cia.report.plot_group_composition(df, ref_col, comp_col, plot_type='percentage', palette='Set3', show_legend=True)

Plot the composition of each reference group as a horizontal stacked bar plot. The composition can be shown either as raw counts or as percentages.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing the data to be plotted.

  • ref_col (str) – The name of the column representing the reference grouping variable.

  • comp_col (str) – The name of the column representing the grouping to be compared.

  • plot_type (str) – Whether to plot ‘percentage’ or ‘raw’ counts. Defaults to ‘percentage’.

  • palette (str or list) – The color palette to use. Defaults to ‘Set3’.

  • show_legend (bool) – Whether to display the legend on the plot. Defaults to True.

Return type:

AxesSubplot
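
The numbers behind such a stacked bar plot can be sketched without any plotting code. This is an illustration only: the helper name group_composition_table is hypothetical, and plain lists stand in for the two DataFrame columns.

```python
from collections import Counter, defaultdict


def group_composition_table(ref_labels, comp_labels, plot_type="percentage"):
    """Tabulate, for each reference group, how its cells split across comp labels."""
    counts = defaultdict(Counter)
    for ref, comp in zip(ref_labels, comp_labels):
        counts[ref][comp] += 1

    if plot_type == "raw":
        return {ref: dict(counter) for ref, counter in counts.items()}
    if plot_type != "percentage":
        raise ValueError("plot_type must be 'percentage' or 'raw'")
    # Percentages are taken within each reference group, so each bar sums to 100.
    return {
        ref: {comp: 100 * n / sum(counter.values()) for comp, n in counter.items()}
        for ref, counter in counts.items()
    }


ref_labels = ["c0", "c0", "c0", "c0", "c1", "c1"]
comp_labels = ["T", "T", "T", "B", "B", "B"]
table = group_composition_table(ref_labels, comp_labels)
print(table)
```

Each inner dictionary corresponds to one horizontal bar, with one segment per compared label.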

cia.utils module

cia.utils.filter_degs(data, groupby, uns_key='rank_genes_groups', direction='up', logFC=0, scores=None, perc=0, mean=0)

Filters differentially expressed genes (DEGs) obtained with scanpy.tl.rank_genes_groups based on given thresholds.

Parameters:
  • data (anndata.AnnData) – An AnnData object containing the analysis results.

  • groupby (str) – Column in AnnData.obs containing cell group labels.

  • uns_key (str) – Key in AnnData.uns where differential expression analysis results are stored.

  • direction (str) – Specifies if filtering for upregulated (‘up’) or downregulated (‘down’) genes.

  • logFC (float) – Log fold change threshold to filter genes.

  • scores (float, optional) – Z score threshold to filter genes.

  • perc (float) – Threshold on the percentage of cells expressing the gene.

  • mean (float) – Mean expression threshold to filter genes.

Returns:

signatures_dict – Dictionary with cell group names as keys and lists of filtered gene names as values.

Return type:

dict

Raises:

ValueError – If ‘direction’ is not ‘up’ or ‘down’.
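
The thresholding idea can be sketched on a small list of per-gene records. This is not the library's implementation: the helper name filter_degs_sketch and the record keys are hypothetical, the scores threshold is omitted, and the comparison used for ‘down’ (log fold change less than or equal to logFC) is an assumption about how the threshold is interpreted.

```python
def filter_degs_sketch(rows, direction="up", logFC=0.0, perc=0.0, mean=0.0):
    """Build {group: [gene, ...]} from per-gene records that pass all thresholds.

    Each record is a dict with the (hypothetical) keys: group, gene,
    logfoldchange, pct_expressing and mean_expression.
    """
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")

    signatures = {}
    for row in rows:
        if direction == "up":
            passes_fc = row["logfoldchange"] >= logFC
        else:
            # Assumption: for 'down', genes at or below the threshold are kept.
            passes_fc = row["logfoldchange"] <= logFC
        keep = (
            passes_fc
            and row["pct_expressing"] >= perc
            and row["mean_expression"] >= mean
        )
        signatures.setdefault(row["group"], [])  # keep groups with no passing genes
        if keep:
            signatures[row["group"]].append(row["gene"])
    return signatures


rows = [
    {"group": "0", "gene": "CD3D", "logfoldchange": 2.5, "pct_expressing": 80, "mean_expression": 1.2},
    {"group": "0", "gene": "ACTB", "logfoldchange": 0.2, "pct_expressing": 99, "mean_expression": 3.0},
    {"group": "1", "gene": "MS4A1", "logfoldchange": 3.1, "pct_expressing": 60, "mean_expression": 0.9},
    {"group": "1", "gene": "CD79A", "logfoldchange": 4.0, "pct_expressing": 2, "mean_expression": 0.05},
]
filtered = filter_degs_sketch(rows, direction="up", logFC=1, perc=10, mean=0.1)
print(filtered)
```

The values in rows are made up for illustration; in practice they would come from the results stored under uns_key.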

Example

>>> import scanpy as sc
>>> adata = sc.datasets.pbmc68k_reduced()
>>> sc.tl.rank_genes_groups(adata, 'louvain', method='t-test')
>>> filtered_genes = filter_degs(adata, 'louvain', direction='up', logFC=1, perc=10, mean=0.1)
>>> print(filtered_genes['0'])  # Show filtered genes for the first group

cia.utils.save_gmt(signatures_dict, file)

Converts a dictionary of signatures into a GMT file formatted for the score_all_signatures and CIA_classify functions.

Parameters:
  • signatures_dict (dict) – A dictionary having as keys the signature names and as values the gene signatures (lists of gene names).

  • file (str) – File path of the GMT file. See pandas.DataFrame.to_csv documentation.
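
For orientation, a GMT file stores one signature per line as tab-separated fields: name, description, then the genes. The sketch below writes that layout to an in-memory buffer. It is not the library's writer: the helper name write_gmt is hypothetical, and the ‘NA’ description placeholder is an assumption.

```python
import io


def write_gmt(signatures_dict, handle, description="NA"):
    """Write one tab-separated line per signature: name, description, genes."""
    for name, genes in signatures_dict.items():
        handle.write("\t".join([name, description, *genes]) + "\n")


signatures_dict = {"Tcell": ["CD3D", "CD3E"], "Bcell": ["CD79A", "MS4A1"]}
buffer = io.StringIO()
write_gmt(signatures_dict, buffer)
print(buffer.getvalue())
```

Passing an open file object instead of the buffer would write the same lines to disk.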

cia.utils.signatures_similarity(signatures_dict, show='J')

Computes the similarity between gene signatures.

Parameters:
  • signatures_dict (dict) – A dictionary having as keys the signature names and as values the lists of gene names (gene signatures).

  • show (str, optional) – Specifies the metric for showing similarities: ‘J’ for Jaccard index or ‘%’ for percentages of intersection. Default is ‘J’.

Returns:

similarity – A DataFrame containing the similarity of each pair of signatures, with signatures as both rows and columns.

Return type:

pandas.DataFrame

Raises:

ValueError – If ‘show’ is different from ‘J’ or ‘%’.
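
Both metrics can be sketched with Python sets. The Jaccard index is the size of the intersection divided by the size of the union. For ‘%’, this sketch assumes the shared genes are expressed as a percentage of the row signature, which may not match the library's orientation. The helper name signatures_similarity_sketch is hypothetical, a nested dictionary stands in for the returned DataFrame, and signatures are assumed to be non-empty.

```python
def signatures_similarity_sketch(signatures_dict, show="J"):
    """Return {row_signature: {column_signature: similarity}}."""
    if show not in ("J", "%"):
        raise ValueError("show must be 'J' or '%'")

    gene_sets = {name: set(genes) for name, genes in signatures_dict.items()}
    table = {}
    for row_name, row_genes in gene_sets.items():
        table[row_name] = {}
        for col_name, col_genes in gene_sets.items():
            shared = len(row_genes & col_genes)
            if show == "J":
                # Jaccard index: |intersection| / |union|
                table[row_name][col_name] = shared / len(row_genes | col_genes)
            else:
                # Assumption: percentage of the row signature found in the column one.
                table[row_name][col_name] = 100 * shared / len(row_genes)
    return table


signatures = {
    "signature1": ["gene1", "gene2", "gene3"],
    "signature2": ["gene2", "gene3", "gene4"],
    "signature3": ["gene1", "gene5"],
}
jaccard = signatures_similarity_sketch(signatures, show="J")
percent = signatures_similarity_sketch(signatures, show="%")
print(jaccard["signature1"]["signature2"])  # 2 shared genes out of 4 distinct: 0.5
```

The Jaccard table is symmetric, whereas the percentage table generally is not, because each row is normalized by its own signature length.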

Example

>>> signatures = {
...     'signature1': ['gene1', 'gene2', 'gene3'],
...     'signature2': ['gene2', 'gene3', 'gene4'],
...     'signature3': ['gene1', 'gene5']
... }
>>> similarity = signatures_similarity(signatures, show='J')
>>> print(similarity)