s2spy.rgdr.label_alignment

Label alignment tools for RGDR clusters.

Module Contents

Functions

_get_split_cluster_dict(→ dict)

Generate a dictionary of all cluster labels in each split.

_flatten_cluster_dict(→ list[tuple[int, int]])

Flattens a cluster dictionary to a list with (split, cluster) as values.

_init_overlap_df(cluster_labels)

Build an empty dataframe with multi-indexes for clusters and labels.

_calculate_overlap(→ float)

Calculate the overlapping fraction between two clusters, over different splits.

calculate_overlap_table(→ pandas.DataFrame)

Fill the overlap table with the overlap between clusters over different splits.

get_overlapping_clusters(→ set)

Create sets of overlapping clusters.

remove_subsets(→ set)

Remove subsets from the clusters.

remove_overlapping_clusters(→ set)

Remove clusters shared between two different groups of clusters.

name_clusters(→ dict)

Give each cluster a unique name.

create_renaming_dict(→ dict[int, list[tuple[int, str]]])

Create a dictionary that can be used to rename the clusters to the aligned names.

ensure_unique_names(→ dict)

Ensure that in every split, every cluster has a unique name.

_rename_datasets(→ list[xarray.DataArray])

Apply the renaming dictionary to the labels of the clustered data.

rename_labels(→ list[xarray.DataArray])

Return a new object with renamed cluster labels aligned over different splits.

s2spy.rgdr.label_alignment._get_split_cluster_dict(cluster_labels: xarray.DataArray) → dict

Generate a dictionary of all cluster labels in each split.

Parameters:

cluster_labels – DataArray containing all the cluster maps, with the dimension “split” for the different clusters over splits.

Returns:

Dictionary in the form {0: [cluster_a, cluster_b], 1: [cluster_a], …}

s2spy.rgdr.label_alignment._flatten_cluster_dict(cluster_dict: dict) → list[tuple[int, int]]

Flattens a cluster dictionary to a list with (split, cluster) as values.

For example, if the input is {0: [-1, -2, 1], 1: [-1, 1]}, this function will return the following list: [(0, -1), (0, -2), (0, 1), (1, -1), (1, 1)]

Parameters:

cluster_dict – The cluster dictionary which should be flattened

Returns:

A list of the clusters and their splits
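The flattening described above can be sketched in a few lines. This is a minimal illustration that reproduces the documented example, not the library's actual implementation; the function name `flatten_cluster_dict` is chosen here for illustration.

```python
def flatten_cluster_dict(cluster_dict):
    """Turn {split: [cluster, ...]} into a flat list of (split, cluster) tuples."""
    return [
        (split, cluster)
        for split, clusters in cluster_dict.items()
        for cluster in clusters
    ]


# The input from the example above.
flat = flatten_cluster_dict({0: [-1, -2, 1], 1: [-1, 1]})
print(flat)
```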

s2spy.rgdr.label_alignment._init_overlap_df(cluster_labels: xarray.DataArray)

Build an empty dataframe with multi-indexes for clusters and labels.

The structure will be something like the following table:

    split        |  0          1
    label        | -1         -1
    -------------|----------------------
    split label  |
    0     -1     | NaN         0.583333
    1     -1     | 0.333333    NaN

The same multi-index is used for both rows and columns, such that the dataframe can be populated with the overlap between labels from different splits.

Parameters:

cluster_labels – DataArray containing all the cluster maps, with the dimension “split” for the different clusters over splits.

Returns:

A pandas dataframe containing a table

s2spy.rgdr.label_alignment._calculate_overlap(cluster_labels: xarray.DataArray, split_a: int, cluster_a: int, split_b: int, cluster_b: int) → float

Calculate the overlapping fraction between two clusters, over different splits.

The overlap is defined as:

overlap = n_overlapping_cells / total_cells_cluster_a

Parameters:
  • cluster_labels – DataArray containing all the cluster maps, with the dimension “split” for the different clusters over splits.

  • split_a – The index of the split of the first cluster

  • cluster_a – The value of the first cluster in the cluster_labels DataArray.

  • split_b – The index of the split of the second cluster

  • cluster_b – The value of the second cluster in the cluster_labels DataArray.

Returns:

Overlap of the first cluster with the second cluster, as a fraction (0.0 - 1.0)
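Because the denominator is the size of the first cluster only, the overlap is not symmetric. The sketch below illustrates this on two small label maps. It is a plain-Python illustration under stated assumptions, not the library's code: the label maps are nested lists rather than an xarray.DataArray, the helper names `cells_of` and `overlap_fraction` are hypothetical, and returning 0.0 for an empty cluster is an assumption.

```python
def cells_of(label_map, label):
    """Return the set of (row, col) grid cells that carry `label`."""
    return {
        (i, j)
        for i, row in enumerate(label_map)
        for j, value in enumerate(row)
        if value == label
    }


def overlap_fraction(map_a, label_a, map_b, label_b):
    """overlap = n_overlapping_cells / total_cells_cluster_a"""
    cells_a = cells_of(map_a, label_a)
    cells_b = cells_of(map_b, label_b)
    if not cells_a:
        return 0.0  # assumption: an empty cluster overlaps with nothing
    return len(cells_a & cells_b) / len(cells_a)


# Two splits on the same 2 x 3 grid; cluster -1 is larger in split 0.
split_0 = [[-1, -1, 0], [-1, 0, 0]]
split_1 = [[-1, 0, 0], [0, 0, 0]]

a_in_b = overlap_fraction(split_0, -1, split_1, -1)  # 1 of 3 cells shared
b_in_a = overlap_fraction(split_1, -1, split_0, -1)  # 1 of 1 cells shared
```

Here the larger cluster overlaps the smaller one by one third, while the smaller cluster lies entirely inside the larger one.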

s2spy.rgdr.label_alignment.calculate_overlap_table(cluster_labels: xarray.DataArray) → pandas.DataFrame

Fill the overlap table with the overlap between clusters over different splits.

Parameters:

cluster_labels – DataArray containing all the cluster maps, with the dimension “split” for the different clusters over splits.

Returns:

The overlap table with all valid combinations filled in. Invalid combinations of clusters (the cluster itself, or within the same split) will have NaN values.

s2spy.rgdr.label_alignment.get_overlapping_clusters(cluster_labels: xarray.DataArray, min_overlap: float = 0.1) → set

Create sets of overlapping clusters.

Clusters are considered to overlap sufficiently if their overlap is at least the minimum threshold. Note that this is a one-way criterion.

For example, if the overlap table is like the following:

    split        |  0       1
    label        | -1      -1
    -------------|----------------
    split label  |
    0     -1     | NaN     0.05
    1     -1     | 0.20    NaN

Then cluster (split: 0, label: -1) will overlap with cluster (1, -1) by 0.05. This is insufficient to be considered the same cluster. However, cluster (1, -1) does overlap by 0.20 with cluster (0, -1), so they will be considered the same cluster. This situation can arise when one cluster is much bigger than another one.

In this example, the overlapping set will be {frozenset({“0_-1”, “1_-1”})}. Note that with a threshold of 0.05 the output would not change, as the two nested sets {“0_-1”, “1_-1”} and {“1_-1”, “0_-1”} are the same.

Parameters:
  • cluster_labels – DataArray containing all the cluster maps, with the dimension “split” for the different clusters over splits.

  • min_overlap – Minimum overlap (0.0 - 1.0) when clusters are considered to be sufficiently overlapping to belong to the same signal. Defaults to 0.1.

Returns:

A set of (frozen) sets, each set corresponding to a possible combination of clusters that overlap.
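The one-way criterion can be illustrated with a simplified pairwise sketch. It is not the library's implementation: the overlap table is a plain nested dictionary instead of a pandas DataFrame, the function name `overlapping_pairs` is hypothetical, and emitting one frozenset per qualifying pair is an assumption made to keep the example short.

```python
def overlapping_pairs(overlap, min_overlap=0.1):
    """Collect {a, b} whenever a overlaps b by at least `min_overlap`.

    `overlap[a][b]` is the fraction of cluster a covered by cluster b.
    A pair qualifies if the threshold is met in either direction.
    """
    pairs = set()
    for cluster_a, row in overlap.items():
        for cluster_b, fraction in row.items():
            if fraction >= min_overlap:
                # frozenset ignores order, so (a, b) and (b, a) collapse.
                pairs.add(frozenset({cluster_a, cluster_b}))
    return pairs


# The overlap table from the example above (NaN entries left out).
overlap = {
    "0_-1": {"1_-1": 0.05},
    "1_-1": {"0_-1": 0.20},
}

default_result = overlapping_pairs(overlap)           # threshold 0.1
lower_result = overlapping_pairs(overlap, 0.05)       # both directions pass
strict_result = overlapping_pairs(overlap, 0.5)       # neither direction passes
```

With the default threshold only the 0.20 entry qualifies, yet the result is the same single set as with a threshold of 0.05.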

s2spy.rgdr.label_alignment.remove_subsets(clusters: set) → set

Remove subsets from the clusters.

For example: {{“A”}, {“A”, “B”}} will become {{“A”, “B”}}, as “A” is a subset of the bigger cluster.
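A minimal sketch of this subset removal, assuming the clusters are stored as frozensets (the function name `drop_subsets` is chosen here for illustration and is not the library's code):

```python
def drop_subsets(clusters):
    """Keep only the sets that are not a proper subset of another set."""
    return {
        group
        for group in clusters
        if not any(group < other for other in clusters)  # `<` is proper subset
    }


clusters = {frozenset({"A"}), frozenset({"A", "B"})}
result = drop_subsets(clusters)
```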

s2spy.rgdr.label_alignment.remove_overlapping_clusters(clusters: set) → set

Remove clusters shared between two different groups of clusters.

Largest cluster gets priority.

For example: {{“A”, “D”}, {“A”, “B”, “C”}} will become {{“D”}, {“A”, “B”, “C”}}
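One way to give the largest group priority is to process the groups from largest to smallest and strip any member that an earlier group has already claimed. The sketch below follows that idea and reproduces the documented example. The function name `remove_shared_members`, the tie-breaking order for groups of equal size, and the dropping of groups that end up empty are all assumptions, not confirmed library behaviour.

```python
def remove_shared_members(clusters):
    """Remove members shared between groups; the largest group keeps them."""
    claimed = set()
    result = set()
    # Largest group first; sort members as a tie-breaker so runs are repeatable.
    for group in sorted(clusters, key=lambda g: (-len(g), sorted(g))):
        remaining = frozenset(group - claimed)
        if remaining:  # assumption: groups left empty are dropped
            result.add(remaining)
        claimed |= group
    return result


clusters = {frozenset({"A", "D"}), frozenset({"A", "B", "C"})}
result = remove_shared_members(clusters)
```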

s2spy.rgdr.label_alignment.name_clusters(clusters: set) → dict

Give each cluster a unique name.

Note: the first 26 names will be from A - Z. If more than 26 clusters are present, these will get names with two uppercase letters (AA - ZZ).

Parameters:

clusters – A set of different clusters. Each element is a list of clusters and their splits.

Returns:

A dictionary in the form {cluster_name0: cluster0, cluster_name1: cluster1}
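The naming scheme from the note above (A - Z, then AA - ZZ) can be generated as follows. This sketch only produces the sequence of names; how the library decides which cluster receives which name is not covered here, and the function name `cluster_names` is hypothetical.

```python
from itertools import product
from string import ascii_uppercase


def cluster_names(n):
    """Return the first n names: A..Z followed by AA..ZZ."""
    singles = list(ascii_uppercase)
    doubles = ["".join(pair) for pair in product(ascii_uppercase, repeat=2)]
    names = singles + doubles
    if n > len(names):
        raise ValueError(f"At most {len(names)} names are available, got {n}.")
    return names[:n]


first_three = cluster_names(3)
around_z = cluster_names(28)[25:]
```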

s2spy.rgdr.label_alignment.create_renaming_dict(aligned_clusters: dict) → dict[int, list[tuple[int, str]]]

Create a dictionary that can be used to rename the clusters to the aligned names.

Parameters:

aligned_clusters – A dictionary containing the different splits, and the mapping of RGDR clusters to new names.

Returns:

A dictionary with the structure {split: [(old_name0, new_name0), (old_name1, new_name1)]}.

s2spy.rgdr.label_alignment.ensure_unique_names(renaming_dict: dict[int, list[tuple[int, str]]]) → dict

Ensure that in every split, every cluster has a unique name.

The function finds the non-unique names within each split and renames them by adding a number. For example, if three clusters in the first split have the name “C”, the new names will be “C”, “C1” and “C2”.

If renaming_dict is the following:

{0: [(-1, “A”), (1, “B”)], 1: [(-1, “A”), (-2, “A”)]}

The renamed dictionary will be:

{0: [(-1, “A1”), (1, “B”)], 1: [(-1, “A1”), (-2, “A2”)]}

Parameters:

renaming_dict – Renaming dictionary with non-unique names.

Returns:

Renaming dictionary with only unique names

s2spy.rgdr.label_alignment._rename_datasets(rgdr_list: list[s2spy.rgdr.rgdr.RGDR], clustered_data: list[xarray.DataArray], renaming_dict: dict) → list[xarray.DataArray]

Apply the renaming dictionary to the labels of the clustered data.

Parameters:
  • rgdr_list – List of RGDR objects that were used to fit and transform the data.

  • clustered_data – List of the RGDR-transformed data. This can either be the training data or the test data.

  • renaming_dict – Dictionary containing the mapping {old_label: new_label}

Returns:

A list of the input clustered data, with the labels renamed.

s2spy.rgdr.label_alignment.rename_labels(rgdr_list: list[s2spy.rgdr.rgdr.RGDR], clustered_data: list[xarray.DataArray]) → list[xarray.DataArray]

Return a new object with renamed cluster labels aligned over different splits.

To help users compare the clustering over different splits, this function tries to match clusters across splits and gives clusters that are in the same region the same name (e.g. “A”). The clusters themselves are not changed; only the labels are renamed.

Parameters:
  • rgdr_list – List of RGDR objects that were used to fit and transform the data.

  • clustered_data – List of the RGDR-transformed datasets. This can either be the training data or the test data.

Returns:

A list of the input clustered data, with the labels renamed.