s2spy.rgdr.label_alignment
==========================

.. py:module:: s2spy.rgdr.label_alignment

.. autoapi-nested-parse::

   Label alignment tools for RGDR clusters.


Functions
---------

.. autoapisummary::

   s2spy.rgdr.label_alignment._get_split_cluster_dict
   s2spy.rgdr.label_alignment._flatten_cluster_dict
   s2spy.rgdr.label_alignment._init_overlap_df
   s2spy.rgdr.label_alignment._calculate_overlap
   s2spy.rgdr.label_alignment.calculate_overlap_table
   s2spy.rgdr.label_alignment.get_overlapping_clusters
   s2spy.rgdr.label_alignment.remove_subsets
   s2spy.rgdr.label_alignment.remove_overlapping_clusters
   s2spy.rgdr.label_alignment.name_clusters
   s2spy.rgdr.label_alignment.create_renaming_dict
   s2spy.rgdr.label_alignment.ensure_unique_names
   s2spy.rgdr.label_alignment._rename_datasets
   s2spy.rgdr.label_alignment.rename_labels


Module Contents
---------------

.. py:function:: _get_split_cluster_dict(cluster_labels: xarray.DataArray) -> dict

   Generate a dictionary of all cluster labels in each split.

   :param cluster_labels: DataArray containing all the cluster maps, with the dimension
                          "split" for the different clusters over splits.

   :returns: Dictionary in the form {0: [cluster_a, cluster_b], 1: [cluster_a], ...}
   :rtype: dict


.. py:function:: _flatten_cluster_dict(cluster_dict: dict) -> list[tuple[int, int]]

   Flattens a cluster dictionary to a list with (split, cluster) as values.

   For example, if the input is {0: [-1, -2, 1], 1: [-1, 1]}, this function will return
   the following list: [(0, -1), (0, -2), (0, 1), (1, -1), (1, 1)]

   :param cluster_dict: The cluster dictionary which should be flattened

   :returns: A list of the clusters and their splits
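
   The flattening step can be illustrated with a minimal sketch. The helper name
   ``flatten_cluster_dict_sketch`` is hypothetical and only mirrors the documented
   behaviour; it is not necessarily how the library implements it.

   .. code-block:: python

      def flatten_cluster_dict_sketch(cluster_dict):
          """Flatten {split: [labels]} into a list of (split, label) tuples."""
          return [
              (split, label)
              for split, labels in cluster_dict.items()
              for label in labels
          ]

      # Same input as in the example above.
      flat = flatten_cluster_dict_sketch({0: [-1, -2, 1], 1: [-1, 1]})
      print(flat)  # [(0, -1), (0, -2), (0, 1), (1, -1), (1, 1)]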


.. py:function:: _init_overlap_df(cluster_labels: xarray.DataArray)

   Build an empty dataframe with multi-indexes for clusters and labels.

   The structure will be something like the following table::

      split       |        0         1
      label       |       -1        -1
      ------------|-------------------
      split label |
      0     -1    |      NaN  0.583333
      1     -1    | 0.333333       NaN

   The same multi-index is used for both rows and columns, such that the dataframe can
   be populated with the overlap between labels from different splits.

   :param cluster_labels: DataArray containing all the cluster maps, with the dimension
                          "split" for the different clusters over splits.

   :returns: A pandas dataframe whose rows and columns share the same (split, label)
             multi-index, ready to be populated with overlap values.


.. py:function:: _calculate_overlap(cluster_labels: xarray.DataArray, split_a: int, cluster_a: int, split_b: int, cluster_b: int) -> float

   Calculate the overlapping fraction between two clusters, over different splits.

   The overlap is defined as::

      overlap = n_overlapping_cells / total_cells_cluster_a

   :param cluster_labels: DataArray containing all the cluster maps, with the dimension
                          "split" for the different clusters over splits.
   :param split_a: The index of the split of the first cluster
   :param cluster_a: The value of the first cluster in the cluster_labels DataArray.
   :param split_b: The index of the split of the second cluster
   :param cluster_b: The value of the second cluster in the cluster_labels DataArray.

   :returns: Overlap of the first cluster with the second cluster, as a fraction (0.0 - 1.0)
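
   Because the denominator is the size of the first cluster, the measure is not
   symmetric. The sketch below shows this with plain Python sets. It assumes each
   cluster is represented as a set of hypothetical ``(lat, lon)`` grid cells, whereas
   the real function works on the ``cluster_labels`` DataArray; the helper name is
   illustrative only.

   .. code-block:: python

      def overlap_fraction_sketch(cells_a, cells_b):
          """Fraction of cluster A's cells that are also part of cluster B."""
          if not cells_a:
              raise ValueError("cluster A has no cells")
          return len(cells_a & cells_b) / len(cells_a)

      # Hypothetical grid cells for two clusters found in different splits.
      small = {(0, 0), (0, 1)}
      large = {(0, 1), (0, 2), (1, 1), (1, 2)}

      small_in_large = overlap_fraction_sketch(small, large)  # 1 shared cell / 2 cells
      large_in_small = overlap_fraction_sketch(large, small)  # 1 shared cell / 4 cells
      print(small_in_large, large_in_small)  # 0.5 0.25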


.. py:function:: calculate_overlap_table(cluster_labels: xarray.DataArray) -> pandas.DataFrame

   Fill the overlap table with the overlap between clusters over different splits.

   :param cluster_labels: DataArray containing all the cluster maps, with the dimension
                          "split" for the different clusters over splits.

   :returns: The overlap table with all valid combinations filled in. Invalid
             combinations of clusters (a cluster with itself, or two clusters within
             the same split) will have NaN values.


.. py:function:: get_overlapping_clusters(cluster_labels: xarray.DataArray, min_overlap: float = 0.1) -> set

   Create sets of overlapping clusters.

   Clusters will be considered to have sufficient overlap if they overlap at least by
   the minimum threshold. Note that this is a one-way criterion.

   For example, if the overlap table is like the following::

      split       |    0       1
      label       |   -1      -1
      ------------|-------------
      split label |
      0     -1    |  NaN    0.05
      1     -1    | 0.20     NaN

   Then cluster (split: 0, label: -1) will overlap with cluster (1, -1) by 0.05. This
   is insufficient to be considered the same cluster. However, cluster (1, -1) does
   overlap by 0.20 with cluster (0, -1), so they *will* be considered the same cluster.
   This situation can arise when one cluster is much bigger than another one.

   In this example, the overlapping set will be {frozenset({"0_-1", "1_-1"})}.
   Note that with a threshold of 0.05 the output would not change, as the
   two nested sets {"0_-1", "1_-1"} and {"1_-1", "0_-1"} are the same.

   :param cluster_labels: DataArray containing all the cluster maps, with the dimension
                          "split" for the different clusters over splits.
   :param min_overlap: Minimum overlap (0.0 - 1.0) when clusters are considered to be
                       sufficiently overlapping to belong to the same signal. Defaults to 0.1.

   :returns: A set of (frozen) sets, each set corresponding to a possible combination of
             clusters that overlap.
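
   The one-way criterion can be sketched as follows. This is an illustration under
   stated assumptions, not the library's implementation: the overlap table is modelled
   as a nested dictionary keyed by ``"split_label"`` strings (NaN entries are simply
   left out), and ``overlapping_clusters_sketch`` is a hypothetical helper name.

   .. code-block:: python

      def overlapping_clusters_sketch(overlap_table, min_overlap=0.1):
          """Group each row's cluster with every cluster it sufficiently overlaps."""
          clusters = set()
          for row_name, row in overlap_table.items():
              partners = {
                  name for name, overlap in row.items() if overlap >= min_overlap
              }
              if partners:
                  # Frozen sets are hashable, so duplicate groups collapse into one.
                  clusters.add(frozenset(partners | {row_name}))
          return clusters

      # The table from the example above, without the NaN entries.
      table = {
          "0_-1": {"1_-1": 0.05},
          "1_-1": {"0_-1": 0.20},
      }
      result = overlapping_clusters_sketch(table, min_overlap=0.1)
      print(result == {frozenset({"0_-1", "1_-1"})})  # True

   Only the second row passes the threshold, yet both clusters end up in the same
   group, because the row's own cluster is added to its partners.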


.. py:function:: remove_subsets(clusters: set) -> set

   Remove subsets from the clusters.

   For example: {{"A"}, {"A", "B"}} will become {{"A", "B"}}, as "A" is a subset of the
   bigger cluster.
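
   A minimal sketch of this filtering, assuming the input is a set of frozensets (the
   helper name ``remove_subsets_sketch`` is hypothetical):

   .. code-block:: python

      def remove_subsets_sketch(clusters):
          """Keep only the groups that are not a proper subset of another group."""
          return {
              cluster
              for cluster in clusters
              if not any(cluster < other for other in clusters)
          }

      groups = {frozenset({"A"}), frozenset({"A", "B"})}
      kept = remove_subsets_sketch(groups)
      print(kept == {frozenset({"A", "B"})})  # True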


.. py:function:: remove_overlapping_clusters(clusters: set) -> set

   Remove clusters shared between two different groups of clusters.

   Largest cluster gets priority.

   For example: {{"A", "D"}, {"A", "B", "C"}} will become {{"D"}, {"A", "B", "C"}}
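
   One way to realise the "largest cluster gets priority" rule is to visit the groups
   from largest to smallest and strip members that were already claimed. The sketch
   below is an assumption about the approach, with a hypothetical helper name and an
   alphabetical tie-break added for determinism; the library may resolve ties
   differently.

   .. code-block:: python

      def remove_overlapping_sketch(clusters):
          """Give shared members to the largest group and drop them elsewhere."""
          claimed = set()
          result = set()
          # Largest groups first; sort members to break ties deterministically.
          ordered = sorted(clusters, key=lambda group: (-len(group), sorted(group)))
          for group in ordered:
              remaining = frozenset(group - claimed)
              if remaining:
                  result.add(remaining)
                  claimed |= remaining
          return result

      groups = {frozenset({"A", "D"}), frozenset({"A", "B", "C"})}
      cleaned = remove_overlapping_sketch(groups)
      print(cleaned == {frozenset({"D"}), frozenset({"A", "B", "C"})})  # True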


.. py:function:: name_clusters(clusters: set) -> dict

   Give each cluster a unique name.

   Note: the first 26 names will be from A - Z. If more than 26 clusters are present,
   these will get names with two uppercase letters (AA - ZZ).

   :param clusters: A set of different clusters. Each element is a list of clusters and
                    their splits.

   :returns: A dictionary in the form {cluster_name0: cluster0, cluster_name1: cluster1}
   :rtype: dict
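
   The naming scheme (A - Z, then AA - ZZ) can be sketched with the standard library.
   The helper name and the sorting of the input are assumptions made to keep the
   example deterministic; the order in which the library assigns names to clusters may
   differ.

   .. code-block:: python

      import itertools
      import string

      def name_clusters_sketch(clusters):
          """Assign the names A..Z, then AA..ZZ, to the given clusters."""
          letters = string.ascii_uppercase
          names = itertools.chain(
              letters,
              ("".join(pair) for pair in itertools.product(letters, repeat=2)),
          )
          # Sets are unordered, so sort the clusters to get a reproducible naming.
          ordered = sorted(clusters, key=lambda cluster: sorted(cluster))
          return dict(zip(names, ordered))

      # 28 hypothetical single-member clusters, enough to reach two-letter names.
      clusters = {frozenset({f"0_{i}"}) for i in range(28)}
      names = list(name_clusters_sketch(clusters))
      print(names[0], names[25], names[26], names[27])  # A Z AA AB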


.. py:function:: create_renaming_dict(aligned_clusters: dict) -> dict[int, list[tuple[int, str]]]

   Create a dictionary that can be used to rename the clusters to the aligned names.

   :param aligned_clusters: A dictionary containing the different splits, and the mapping
                            of RGDR clusters to new names.

   :returns: A dictionary with the structure
             {split: [(old_name0, new_name0), (old_name1, new_name1)]}.
   :rtype: dict
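
   The inversion from "name to clusters" into "split to (old, new) pairs" can be
   sketched as below. This assumes that ``aligned_clusters`` maps each new name to a
   collection of ``"split_label"`` strings such as ``"0_-1"``; that input format and
   the helper name are assumptions for illustration.

   .. code-block:: python

      def create_renaming_dict_sketch(aligned_clusters):
          """Turn {new_name: {"split_label", ...}} into {split: [(label, new_name)]}."""
          renaming = {}
          for new_name, members in aligned_clusters.items():
              for member in sorted(members):
                  # "0_-1" splits into split "0" and label "-1".
                  split, label = member.split("_")
                  renaming.setdefault(int(split), []).append((int(label), new_name))
          return renaming

      aligned = {"A": {"0_-1", "1_-1"}, "B": {"0_1"}}
      renaming = create_renaming_dict_sketch(aligned)
      print(renaming)  # {0: [(-1, 'A'), (1, 'B')], 1: [(-1, 'A')]}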


.. py:function:: ensure_unique_names(renaming_dict: dict[int, list[tuple[int, str]]]) -> dict

   Ensure that in every split, every cluster has a unique name.

   The function finds the non-unique names within each split, and renames these by
   adding a number. For example, if there are three clusters in the first split with
   the name "C", the new names will be "C", "C1" and "C2".

   If renaming_dict is the following::

      {0: [(-1, "A"), (1, "B")], 1: [(-1, "A"), (-2, "A")]}

   The renamed dictionary will be::

      {0: [(-1, "A1"), (1, "B")], 1: [(-1, "A1"), (-2, "A2")]}

   :param renaming_dict: Renaming dictionary with non-unique names.

   :returns: Renaming dictionary with only unique names


.. py:function:: _rename_datasets(rgdr_list: list[s2spy.rgdr.rgdr.RGDR], clustered_data: list[xarray.DataArray], renaming_dict: dict) -> list[xarray.DataArray]

   Apply the renaming dictionary to the labels of the clustered data.

   :param rgdr_list: List of RGDR objects that were used to fit and transform the data.
   :param clustered_data: List of the RGDR-transformed data. This can either be the
                          training data or the test data.
   :param renaming_dict: Dictionary containing the mapping {old_label: new_label}

   :returns: A list of the input clustered data, with the labels renamed.


.. py:function:: rename_labels(rgdr_list: list[s2spy.rgdr.rgdr.RGDR], clustered_data: list[xarray.DataArray]) -> list[xarray.DataArray]

   Return a new object with renamed cluster labels aligned over different splits.

   To help users compare the clustering over different splits, this function tries
   to match the clusters over different splits, and gives clusters that are in the same
   region the same name (e.g. "A"). The clusters themselves are not changed; only the
   labels are renamed.

   :param rgdr_list: List of RGDR objects that were used to fit and transform the data.
   :param clustered_data: List of the RGDR-transformed datasets. This can either be the
                          training data or the test data.

   :returns: A list of the input clustered data, with the labels renamed.