Lifelong Intelligence Beyond the Edge using Hyperdimensional Computing: LifeDH

cover
24 Jul 2024

Authors:

(1) Xiaofan Yu, University of California San Diego, La Jolla, California, USA (x1yu@ucsd.edu);

(2) Anthony Thomas, University of California San Diego, La Jolla, California, USA (ahthomas@ucsd.edu);

(3) Ivannia Gomez Moreno, CETYS University, Campus Tijuana, Tijuana, Mexico (ivannia.gomez@cetys.edu.mx);

(4) Louis Gutierrez, University of California San Diego, La Jolla, California, USA (l8gutierrez@ucsd.edu);

(5) Tajana Šimunić Rosing, University of California San Diego, La Jolla, USA (tajana@ucsd.edu).

Abstract and 1. Introduction

2 Related Work

3 Background on HDC

4 Problem Definition

5 LifeDH

6 Variants of LifeHD

7 Evaluation of LifeHD

8 Evaluation of LifeHD semi and LifeHDa

9 Discussions and Future Works

10 Conclusion, Acknowledgments, and References

5 LIFEHD

In this section, we present the design of LifeHD, the first unsupervised HDC framework for lifelong learning in general edge IoT applications. Compared to operating in the original data space, HDC improves pattern separability through sparsity and high dimensionality, making it more resilient against catastrophic forgetting [52]. LifeHD preserves the advantages of HDC in computational efficiency and lifelong learning, while handling the input of unlabeled streaming data, which has not been achieved in previous work [18, 21, 22, 41, 65].

5.1 LifeHD Overview

Fig. 4 gives an overview of how LifeHD works. The first step is HDC encoding of data into hypervectors as described in Sec. 3. Training samples 𝑋 are organized into batches of size 𝑏𝑆𝑖𝑧𝑒 and input into an optional fixed NN for feature extraction (e.g. for images and sound) and the encoding module. The encoded hypervectors πœ™ (𝑋) are input to LifeHD’s two-tier memory design inspired by cognitive science studies [5], consisting of working memory and long-term memory. This memory system intelligently and dynamically manages historical patterns, stored as hypervectors and referred to as cluster HVs. As shown in Fig. 4, the working memory is designed with three components: novelty detection, cluster HV update and cluster HV merge. πœ™ (𝑋) is first input into novelty detection step (1). An insertion to the cluster HVs is made if a novelty flag is raised, otherwise πœ™ (𝑋) updates the existing cluster HVs (2). The third component, cluster HV merge (3), retrieves the cluster HVs from long-term memory, and merges similar cluster HVs into a supercluster via a novel spectral clustering-based merging algorithm [59]. The interaction between working and long-term memory happens as commonly encountered cluster HVs are copied to long-term memory, which we call consolidation (4). Finally, when the size

Figure 4: The end-to-end algorithm flow of LifeHD.

Table 1: List of important notations.

limit of either working or long-term memory is reached, the least recently used cluster HVs are forgotten (5).

ecently used cluster HVs are forgotten (β—‹5 ). All modules in LifeHD work collaboratively, making it adaptive and robust to continuously changing streams without relying on any form of prior knowledge. For example, in scenarios of distribution drift, LifeHD may generate new cluster HVs upon encountering drifted samples initially, which can later be merged into coarse clusters. This approach ensures that LifeHD can efficiently capture and retain historical patterns.

In the following, we discuss more details about the major components of LifeHD: novelty detection (Sec. 5.2), cluster HV update (Sec. 5.3), and cluster HV merging (Sec. 5.4). We summarize the important notations used in this paper in Table 1.

5.2 Novelty Detection

The initial novelty detection step (1 in Fig. 4) is crucial for identifying emerging patterns in the environment. SupposeM = {π‘š1, ...,π‘šπ‘€} is the set of cluster HVs stored in the working memory. We gauge the "radius" of each cluster by tracking two scalars for each cluster HV 𝑖: πœ‡π‘– and πœŽΛ†π‘– , which represent the mean cosine difference and standard difference between the cluster HV and its assigned inputs. Given πœ™ (𝑋), we first identify the most similar cluster HV, denoted by 𝑗. LifeHD marks πœ™ (𝑋) as β€œnovel" if it substantially differs from its nearest cluster HV. Specifically, this dissimilarity is measured by comparing cos(πœ™ (𝑋),π‘šπ‘—) with a threshold based on the historical distance distribution of cluster HV 𝑗:

The hyperparameter 𝛾 fine-tunes the sensitivity to novelties.

LifeHD recognizes new πœ™ (𝑋) as prototypes and inserts them into the working memory. When reaching its size limit 𝑀, the working memory experiences forgetting (β—‹5 in Fig. 4). The least recently used (LRU) cluster HV, represented by πΏπ‘…π‘ˆ = argmin𝑀 𝑖=1 𝑝𝑖 , is replaced. Here 𝑝 corresponds to the latest batch index where the cluster HV was accessed. A similar forgetting mechanism is configured for the long-term memory, where the last batch accessed is marked with π‘ž.

5.3 Cluster HV Update

If novelty is not detected, indicating that πœ™ (𝑋) closely matches cluster HV 𝑗, we proceed to update the cluster HV and its associated information (2 in Fig. 4). This update process involves bundling πœ™ (𝑋) with cluster HV π‘šπ‘— , akin to how class hypervectors are updated as described in Sec. 3, and updated πœ‡π‘— and πœŽΛ†π‘— with their moving average

The hyperparameter 𝛼 adjusts the balance between historical and recent inputs, where a higher 𝛼 gives more weight to recent samples. Properly maintaining πœ‡π‘— and πœŽΛ†π‘— is vital for tracking the β€œradius” of each cluster HV, affecting future novelty detection. We also increase the hit frequency β„Žπ‘–π‘‘π‘— and refresh 𝑝𝑗 with current batch index 𝑖𝑑π‘₯. β„Žπ‘–π‘‘π‘— is further used to compared with a predetermined threshold β„Žπ‘–π‘‘π‘‘β„Ž to decide when a working memory cluster HV appears sufficiently frequently to be consolidated to long-term memory (4 in Fig. 4). 𝑝𝑗 determines forgetting as described in the previous section. With this lightweight approach, LifeHD continually records temporal cluster HVs from the environment, while the most prominent cluster HVs are transferred to long-term memory.

Figure 5: An intuitive visualization of cluster HV merging.

5.4 Cluster HV Merging

Cluster HV merge (3 in Fig. 4) has the dual benefit of reducing memory use and of elucidating underlying similarity structure in the data. Intuitively, a group of cluster HVs can be merged if they are similar to each other and dissimilar from other cluster HVs. For instance, one might merge the cluster HVs for Bulldog and Chihuahua into a single β€œDog” cluster HV, that remains distinct from the cluster HV for β€œTabby Cat”.

To merge the cluster HVs, we first construct a similarity graph defined over the cluster HVs from the long-term memory. The cluster HVs correspond to nodes, and a pair of cluster HVs are connected by an edge if they are sufficiently similar. We then merge the cluster HVs by computing a particular type of cut in the graph in a manner similar to spectral clustering [40]. This graph based formalism for clustering is able to capture complex types of cluster geometry and often substantially outperforms simpler approaches like K-Means [59]. We detail the steps of cluster HV merging in LifeHD below, while Fig. 5 offers an illustrative overview.

Step 2: Decomposition. We compute the Laplacianπ‘Š = 𝐷 βˆ’π΄, where 𝐷 is the diagonal matrix in which 𝐷𝑖𝑖 = Í 𝑗 𝐴𝑖𝑗 . We then compute the eigenvalues πœ†1, .., πœ†πΏ, sorted in increasing order, and eigenvectors 𝜈1, ..., 𝜈𝐿 of π‘Š.

Step 3: Grouping. We infer π‘˜ = maxπ‘–βˆˆ [𝐿] πœ†π‘– ≀ 𝑔𝑒𝑏, and merge the cluster HVs by running K-Means on 𝜈1, ..., πœˆπ‘˜ . The upperbound 𝑔𝑒𝑏 is a hyperparameter that adjusts the granularity of merging, with a smaller 𝑔𝑒𝑏 leading to smaller π‘˜ thus encouraging merging more aggressively.

Our merging approach is formally grounded, as discussed in [59]. It is a well-known fact that the eigenvectors of π‘Š encode information about the connected components of G. When G has π‘˜ connected components, the eigenvalues πœ†1 = πœ†2 = ... = πœ†π‘˜ = 0. To recover these components, K-Means clustering on 𝜈1, ..., πœˆπ‘˜ can be employed, as explained in [59]. However, practical scenarios may have a few inter-component edges that should ideally be distinct. For instance, when the similarity threshold is imprecisely set, erroneous edges may appear in the graph, causing πœ†1, ..., πœ†π‘˜ to be only approximately zero. Our merging approach is designed to handle this situation by introducing 𝑔𝑒𝑏. The cluster HV merging is evaluated every π‘“π‘šπ‘’π‘Ÿπ‘”π‘’ batches, where π‘“π‘šπ‘’π‘Ÿπ‘”π‘’ is a hyperparameter that controls the trade-off between merging performance and computational latency. Both 𝑔𝑒𝑏 and π‘“π‘šπ‘’π‘Ÿπ‘”π‘’ are analyzed in Sec. 7.8, along with other key hyperparameters in LifeHD.

This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.