# Storage and Retrieval: Memory Search and Indexing (Java-focused)

## Java Memory Management

Java has automatic memory management that automatically performs *garbage collection*, which auto-release objects that are no longer in use to free up memory.

## Java Memory Structure

- Stack: primitive types (int, char, …) and holds a reference to heap objects.
- Heap: actual objects, objects store their value inside the heap.

**Stack vs Heap**

- Stack: memory allocation is automatic, size is flexible but limited, and variables inside the stack remain there as long as the function that created them is running; If stack space is full, Java throws
**java.lang.StackOverFlowError** - Heap: not auto managed, prone to memory leak
**java.lang.OutOfMemoryError**

We have only one heap memory for each JVM process. Therefore, Heap is a shared part of memory regardless of how many threads are running inside the Java.

**Algorithm Complexity**

no matter how large the data is, will always take the same time to return.*O(1)*— constant time:grows much more slowly than the data size, as the base is always 2 in computer science, if n is 8, we would have our n 3 times because Log base 2 of 8 is 3, like*O(log n)*— logarithmic time:*binary search*because it doesn’t look at the majority of the elements for large collections.performance grows linearly with respect to the size of the collection, like*O(n)*— linear time:*looping through a list*and returning the number of elements matching a string will take linear time.repeatedly divide a set of data in half, then processes those halves independently; if the n is 8, then this algorithm will run 8 * log(8) = 8 * 3 = 24 times; dividing the unsorted list into sub-lists until each sublist contains 1 element,*O(n log n)*— loglinear time:*Merge sort, Heap sort, and Quick sort*.code that has*O(n2)*— n squared time:*nested loops*where each loop goes through the data takes*n*squared time

**Search Improvement Methods**

**Hash Tables**

- maps the original data (value) to a smaller size data (key), and a hash function applies a set of data (2D data), we get a 2D hash table.
- The hash function usually converts the large strings into smaller strings with equal sizes, it compresses the data and enables the system to rapidly access the data.

Hash table as a data structure that associates a key to a value to enable the algorithm very quickly lookup for the original data (with an index).

**BitCask Hash Table**

- supports hash index and offers high-performance read and write
- all information will be stored inside memory, but with one single disk seek it can read the required data from the disk into memory if there is a need.
- very useful when we have lots of updates in our key-value pairs.

**Tree Structures**

Tree is a sub type a graph data structure, which has a root and child nodes (but without cycle)

*Binary Tree*

- Each root node in the tree has only two child nodes, and all nodes on its left size should be smaller than the nodes on the right side.
**Searching**information in a binary tree usually takes*O (log n)***Inserting**a new key is, but when the table is*O(2n)***not empty**since the algorithm navigates through a correct branch to fit the new key into its correct position or perhaps update a node and its children structure*O(n2)***full binary tree —**means that each node of the table is either zero or has two children.**balanced binary tree —**search cost and insert cost is guaranteed to be logarithmic*O(n log n).*- Useful when the data objects (or keys) are having a
**sort of order**inside the tree, and the structure of data is**static**and**not changing**.

*2–3 Search Tree*

- nodes could hold more than one key — there are
*n*keys in a node, it always has*n+1*children. - Search and insert costs are both guaranteed to be
*O(log n)*

*B-Tree (Balancing Factor Tree)*

- a type of a 2–3 search tree, a node can have more than two children.
- all of the B-Tree leaves must be at the same depth.

*Trie, Prefix Tree, Radius Tree*

- Trie nodes are assumed to be
**ordered from the root**and the root node of the trie is always**empty,**they hold a**partial key**. - Trie can be used to store words, the path from the root to the last child presents a word, and each node stores a character
- As the tree is getting bigger, the tree holds more words, and it requires less time to find the right place for the newly inserted data object
- Trie can find the word in the dictionary in
**O(n)**time, n is the length of words; it can insert the word in**O(n)**time.

**Bit Manipulations & Compression**

- a bit is the smallest unit of information that can be either 0 or 1.
- 8 bit constitute a larger data called a byte, 1024 bytes called kilo byte, 1024 kilo byte called megabyte, 1024 megabyte called giga byte and 1024 giga byte called tera byte,…
- transforming the data from one format into another format that is
*smaller => occupy less space => reduce the latency* - usually, the process is done on a bit level and the data converted into a binary format

*Bitmap Index*

- useful for data with low cardinality, which means the diversity of data is limited, for instance, gender could be either ‘male’ or ‘female’.

**Huffman Encoding**

- lossless compression technique (compress much better)
- instead of using ASCII codes (8 bits) to store each character, we can use three bits to present all characters
- useful when we have lots of repetitive information, like words inside the text document, pixels inside a picture, etc.
- operates based on creating a binary tree, and assigns values to each character based on their position inside the tree

**Sliding Window**

**MinHashing**

- MinHash estimates similar sets of information objects — could be the image, time series (audio signal, heart rate), or anything.
- MinHash is using Jaccard. It maps each set of information into
*signature vectors*which are vectors of fixed length. This enables the Jaccard to identify how similar are two sets*without enumerating*each of their elements.

**BloomFilter**

- A data structure tells us if a particular data existed in a set or not, by checking the bits which are set to 1.
- It is memory efficient and fast, the base structure is a
**bit vector**(each cell is a bit).