Storage and Retrieval: Memory Search and Indexing (Java-focused)
Java Memory Management
Java has automatic memory management: the garbage collector periodically releases objects that are no longer reachable, freeing up heap memory.
Java Memory Structure
- Stack: holds primitive values (int, char, …) and references to objects on the heap.
- Heap: holds the actual objects; an object's data is stored on the heap.
Stack vs Heap
- Stack: memory allocation and deallocation are automatic; its size is limited, and variables on the stack live only as long as the method that created them is running. If stack space is exhausted, Java throws java.lang.StackOverflowError.
- Heap: in Java the heap is managed by the garbage collector (in languages without GC it must be freed by code and is prone to memory leaks); it is much larger than the stack but still bounded by the configured maximum size, and access is slower. If heap space is exhausted, Java throws java.lang.OutOfMemoryError.
We have only one heap per JVM process. Therefore, the heap is a shared part of memory regardless of how many threads are running inside the JVM.
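A small sketch can make the stack/heap split concrete. The class and method names below are illustrative, not from any library: unbounded recursion exhausts the per-thread stack, while object allocations land on the shared heap.

```java
public class MemoryDemo {
    // Each call pushes a new stack frame; unbounded recursion exhausts the stack.
    static void recurse() { recurse(); }

    // The array object lives on the heap; the returned reference is stack data.
    static int[] allocate(int n) { return new int[n]; }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("caught java.lang.StackOverflowError");
        }
        // Only the local reference is on the stack; the ~4 MB of ints are on the heap.
        System.out.println("heap array length = " + allocate(1_000_000).length);
    }
}
```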
Big-O Notation
- O(1) — constant time: no matter how large the data is, the operation always takes the same amount of time.
- O(log n) — logarithmic time: grows much more slowly than the data size (in computer science the base is usually 2); if n is 8, only about 3 steps are needed, because log base 2 of 8 is 3. Binary search is logarithmic because it never looks at the majority of the elements in large collections.
- O(n) — linear time: performance grows linearly with the size of the collection; for example, looping through a list to count the elements matching a string takes linear time.
- O(n log n) — linearithmic time: repeatedly divides the data in half, then processes those halves independently; if n is 8, the algorithm does roughly 8 * log(8) = 8 * 3 = 24 units of work. Merge sort, heap sort, and quick sort work this way, dividing the unsorted list into sub-lists until each sub-list contains one element.
- O(n²) — quadratic time: code with nested loops, where each loop goes through the data, takes n-squared time.
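The gap between O(n) and O(log n) is easiest to see side by side. Here is a minimal Java sketch (names are illustrative): the linear scan may touch every element, while binary search halves the remaining range on each step.

```java
public class SearchComplexity {
    // O(n): examine every element in the worst case.
    static int linearSearch(int[] sorted, int target) {
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == target) return i;
        }
        return -1;
    }

    // O(log n): halve the remaining range on each step.
    static int binarySearch(int[] sorted, int target) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
            if (sorted[mid] == target) return mid;
            if (sorted[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 8, 16, 32, 64, 128, 256}; // n = 8
        System.out.println(linearSearch(data, 64));   // index 5, up to 8 comparisons
        System.out.println(binarySearch(data, 64));   // index 5, at most ~log2(8)=3 halvings
    }
}
```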
Search Improvement Methods
- Hashing maps the original data (the value) to smaller, fixed-size data (the key); applying a hash function over a set of 2D data produces a 2D hash table.
- The hash function usually converts large strings into smaller strings of equal size; it compresses the data and enables the system to access it rapidly.
A hash table is a data structure that associates a key with a value, enabling the algorithm to look up the original data very quickly (via an index).
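Java ships a hash table as java.util.HashMap; this short sketch shows the key-to-value lookup described above (the word-count data is made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class HashLookupDemo {
    public static void main(String[] args) {
        // The key's hashCode() is compressed into a bucket index,
        // so get/put are O(1) on average regardless of map size.
        Map<String, Integer> wordCounts = new HashMap<>();
        wordCounts.put("tree", 3);
        wordCounts.put("heap", 7);
        wordCounts.put("trie", 1);

        System.out.println(wordCounts.get("heap"));              // 7
        System.out.println(wordCounts.getOrDefault("graph", 0)); // 0: key absent
    }
}
```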
BitCask Hash Table
- supports hash index and offers high-performance read and write
- all keys are held in an in-memory hash table, and a single disk seek is enough to read the required value from disk into memory when needed.
- very useful when we have lots of updates in our key-value pairs.
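The Bitcask idea can be sketched in a few lines of Java. This is a toy illustration, not Bitcask's actual code: an in-memory hash index maps each key to the offset of its latest value in an append-only file, so a read costs at most one seek, and an update is just an append plus an index change (real Bitcask also stores timestamps, CRCs, and multiple data files).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

public class TinyBitcask {
    private final RandomAccessFile log;
    private final Map<String, long[]> keyDir = new HashMap<>(); // key -> {offset, length}

    public TinyBitcask(String path) throws IOException {
        log = new RandomAccessFile(path, "rw");
    }

    public void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        long offset = log.length();
        log.seek(offset);                  // always append; an update just re-points the index
        log.write(bytes);
        keyDir.put(key, new long[]{offset, bytes.length});
    }

    public String get(String key) throws IOException {
        long[] entry = keyDir.get(key);
        if (entry == null) return null;
        byte[] buf = new byte[(int) entry[1]];
        log.seek(entry[0]);                // the single disk seek
        log.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String path = Files.createTempFile("bitcask", ".log").toString();
        TinyBitcask db = new TinyBitcask(path);
        db.put("user:1", "alice");
        db.put("user:1", "bob");           // frequent updates are cheap
        System.out.println(db.get("user:1")); // prints "bob"
    }
}
```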
Binary Search Tree
A tree is a subtype of the graph data structure, which has a root and child nodes (but no cycles).
- In a binary search tree, each node has at most two children, and all keys in its left subtree are smaller than all keys in its right subtree.
- Searching for information in a binary search tree usually takes O(log n)
- Inserting a new key takes O(log n) on average and O(n) in the worst case (an unbalanced tree), since the algorithm navigates down the correct branch to fit the new key into its correct position, possibly restructuring a node and its children
- full binary tree — every node of the tree has either zero or two children.
- balanced binary tree — search cost and insert cost are guaranteed to be logarithmic, O(log n).
- Useful when the data objects (or keys) have an ordering inside the tree, and the structure of the data is static rather than changing.
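A minimal binary search tree sketch in Java (illustrative names, duplicates ignored) shows why search follows just one branch per level:

```java
public class BstDemo {
    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Walk down the correct branch until an empty spot is found.
    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root; // duplicate keys are ignored in this sketch
    }

    // Each comparison discards an entire subtree: O(log n) when balanced.
    static boolean contains(Node root, int key) {
        while (root != null) {
            if (key == root.key) return true;
            root = key < root.key ? root.left : root.right;
        }
        return false;
    }

    public static void main(String[] args) {
        Node root = null;
        for (int k : new int[]{8, 3, 10, 1, 6}) root = insert(root, k);
        System.out.println(contains(root, 6));  // true
        System.out.println(contains(root, 7));  // false
    }
}
```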
2–3 Search Tree
- nodes can hold more than one key — a node with n keys always has n+1 children.
- Search and insert costs are both guaranteed to be O(log n)
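Java's standard library offers the same guaranteed-logarithmic behavior through java.util.TreeMap, which is backed by a red-black tree, a balanced structure closely related to 2-3 trees:

```java
import java.util.TreeMap;

public class BalancedTreeDemo {
    public static void main(String[] args) {
        // TreeMap stays balanced, so get/put/remove are guaranteed O(log n),
        // unlike a plain binary search tree that can degenerate to O(n).
        TreeMap<Integer, String> map = new TreeMap<>();
        for (int i = 1; i <= 1000; i++) map.put(i, "v" + i);

        System.out.println(map.get(500));        // v500
        System.out.println(map.firstKey());      // 1  (sorted order comes for free)
        System.out.println(map.ceilingKey(250)); // 250
    }
}
```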
B-Tree
- a generalization of the 2–3 search tree in which a node can have more than two children.
- all of the B-Tree leaves must be at the same depth.
Trie, Prefix Tree, Radix Tree
- Trie nodes are ordered from the root, the root node of the trie is always empty, and each node holds a partial key.
- A trie can be used to store words: the path from the root to a terminal node represents a word, and each node stores one character
- As the trie gets bigger it holds more words, but the time to find the right place for a newly inserted data object still depends only on the key's length, not on how many words are stored
- A trie can find a word in the dictionary in O(n) time, where n is the length of the word; it can also insert a word in O(n) time.
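The points above fit into a short Java sketch (illustrative names): each node maps one character to its children, and the cost of insert and lookup is the word's length.

```java
import java.util.HashMap;
import java.util.Map;

public class TrieDemo {
    static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        boolean isWord; // marks the end of a complete word
    }

    private final TrieNode root = new TrieNode(); // the empty root

    // O(n) in the word's length, independent of how many words are stored.
    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        TrieDemo trie = new TrieDemo();
        trie.insert("car");
        trie.insert("card");
        System.out.println(trie.contains("car")); // true
        System.out.println(trie.contains("ca"));  // false: a prefix, not a stored word
    }
}
```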
Bit Manipulations & Compression
- a bit is the smallest unit of information that can be either 0 or 1.
- 8 bits constitute a larger unit called a byte, 1024 bytes are called a kilobyte, 1024 kilobytes a megabyte, 1024 megabytes a gigabyte, 1024 gigabytes a terabyte, …
- compression transforms the data from one format into another, smaller format => it occupies less space => reduces latency
- usually, the process is done at the bit level and the data is converted into a binary format
- useful for data with low cardinality, which means the diversity of data is limited, for instance, gender could be either ‘male’ or ‘female’.
Huffman Coding
- a lossless compression technique (compresses much better)
- instead of using ASCII codes (8 bits) to store each character, we can use as few as three bits to represent each character
- useful when we have lots of repetitive information, like words inside the text document, pixels inside a picture, etc.
- operates by building a binary tree and assigning a code to each character based on its position inside the tree
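Huffman coding can be sketched compactly in Java (illustrative names; codes are built as '0'/'1' strings for readability, where a real implementation would pack actual bits): the two rarest subtrees are merged repeatedly, so frequent characters end up closer to the root with shorter codes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanDemo {
    static class Node implements Comparable<Node> {
        final int freq;
        final char ch;          // meaningful only for leaves
        final Node left, right;
        Node(char ch, int freq) { this(ch, freq, null, null); }
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> pq = new PriorityQueue<>();
        freq.forEach((c, f) -> pq.add(new Node(c, f)));
        while (pq.size() > 1) {            // repeatedly merge the two rarest subtrees
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        collect(pq.poll(), "", codes);
        return codes;
    }

    // A character's code is its path from the root: left = 0, right = 1.
    static void collect(Node node, String path, Map<Character, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.ch, path.isEmpty() ? "0" : path); // single-character edge case
            return;
        }
        collect(node.left, path + "0", codes);
        collect(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        Map<Character, String> codes = buildCodes("abracadabra");
        // 'a' occurs 5 times, so its code is the shortest.
        codes.forEach((c, code) -> System.out.println(c + " -> " + code));
    }
}
```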
MinHash
- MinHash estimates the similarity of sets of information objects — images, time series (audio signals, heart rate), or anything else.
- MinHash is based on the Jaccard similarity. It maps each set of information into a signature vector of fixed length, which makes it possible to estimate how similar two sets are without enumerating their elements.
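A minimal MinHash sketch in Java (illustrative names; the seeded mixing function is an arbitrary stand-in for any family of independent hashes): each of k hash functions keeps only its minimum value over the set, and the fraction of matching positions in two signatures estimates the Jaccard similarity.

```java
import java.util.Random;
import java.util.Set;

public class MinHashDemo {
    static long[] minHashSignature(Set<String> set, long[] seeds) {
        long[] sig = new long[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            long min = Long.MAX_VALUE;
            for (String s : set) {
                // Cheap seeded hash; identical elements always hash identically per seed.
                long h = (s.hashCode() * 0x9E3779B97F4A7C15L) ^ seeds[i];
                h = Long.rotateLeft(h * 0xC2B2AE3D27D4EB4FL, 31);
                min = Math.min(min, h);
            }
            sig[i] = min; // keep only the minimum: the "Min" in MinHash
        }
        return sig;
    }

    // Fraction of agreeing positions approximates the Jaccard similarity.
    static double estimateJaccard(long[] a, long[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) matches++;
        return (double) matches / a.length;
    }

    public static void main(String[] args) {
        long[] seeds = new Random(42).longs(128).toArray(); // 128 hash functions
        Set<String> docA = Set.of("the", "quick", "brown", "fox");
        Set<String> docB = Set.of("the", "quick", "red", "fox");
        // True Jaccard = |{the,quick,fox}| / |{the,quick,brown,red,fox}| = 3/5 = 0.6
        double est = estimateJaccard(minHashSignature(docA, seeds),
                                     minHashSignature(docB, seeds));
        System.out.println("estimated Jaccard ~ " + est);
    }
}
```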
Bloom Filter
- A data structure that tells us whether a particular item might exist in a set, by checking the bits which are set to 1; it can report false positives but never false negatives.
- It is memory efficient and fast; the base structure is a bit vector (each cell is a bit).
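A Bloom filter fits in a few lines over java.util.BitSet (illustrative names; the way k hashes are derived from one is a simplification): each added item sets k bits, and a lookup that finds any of those bits unset means the item is definitely absent.

```java
import java.util.BitSet;

public class BloomFilterDemo {
    private final BitSet bits;
    private final int size;
    private final int k; // number of hash functions

    BloomFilterDemo(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive k bit positions from one hash by mixing in the function index i.
    private int bitIndex(String item, int i) {
        int h = item.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(bitIndex(item, i));
    }

    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(bitIndex(item, i))) return false; // definitely not in the set
        }
        return true; // probably in the set (false positives possible)
    }

    public static void main(String[] args) {
        BloomFilterDemo filter = new BloomFilterDemo(1 << 16, 3);
        filter.add("apple");
        filter.add("banana");
        System.out.println(filter.mightContain("apple"));  // true, guaranteed
        System.out.println(filter.mightContain("durian")); // false with high probability
    }
}
```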