Saturday, June 20, 2026
Today, I worked on creating a script for the #cosine-similarity blog. I used BGE-M3 to create the initial dense embeddings. I retrieve the top 50 results and re-rank them with BGE-reranker-v2 to get the top 5. I have set up search and retrieval pipelines for GitHub commits and YouTube videos. I can fetch commits by repo name and date range. For YouTube videos, I use a dataset from Kaggle. Tomorrow, I plan to finish the presentation and add a #blog entry.
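A minimal sketch of the retrieve-then-rerank flow from this entry, for reference. It is not the actual script: `toy_embed` and `toy_rerank_score` are hypothetical stand-ins for BGE-M3 and the BGE reranker, the function and parameter names are assumptions, and the sample documents are made up for illustration.

```python
import math
import string
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    embed: Callable[[str], Sequence[float]],
    rerank_score: Callable[[str, str], float],
    first_stage_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Stage 1: rank all docs by dense cosine similarity and keep first_stage_k.
    Stage 2: re-score only those candidates with the reranker and keep final_k."""
    q_vec = embed(query)
    by_dense = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    candidates = by_dense[:first_stage_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:final_k]


# Hypothetical stand-in for a dense embedding model: letter counts.
def toy_embed(text: str) -> List[float]:
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in string.ascii_lowercase]


# Hypothetical stand-in for a cross-encoder reranker: shared-word count.
def toy_rerank_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


# Made-up commit messages, for illustration only.
sample_docs = [
    "fix login bug",
    "add cosine similarity demo",
    "update readme",
    "cosine similarity blog draft",
    "refactor search pipeline",
]

top = retrieve_then_rerank(
    "cosine similarity blog",
    sample_docs,
    embed=toy_embed,
    rerank_score=toy_rerank_score,
    first_stage_k=3,
    final_k=2,
)
print(top)
```

The point of the two stages is cost: the dense pass is cheap enough to run over everything, while the reranker scores each (query, document) pair and so only sees the short candidate list.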
Monday, June 8, 2026
I started a #blog for #cosine-similarity. The blog starts with a visual of the definition of the cos function. It then talks about vectors and how the value of cos theta is 1 when the vectors overlap (point in the same direction) and 0 when the vectors are orthogonal. This is essentially the dot product of the vectors after normalizing each to unit length. I then created a script that uses word embeddings from the gensim package to illustrate how learned word embeddings live in a latent space that encodes the semantic meaning of words. For example, we can do things like Queen ≈ King - Male + Female in the latent space. However, the result is not exact, so we use nearest neighbor search to find the closest vector. The next step is to play with sentence embeddings and create the GitHub commit understanding examples.
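A small sketch of the two ideas in this entry, using plain Python rather than gensim so it runs anywhere. The 2-D vectors in `toy_vectors` are hypothetical hand-picked values (axes loosely meaning "royalty" and "gender"), not real learned embeddings, and the function names are assumptions.

```python
import math
from typing import Dict, Iterable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # same direction -> 1.0
print(cosine_similarity([1, 0], [0, 3]))   # orthogonal     -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite       -> -1.0

# Hypothetical toy embeddings: [royalty, gender]. Not learned values.
toy_vectors: Dict[str, List[float]] = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "male": [0.1, 0.9],
    "female": [0.1, -0.9],
}


def nearest_word(
    target: Sequence[float],
    vectors: Dict[str, List[float]],
    exclude: Iterable[str] = (),
) -> str:
    """Return the word whose vector has the highest cosine similarity to target."""
    skipped = set(exclude)
    candidates = {w: v for w, v in vectors.items() if w not in skipped}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


# king - male + female lands near, but not exactly on, the "queen" vector.
target = [
    k - m + f
    for k, m, f in zip(toy_vectors["king"], toy_vectors["male"], toy_vectors["female"])
]
answer = nearest_word(target, toy_vectors, exclude=("king", "male", "female"))
print(target, answer)
```

The computed `target` does not equal any stored vector, which is why the lookup goes through nearest neighbor search. The input words are excluded from the search so that the answer is not simply one of the words in the query.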