Release Notes 0.6.0
We're excited to share the latest developments in our library: more embedding choices with Cohere Embed 4 and SigLIP, a new chunking method called late chunking, and a new processor crate that improves the modularity and maintainability of our Rust codebase. Let's dive in.
Late Chunking
The new 0.5.6 version adds Late Chunking to EmbedAnything, a technique introduced by Jina AI and Weaviate. Here's how we've implemented Late Chunking in EA:
Batch as Chunk Group: In EmbedAnything, with late chunking enabled, the batch size determines the number of neighboring chunks that will be processed together.
Joint Embedding: The grouped chunks are fed into the embedding model as a single, larger input. This allows the model to capture relationships and dependencies between adjacent chunks.
Embedding Split: After embedding, the combined output is divided back into the embeddings for the original, individual chunks.
Mean Pooling (per chunk): Mean pooling is then applied to each individual chunk's embedding, incorporating the contextual information learned during the joint embedding phase. The split-and-pool steps are sketched below.
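To make the split and mean-pooling steps concrete, here is a minimal sketch of that logic (illustrative only, not the library's internals; token_embeddings and chunk_spans are assumed to come from a joint forward pass over the grouped chunks):

import numpy as np

# Split jointly computed token embeddings back into per-chunk spans,
# then mean-pool each span into a single contextual chunk embedding.
def split_and_pool(
    token_embeddings: np.ndarray,        # shape: (num_tokens, dim)
    chunk_spans: list[tuple[int, int]],  # (start, end) token offsets per chunk
) -> list[np.ndarray]:
    return [token_embeddings[start:end].mean(axis=0) for start, end in chunk_spans]

# Example: three chunks whose tokens were embedded together in one pass.
pooled = split_and_pool(np.random.rand(30, 8), [(0, 10), (10, 20), (20, 30)])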
Key Benefits:
Context-Aware Embeddings: By embedding neighboring chunks together, we capture crucial contextual information that would be lost with independent chunking.
Optimized Retrieval Performance: Expect a significant improvement in the accuracy and relevance of your search results.
from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig, EmbedData

# Load a local ONNX embedding model from the Hugging Face Hub.
model: EmbeddingModel = EmbeddingModel.from_pretrained_onnx(
    WhichModel.Jina, hf_model_id="jinaai/jina-embeddings-v2-small-en", path_in_repo="model.onnx"
)

# Enable late chunking: each batch of 8 neighboring chunks is embedded jointly.
config = TextEmbedConfig(
    chunk_size=1000,
    batch_size=8,
    splitting_strategy="sentence",
    late_chunking=True,
)

# Embed a single file
data: list[EmbedData] = model.embed_file("test_files/attention.pdf", config=config)
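Each returned EmbedData carries the chunk text, its embedding, and metadata, so you can inspect the results directly:

# Inspect the first chunk: its text and the embedding dimensionality.
first = data[0]
print(first.text[:80])
print(len(first.embedding))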
Cohere Embed 4:
- Single embedding per document, even for multimodal inputs
- Handles up to 128K tokens, perfect for long-form business documents
- Supports compressed vector formats (int8, binary) for real-world scalability
- Multilingual across 100+ languages
The catch? It's not open source, and even if it were, the model would be quite hefty to run locally. But if you're already using cloud-based embeddings like OpenAI's, Embed v4 is worth testing.
from embed_anything import EmbeddingModel, WhichModel, EmbedData

# Initialize the model once; a Cohere API key must be set in your environment.
model: EmbeddingModel = EmbeddingModel.from_pretrained_cloud(
    WhichModel.CohereVision, model_id="embed-v4.0"
)
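Usage then mirrors the local models shown above, for example (the PDF path here is illustrative):

# Illustrative: embed a long, possibly multimodal PDF via the cloud model.
data: list[EmbedData] = model.embed_file("test_files/report.pdf")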
SigLIP
We already had CLIP support, but many of you asked for SigLIP. It outperforms CLIP on zero-shot classification at smaller batch sizes and is also more memory-efficient.
import embed_anything

# Load the model. SigLIP checkpoints load through the existing CLIP variant.
model = embed_anything.EmbeddingModel.from_pretrained_hf(
    embed_anything.WhichModel.Clip,
    model_id="google/siglip-base-patch16-224",
)
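Usage follows the existing CLIP pattern, for example embedding an image directory and a text query (the directory and query below are illustrative):

# Embed every image in a directory, then embed a text query for search.
images = embed_anything.embed_image_directory("test_files", embedder=model)
query = embed_anything.embed_query(["a photo of a dog"], embedder=model)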
Processor Crate:
This crate contains various "processors" that accept files and produce a chunked, metadata-rich document description. This is especially helpful for retrieval-augmented generation!
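To illustrate the shape of that output, here is a conceptual sketch in Python (the field names are hypothetical; the crate itself is Rust and its actual types may differ):

from dataclasses import dataclass, field

# Hypothetical shape of a chunked, metadata-rich document description.
@dataclass
class ProcessedDocument:
    source_path: str       # where the file came from
    chunks: list[str]      # chunked text ready for embedding
    metadata: dict[str, str] = field(default_factory=dict)  # e.g. page, title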
We have also received some cool feature requests on GitHub that we would like to implement. If you want to help out, please check out EmbedAnything on GitHub. We'd love your contributions!