Microsoft Florence-2
Microsoft has launched a new vision foundation model called Florence-2, which is trained to provide a unified representation for a variety of vision tasks. It is a generalist model that can be fine-tuned for multiple tasks such as segmentation, object detection, OCR, and image description. Despite being a small model, it does a pretty good job of beating larger models like Flamingo 80B on zero-shot tasks.
Here is an example of the model's output for object detection, where I asked the model to detect wheels. Here is the result.
As we can see, the model has correctly identified the regions of the image where wheels are present, and its output includes both the labels and the bounding boxes, which is quite amazing.
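To make the shape of such a result concrete, here is a minimal Python sketch that pairs each label with its bounding box. The dictionary layout (parallel `bboxes` and `labels` lists, with boxes as `[x1, y1, x2, y2]` pixel coordinates) is an assumption about how a detection result is typically structured, and the numbers are made-up sample values, not real model output. The helper names `pair_detections` and `box_area` are hypothetical.

```python
# Hypothetical detection result; the values are illustrative only,
# not actual Florence-2 output.
result = {
    "bboxes": [[34.0, 160.5, 120.0, 245.0], [310.2, 158.0, 398.7, 246.3]],
    "labels": ["wheel", "wheel"],
}


def pair_detections(result):
    """Return a list of (label, (x1, y1, x2, y2)) tuples."""
    if len(result["bboxes"]) != len(result["labels"]):
        raise ValueError("bboxes and labels must have the same length")
    return [
        (label, tuple(box))
        for label, box in zip(result["labels"], result["bboxes"])
    ]


def box_area(box):
    """Area of an (x1, y1, x2, y2) box in square pixels."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


detections = pair_detections(result)
for label, box in detections:
    print(f"{label}: box={box}, area={box_area(box):.1f} px^2")
```

Keeping labels and boxes paired like this makes it easy to filter by label or to draw each box on the image.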
Here is the link to the Hugging Face Space where you can try this model: [Florence-2]
How the Florence Model is Built
Microsoft presents this as a 2D view, where the x-axis represents spatial hierarchy and the y-axis represents semantic granularity.
X-axis (spatial hierarchy): how deep you go into the image. Moving from left to right, the tasks become more complex: image level (classification tasks), region level (object detection tasks), and pixel level (image segmentation tasks).
Y-axis (semantic granularity): none (no semantics), coarse semantics (level 1 captioning and description, as shown in the image, simply stating person, car, road), and fine-grained semantics (level 2 captioning, a very detailed explanation of the image, as shown in the figure).
The model is trained to perform all of these tasks through a unified representation. The goal is to replace current large vision models, which perform very well on specific transfer learning problems but fail to excel across all the tasks in the hierarchy.
Dataset Preparation
To help Florence-2 excel at multiple tasks, Microsoft created a high-quality annotated dataset, FLD-5B, which contains 5.4 billion annotations on 126 million images.
Florence Data Engine:
Step-1: Microsoft generated initial annotations for the images with the help of existing models that specialize in a particular task, for example the Azure OCR API for OCR and a specialist model for object detection.
Step-2: A multifaceted filtering process refines the annotations and eliminates undesired ones. The general filtering protocol focuses mainly on two data types in the annotations: text and region data.
Step-3: They trained a multitask model adept at processing sequential data. Evaluating this model on the training images revealed significant improvements in its predictions, especially in cases where the original labels contained inaccuracies or irrelevant information, which is common in alt-texts. Encouraged by these promising results, they incorporated the refined annotations alongside the originals.
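The three steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the stub functions `fake_ocr` and `fake_detector` stand in for real specialist models, the annotation dictionaries are a made-up format, and the filtering rules (dropping empty text and regions below a confidence threshold of 0.5) are hypothetical examples of text and region filters, not the actual rules Microsoft used.

```python
def annotate(image_id, specialists):
    """Step 1: collect initial annotations from task-specific specialists."""
    return [ann for model in specialists for ann in model(image_id)]


def filter_annotations(annotations, min_score=0.5):
    """Step 2: drop empty text and low-confidence regions (assumed rules)."""
    kept = []
    for ann in annotations:
        if ann["type"] == "text" and not ann["value"].strip():
            continue  # empty or whitespace-only text
        if ann["type"] == "region" and ann["score"] < min_score:
            continue  # region below the assumed confidence threshold
        kept.append(ann)
    return kept


def refine(original, predicted):
    """Step 3: keep refined predictions alongside the original annotations."""
    return original + [p for p in predicted if p not in original]


# Stub specialists standing in for real OCR and detection models.
def fake_ocr(image_id):
    return [{"type": "text", "value": "STOP"}, {"type": "text", "value": "  "}]


def fake_detector(image_id):
    return [
        {"type": "region", "value": "car", "box": [10, 20, 200, 120], "score": 0.91},
        {"type": "region", "value": "car", "box": [0, 0, 5, 5], "score": 0.12},
    ]


initial = annotate("img_001", [fake_ocr, fake_detector])
filtered = filter_annotations(initial)
# Stand-in for the multitask model's predictions on the same image.
model_predictions = [{"type": "text", "value": "a red car next to a stop sign"}]
final = refine(filtered, model_predictions)
print(len(initial), len(filtered), len(final))
```

In this toy run, the whitespace-only OCR result and the low-confidence box are filtered out, and the model's caption is then added next to the surviving original annotations.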
Model Architecture
As shown in the image, the input image is passed through a vision encoder to produce image token embeddings. Alongside the image, a multitask prompt is tokenized and embedded to produce language token embeddings. The image and language token embeddings are then concatenated and passed through transformer encoder and decoder blocks, which generate the final output as text plus location tokens.
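To illustrate what "location tokens" can look like, here is a small Python sketch that quantizes a bounding box into discrete bins and writes each bin as a token. The paper describes quantizing coordinates into 1000 bins; the `<loc_N>` token spelling, the bin-center decoding, and the function names `box_to_tokens` and `tokens_to_box` are assumptions made for this example, so treat it as a sketch of the idea rather than the model's exact scheme.

```python
import re

NUM_BINS = 1000  # number of coordinate bins per axis


def box_to_tokens(box, width, height):
    """Quantize an (x0, y0, x1, y1) pixel box into four <loc_N> tokens."""
    sizes = (width, height, width, height)
    bins = [
        min(int(value * NUM_BINS / size), NUM_BINS - 1)
        for value, size in zip(box, sizes)
    ]
    return "".join(f"<loc_{b}>" for b in bins)


def tokens_to_box(text, width, height):
    """Decode four <loc_N> tokens back to pixel coordinates (bin centers)."""
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
    if len(bins) != 4:
        raise ValueError("expected exactly four location tokens")
    sizes = (width, height, width, height)
    return [(b + 0.5) * size / NUM_BINS for b, size in zip(bins, sizes)]


tokens = box_to_tokens((64, 48, 320, 240), width=640, height=480)
print("wheel" + tokens)
print(tokens_to_box(tokens, width=640, height=480))
```

Because boxes become ordinary tokens, the same decoder can emit both words and coordinates in a single output sequence. The round trip loses at most one bin width of precision per coordinate.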
Given the input x, which combines the image and the prompt, and the target y, they use standard language modeling with a cross-entropy loss for all the tasks.
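In other words, the loss is the negative log-probability the model assigns to each target token, given the input x and the preceding target tokens, summed over the target sequence. The following Python sketch computes that quantity for a toy example. The per-step probability dictionaries stand in for the decoder's softmax outputs, and both the numbers and the function name `sequence_cross_entropy` are made up for illustration.

```python
import math


def sequence_cross_entropy(step_probs, target_tokens):
    """Sum of -log p(target token) over all decoding steps."""
    if len(step_probs) != len(target_tokens):
        raise ValueError("need one probability distribution per target token")
    loss = 0.0
    for probs, token in zip(step_probs, target_tokens):
        p = probs.get(token, 0.0)
        if p <= 0.0:
            return math.inf  # the model gave the correct token zero probability
        loss -= math.log(p)
    return loss


# Toy distributions for a two-token target: a word, then a location token.
step_probs = [
    {"wheel": 0.5, "car": 0.5},
    {"<loc_100>": 0.25, "<loc_101>": 0.75},
]
target = ["wheel", "<loc_100>"]
loss = sequence_cross_entropy(step_probs, target)
print(f"{loss:.4f}")
```

Since text and location tokens share one vocabulary, the same loss covers captioning, detection, and the other tasks without task-specific heads.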
Conclusion
Kudos to the Microsoft team for building this model. Despite its smaller size, its performance is impressive and comparable to that of other large vision models. I see huge potential in fine-tuning this model further for custom tasks and using it as a replacement for much larger computer vision models, which would reduce cost and inference time while still giving good-quality results.
References
Research Paper [https://arxiv.org/pdf/2311.06242]
By Venkata Sai Santosh
Senior Data Scientist