Microsoft Florence-2

Microsoft has released Florence-2, a new vision foundation model trained as a generalist: a unified representation for a variety of vision tasks. It can be fine-tuned for tasks such as segmentation, object detection, OCR and image description. Despite being a small model, it is reported to beat much larger models such as Flamingo 80B on zero-shot tasks.

Here is an example of the model's output for object detection, where I asked the model to detect wheels. Here is the result.

As we can see, the model has correctly identified the regions of the image where wheels are present, and its output includes both the labels and the bounding boxes, which is quite amazing.

Here is the link to the Hugging Face Space where you can try this model: [Florence-2]
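For readers who would rather run the model locally than through the Space, here is a minimal sketch. It assumes the `microsoft/Florence-2-large` checkpoint on Hugging Face, the `transformers`, `torch` and `Pillow` packages, and the task tokens listed on the model card (such as `<OD>` and `<OPEN_VOCABULARY_DETECTION>`). The helper names `build_prompt` and `run_florence` and the file name `wheels.jpg` are made up for illustration, and the loading calls may need adjusting for your `transformers` version.

```python
def build_prompt(task, text_input=None):
    """A Florence-2 prompt is a task token, optionally followed by free text."""
    return task if text_input is None else task + text_input


def run_florence(image_path, task="<OD>", text_input=None,
                 model_id="microsoft/Florence-2-large"):
    # Heavy imports stay inside the function so the helper above works without them.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # The checkpoint ships its own modelling code, hence trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(task, text_input)
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the location tokens are needed for post-processing.
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
```

A call such as `run_florence("wheels.jpg", "<OPEN_VOCABULARY_DETECTION>", "wheels")` would correspond to the wheel example above. The first call downloads the model weights.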

How the Florence Model is Built

Microsoft frames the problem as a 2D view, where the x-axis represents spatial hierarchy and the y-axis represents semantic granularity.

X-axis (spatial hierarchy): how deep you go into the image. Moving from left to right, the tasks become more complex: image level (classification tasks), region level (object detection tasks) and pixel level (image segmentation tasks).

Y-axis (semantic granularity): none (no semantics), coarse semantics (level 1 captioning, a short description as shown in the image, simply stating person, car, road) and fine-grained semantics (level 2 captioning, a very detailed explanation of the image as shown in the figure).

So the model is trained to perform across all of these tasks through one unified representation. The aim is to replace current large vision models, which perform very well on specific transfer-learning problems but fail to excel across the whole hierarchy.

Dataset Preparation

To help Florence excel at multiple tasks, Microsoft created a high-quality annotated dataset, FLD-5B: 5.4 billion annotations on 126 million images.

Florence Data Engine:

Step-1: Microsoft generated initial annotations for the images with the help of existing specialist models, each strong at one particular task: for example, the Azure OCR API for OCR and a specialist detector for object detection.

Step-2: A multifaceted filtering process refines the annotations and eliminates undesired ones. Their general filtering protocol focuses mainly on two data types in the annotations: text and region data.

Step-3: They trained a multitask model adept at processing sequential data on the filtered annotations. Evaluating this model against the training images revealed significant improvements in its predictions, especially where the original labels contained inaccuracies or irrelevant information, which is common in alt-texts. Encouraged by these results, they incorporated the refined annotations alongside the originals.
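To make the region side of Step-2 concrete, here is a minimal sketch of one plausible filter: drop low-confidence boxes, then apply non-maximum suppression to remove near-duplicates. The function names, the annotation layout (`box`, `label`, `score`) and both thresholds are assumptions for illustration, not values taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_regions(annotations, min_score=0.5, max_iou=0.5):
    """Keep confident boxes, then suppress boxes that overlap a better one.

    annotations: list of dicts with 'box', 'label' and 'score' keys.
    The thresholds are illustrative defaults, not the paper's settings.
    """
    confident = [a for a in annotations if a["score"] >= min_score]
    confident.sort(key=lambda a: a["score"], reverse=True)

    kept = []
    for candidate in confident:
        # Class-agnostic NMS: compare against every box already kept.
        if all(iou(candidate["box"], k["box"]) <= max_iou for k in kept):
            kept.append(candidate)
    return kept
```

Given two heavily overlapping wheel boxes and one low-scoring box, this keeps only the higher-scoring wheel box. A production pipeline would more likely run suppression per class and tune both thresholds on held-out data.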

Model Architecture

As shown in the image, the input image is passed into a vision encoder, which returns image token embeddings. Alongside the image, a multitask prompt is tokenized and embedded into language token embeddings. The image and language token embeddings are then concatenated and passed into the transformer encoder and decoder blocks of a multimodal encoder-decoder, which generates the final output: text plus location tokens.
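Because the output is plain text interleaved with location tokens, turning it back into bounding boxes is a small parsing step. The sketch below assumes location tokens of the form `<loc_N>`, coordinates quantized into 1,000 bins, and exactly one box (four tokens) after each label. These details come from the public release rather than from this article, and the function name `decode_boxes` is made up for illustration.

```python
import re

NUM_BINS = 1000  # assumption: each coordinate is quantized into 1,000 bins
BOX_PATTERN = re.compile(r"([^<>]+)((?:<loc_\d+>){4})")
LOC_PATTERN = re.compile(r"<loc_(\d+)>")


def decode_boxes(text, image_width, image_height, num_bins=NUM_BINS):
    """Parse 'label<loc_x1><loc_y1><loc_x2><loc_y2>' runs into pixel boxes."""
    results = []
    for label, locs in BOX_PATTERN.findall(text):
        x1, y1, x2, y2 = (int(n) for n in LOC_PATTERN.findall(locs))
        # Map each bin index to the centre of its bin, then scale to pixels.
        box = (
            (x1 + 0.5) * image_width / num_bins,
            (y1 + 0.5) * image_height / num_bins,
            (x2 + 0.5) * image_width / num_bins,
            (y2 + 0.5) * image_height / num_bins,
        )
        results.append({"label": label.strip(), "bbox": box})
    return results
```

Mapping a bin to its centre is one reasonable choice. Using the bin's lower edge would shift every coordinate by half a bin.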

Given the input x, combined from the image and the prompt, and the target y, they use standard language modelling with a cross-entropy loss for all tasks.
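As a worked illustration of that objective, the sketch below computes the cross-entropy of one target sequence as the sum of -log P(y_i | y_<i, x) over its tokens. It assumes the model's per-step probability distributions are already available as plain lists, and the function name `sequence_cross_entropy` is hypothetical. Real training code would compute the same quantity from logits inside a deep learning framework.

```python
import math


def sequence_cross_entropy(step_probs, target_ids):
    """Sum of -log P(y_i | y_<i, x) over the target tokens.

    step_probs[i] is the predicted distribution over the vocabulary at
    step i; target_ids[i] is the index of the correct token at that step.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is needed per target token")
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
```

Since text tokens and location tokens are all entries in one output vocabulary, the same loss can cover captions, boxes and OCR without task-specific heads.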

Conclusion

Kudos to the Microsoft team for building this model. Despite its small size, its performance is impressive and comparable with much larger vision models. I see huge potential in fine-tuning it for custom tasks and using it as a replacement for much larger computer vision models, which would reduce cost and inference time while keeping good-quality results.

References

Research Paper  [https://arxiv.org/pdf/2311.06242] 

By Venkata Sai Santosh

Senior Data Scientist
