Microsoft's computer vision model will generate alt text for Reddit images

Kyle Wiggers

Updated March 7, 2023 at 1:13 p.m.

Two years ago, Microsoft announced Florence, an AI system that it pitched as a "complete rethinking" of modern computer vision models. Unlike most vision models at the time, Florence was both "unified" and "multimodal," meaning it could (1) handle a range of tasks rather than being limited to specific applications, like generating captions, and (2) understand language as well as images.

Now, as a part of Microsoft's broader, ongoing effort to commercialize its AI research, Florence is arriving as a part of an update to the Vision APIs in Azure Cognitive Services. The Florence-powered Microsoft Vision Services launches today in preview for existing Azure customers, with capabilities ranging from automatic captioning, background removal and video summarization to image retrieval.

"Florence is trained on billions of image-text pairs. As a result, it's incredibly versatile," John Montgomery, CVP of Azure AI, told TechCrunch in an email interview. "Ask Florence to find a particular frame in a video, and it can do that; ask it to tell the difference between a Cosmic Crisp apple and a Honeycrisp apple, and it can do that."

The AI research community, which includes tech giants like Microsoft, has increasingly coalesced around the idea that multimodal models are the best path forward to more capable AI systems. Naturally, multimodal models -- models that, once again, understand multiple modalities, such as language and images or videos and audio -- are able to perform tasks in one shot that unimodal models simply cannot (e.g. captioning videos).

Why not string several "unimodal" models together to achieve the same end, like a model that understands only images and another that understands exclusively language? A few reasons, the first being that multimodal models in some cases perform better at the same task than their unimodal counterparts thanks to the contextual information from the additional modalities. For example, an AI assistant that understands images, pricing data and purchasing history is likely to offer better-personalized product suggestions than one that only understands pricing data.

The second reason is that multimodal models tend to be more efficient from a computational standpoint -- leading to speedups in processing and (presumably) cost reductions on the backend. Microsoft being the profit-driven business that it is, that is, no doubt, a plus.

So what about Florence? Well, because it understands images, video and language and the relationships between those modalities, it can do things like measure the similarity between images and text or segment objects in a photo and paste them onto another background.
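
To make "measuring the similarity between images and text" a little more concrete, here is a minimal sketch of one common approach: map the image and each candidate caption to vectors in a shared embedding space, then compare them with cosine similarity. This is not Florence's implementation or the Azure API, which the article does not describe. The embedding values below are made up for illustration, and the names `cosine_similarity` and `best_caption` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_caption(image_vec, caption_vecs):
    """Pick the caption whose embedding is most similar to the image embedding."""
    return max(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
    )


# Assumption: these vectors stand in for the output of a multimodal model.
# Real embeddings would have hundreds of dimensions and come from the model.
image_embedding = [0.9, 0.1, 0.3]
caption_embeddings = {
    "a red apple on a table": [0.8, 0.2, 0.4],
    "a city skyline at night": [0.1, 0.9, 0.2],
}

print(best_caption(image_embedding, caption_embeddings))
```

A score near 1 means the image and text vectors point in nearly the same direction, while a score near 0 means they are unrelated. Ranking captions, tags or search queries against an image then reduces to sorting by that score.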

I asked Montgomery which data Microsoft used to train Florence -- a timely question, I thought, in light of pending lawsuits that could decide whether AI systems trained on copyrighted data, including images, are in violation of the rights of intellectual property holders. He wouldn't give specifics, save that Florence uses "responsibly obtained" data sources "including data from partners." In addition, Montgomery said that Florence's training data was scrubbed of potentially problematic content, which is another all-too-common feature of public training datasets.

"When using large foundational models, it is paramount to assure the quality of the training dataset, to create the foundation for the adapted models for each Vision task," Montgomery said. "Furthermore, the adapted models for each Vision task has been tested for fairness, adversarial and challenging cases and implement the same content moderation services we’ve been using for Azure OpenAI Service and DALL-E."

We'll have to take the company's word for it. Some customers are, it seems. Montgomery says that Reddit will use the new Florence-powered APIs to generate captions for images on its platform, creating "alt text" so users with vision challenges can better follow along in threads.

"Florence’s ability to generate up to 10,000 tags per image will give Reddit much more control over how many objects in a picture they can identify and help generate much better captions," Montgomery said. "Reddit will also use the captioning to help all users improve article ranking for searching for posts."

Microsoft is also using Florence across a swath of its own platforms, products and services.

On LinkedIn, as on Reddit, Florence-powered services will generate captions to edit and support alt text image descriptions. In Microsoft Teams, Florence is driving video segmentation capabilities. PowerPoint, Outlook and Word are leveraging Florence's image captioning abilities for automatic alt text generation. And Designer and OneDrive, courtesy of Florence, have gained better image tagging, image search and background generation.

Montgomery sees Florence being used by customers for much more down the line, like detecting defects in manufacturing and enabling self-checkout in retail stores. None of those use cases require a multimodal vision model, I'd note. But Montgomery asserts that multimodality adds something valuable to the equation.

"Florence is a complete re-thinking of vision models," Montgomery said. "Once there’s easy and high-quality translation between images and text, a world of possibilities opens up. Customers will be able to experience significantly improved image search, to train image and vision models and other model types like language and speech into entirely new types of applications and to easily improve the quality of their own customized versions."
