Implementing Azure AI Vision with Python: OCR, Object Detection, Tagging & Captioning
Learn to Use Python for OCR, Object Detection, Tagging, and Captioning with Azure AI Vision

Passionate about helping organizations build scalable infrastructure and DevOps solutions with cloud technologies. Experienced in designing robust systems, automating processes, and driving efficiency through innovative cloud solutions. Advocate for best practices in DevOps and cloud computing, committed to enabling teams to achieve their full potential.
Introduction
Computer Vision is a branch of Artificial Intelligence that allows machines to see, analyze, and understand images and videos. Azure AI Vision provides pre-built models through simple APIs, enabling developers to integrate features like object detection, OCR, tagging, and caption generation into their applications.
In this blog, we’ll walk through how to implement Azure AI Vision using Python, showcasing its major features with practical code examples.
Azure AI Vision & Use Cases
Instead of spending time training deep learning models from scratch, Azure AI Vision provides ready-to-use models that can:
Detect objects in images
Extract text using OCR
Automatically generate tags
Create human-like captions describing images
This makes it extremely useful for building real-world AI solutions quickly and at scale.

How It Helps LLMs
When combined with Large Language Models (LLMs), these visual capabilities make AI systems multimodal:
LLMs can reason not just on text, but also on what’s inside an image.
Example: Azure Vision extracts text from an invoice (OCR) → LLM interprets and summarizes it → business system updates automatically.
Real-World Use Case
Imagine a retail company managing thousands of product images:
Azure Vision tags and captions products for search and cataloging.
OCR extracts text from product labels.
LLMs use this extracted data to auto-generate product descriptions for e-commerce.
This saves time, improves accuracy, and makes systems more intelligent.
Python Implementation
1- Analyzing an Image with Azure AI Vision in Python (Image Tagging)
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import os
import json
from dotenv import load_dotenv
load_dotenv("env.env")
endpoint="https://az-computer-vision1.cognitiveservices.azure.com/"
key=os.getenv("AI_VISION_KEY")
client = ImageAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("ai_vision_test.jpg", "rb") as image:
image_details=image.read()
response=client.analyze(
image_data=image_details,
visual_features=[VisualFeatures.TAGS, VisualFeatures.CAPTION]
)
print(json.dumps(response.as_dict(), indent=4))
Code Walkthrough
Importing Required Libraries
azure.ai.vision.imageanalysisprovides theImageAnalysisClientto interact with the Azure AI Vision service.VisualFeaturesdefines which features (Tags, Captions, OCR, etc.) we want to extract.AzureKeyCredentialsecurely passes our API key to the client.dotenvis used to safely load credentials stored in an.envfile.
Setting Up Authentication
The endpoint is the URL of your Azure AI Vision resource.
The key is stored in an environment file (
env.env) for security, instead of hardcoding it into the script.
Creating the Client
- The
ImageAnalysisClientconnects to Azure’s AI Vision service using the endpoint and API key.
- The
Reading the Image
- The image (
ai_vision_test.jpg) is read in binary format because the API expects raw bytes of the image.
- The image (
Analyzing the Image
We call the
analyze()method and pass two visual features:VisualFeatures.TAGS→ Returns a list of keywords that describe the image.VisualFeatures.CAPTION→ Generates a human-readable caption summarizing the image.
Printing Results
The response is converted into a dictionary and printed in a nicely formatted JSON output using
json.dumps().This makes it easy to see what Azure Vision has detected in the image.
2- Object Detection with Azure AI Vision in Python
Another powerful feature of Azure AI Vision is object detection. It not only identifies what objects are present in an image but also returns their locations using bounding boxes. This is especially useful in applications like surveillance, manufacturing defect detection, and retail product recognition.
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import os
import json
from dotenv import load_dotenv
load_dotenv("env.env")
endpoint="https://az-computer-vision1.cognitiveservices.azure.com/"
key=os.getenv("AI_VISION_KEY")
client = ImageAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("fruit_bucket.png", "rb") as image:
image_details=image.read()
response=client.analyze(
image_data=image_details,
visual_features=[VisualFeatures.OBJECTS] # detects objects in the image with bounding boxes
)
print(json.dumps(response.as_dict(), indent=4))
Code Walkthrough
Importing Libraries
We bring in the same core classes (ImageAnalysisClient,VisualFeatures, andAzureKeyCredential) used earlier for tagging and captions.Authentication
The
endpointpoints to your Azure AI Vision resource.The
keyis securely loaded from an.envfile.
Image Input
- We load the image
fruit_bucket.pngin binary format so it can be sent to the API.
- We load the image
Object Detection Request
VisualFeatures.OBJECTStells Azure Vision to detect all visible objects in the image.The API responds with a list of objects, their confidence scores, and bounding box coordinates.
Readable Output
The
response.as_dict()method returns the structured result as a dictionary.json.dumps()formats it neatly so we can see exactly what was detected.
3- Extracting Text from Images with Azure AI Vision (OCR)
Azure AI Vision provides powerful OCR (Optical Character Recognition) capabilities. With this feature, we can detect both printed and handwritten text from images and documents. This is especially useful in scenarios like digitizing scanned documents, extracting text from receipts, or reading quotes from images.
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import os
import json
from dotenv import load_dotenv
load_dotenv("env.env")
endpoint="https://az-computer-vision1.cognitiveservices.azure.com/"
key=os.getenv("AI_VISION_KEY")
client = ImageAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("quote.jpg", "rb") as image:
image_details=image.read()
response=client.analyze(
image_data=image_details,
visual_features=[VisualFeatures.READ] # detects text in the image using Optical Character Recognition (OCR)
)
for line in response.read.blocks[0].lines:
print(line. Text)
Code Walkthrough
Authentication
The endpoint and key are loaded from the
.envfile to keep secrets secure.ImageAnalysisClientis initialized to interact with Azure AI Vision.
Reading the Image
- The file
quote.jpgis loaded in binary mode before sending it to the API.
- The file
Performing OCR
The
VisualFeatures.READoption tells Azure Vision to run OCR.The response contains detected text blocks, lines, and even word-level details if needed.
Extracting Results
- We loop through
response.read.blocks[0].linesand print each line of detected text.
- We loop through



