Dataset: SEMDataset
Dataset: SEMDataset
The SEMDataset class prepares a cache of preprocessed images in /tmp/<dataset_name> and provides convenient access to images and their paths.
Main features
- Automatic caching of preprocessed images
- Parallel processing using all available CPU cores
- Cache integrity checking and automatic update when needed
- Support for folder structure with image classes
Creating dataset
from combra.data.dataset import SEMDataset
# Create dataset
# Path should not end with '/'
dataset = SEMDataset(
images_folder_path='data/wc_co',
max_images_num_per_class=100,
workers=None # Default: cpu_count()-1
)Parameters:
images_folder_path(str): Root folder with class subfolders. Important: path should not end with/max_images_num_per_class(int): Image limit per class. IfNone, all available images are usedworkers(int, optional): Number of processes for parallel processing. Default iscpu_count()-1
Folder structure
The dataset expects the following structure:
data/wc_co/
├── Ultra_Co11/
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
├── Ultra_Co25/
│ ├── image1.jpg
│ └── ...
└── Ultra_Co8/
└── ...Data access
Getting image
# Get image and path by class and image indices
image, path = dataset.__getitem__(class_idx=0, idx=0)
print(f"Image path: {path}")
print(f"Image shape: {image.shape}")Number of classes
# Number of classes in dataset
num_classes = len(dataset)
print(f"Classes in dataset: {num_classes}")Iterating over dataset
# Iterate over all classes and images
for class_idx in range(len(dataset)):
num_images = dataset.images_paths.shape[1]
for img_idx in range(num_images):
image, path = dataset.__getitem__(class_idx, img_idx)
# Process image
print(f"Class {class_idx}, Image {img_idx}: {path}")Image preprocessing
The class automatically applies preprocessing to each image:
- Conversion to grayscale (if necessary)
- Median filtering
- Otsu binarization
- Gradient computation
Using preprocessing method separately
from combra.data.dataset import SEMDataset
from skimage import io
# Load image
image = io.imread('path/to/image.jpg')
# Preprocessing
processed = SEMDataset.preprocess_image(
image,
pad=False, # Add border
border=30, # Border size in pixels
disk=3 # Disk size for median filter
)Caching
The dataset automatically caches preprocessed images in /tmp/<dataset_name>.
Cache checking
When creating the dataset, the following checks are performed:
- Cache existence
- Folder structure match
- Number of images
If the cache is incomplete or outdated, it is automatically recreated.
Clearing cache
import shutil
import os
cache_dir = '/tmp/wc_co' # Replace with your dataset name
if os.path.exists(cache_dir):
shutil.rmtree(cache_dir)
print("Cache deleted")Usage examples
Basic example
from combra.data.dataset import SEMDataset
# Create dataset
dataset = SEMDataset('data/wc_co', max_images_num_per_class=50)
# Get first image of first class
image, path = dataset.__getitem__(0, 0)
print(f"Loaded: {path}")
print(f"Size: {image.shape}")Processing all images
from combra.data.dataset import SEMDataset
from combra import angles
dataset = SEMDataset('data/wc_co', max_images_num_per_class=100)
all_angles = []
for class_idx in range(len(dataset)):
for img_idx in range(dataset.images_paths.shape[1]):
image, path = dataset.__getitem__(class_idx, img_idx)
# Compute angles
angles_array, _ = angles.get_angles(image, border_eps=5, tol=3)
all_angles.extend(angles_array)
print(f"Processed: {path}")
print(f"Total angles: {len(all_angles)}")Using with other modules
from combra.data.dataset import SEMDataset
from combra import angles, stats, approx
import matplotlib.pyplot as plt
# Create dataset
dataset = SEMDataset('data/wc_co', max_images_num_per_class=50)
# Collect angles from all images
all_angles = []
for class_idx in range(len(dataset)):
for img_idx in range(dataset.images_paths.shape[1]):
image, _ = dataset.__getitem__(class_idx, img_idx)
angles_array, _ = angles.get_angles(image)
all_angles.extend(angles_array)
# Statistical analysis
x, y = stats.stats_preprocess(all_angles, step=5)
# Approximation
(x_gauss, y_gauss), mus, sigmas, amps = approx.bimodal_gauss_approx(x, y)
# Visualization
plt.plot(x, y, 'o', label='Data')
plt.plot(x_gauss, y_gauss, '-', label='Approximation')
plt.legend()
plt.show()Notes
- Dataset path should not end with
/ - Cache is created once and reused on subsequent runs
- Preprocessed images are saved in PNG format
- All images should have the same size (check is not performed automatically)
Class methods
__init__(images_folder_path, max_images_num_per_class=100, workers=None)
Initialize dataset.
__getitem__(class_idx, idx)
Get image and path.
Parameters:
class_idx(int): Class indexidx(int): Image index in class
Returns:
tuple:(image, path)whereimage- numpy array,path- string with path
__len__()
Returns the number of classes in the dataset.
preprocess_image(image, pad=False, border=30, disk=3) (classmethod)
Static method for preprocessing a single image.
Parameters:
image(ndarray): Input imagepad(bool): Add borderborder(int): Border sizedisk(int): Disk size for median filter
Returns:
ndarray: Preprocessed image