Dataset: SEMDataset

The SEMDataset class prepares a cache of preprocessed images in /tmp/<dataset_name> and provides convenient access to images and their paths.

Main features

Automatic caching of preprocessed images
Parallel processing across CPU cores (cpu_count()-1 worker processes by default)
Cache integrity checking and automatic update when needed
Support for folder structure with image classes

Creating dataset

from combra.data.dataset import SEMDataset

# Create dataset
# Path should not end with '/'
dataset = SEMDataset(
    images_folder_path='data/wc_co',
    max_images_num_per_class=100,
    workers=None  # Default: cpu_count()-1
)

Parameters:

images_folder_path (str): Root folder with class subfolders. Important: path should not end with /
max_images_num_per_class (int): Image limit per class. If None, all available images are used
workers (int, optional): Number of processes for parallel processing. Default is cpu_count()-1
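
Because the path must not end with /, it can help to normalize user-supplied paths before passing them as images_folder_path. The helper below is a minimal sketch of that step; `normalize_dataset_path` is a hypothetical name and is not part of combra.

```python
import os


def normalize_dataset_path(path):
    """Strip trailing path separators so the path does not end with '/'."""
    stripped = path.rstrip("/" + os.sep)
    # Keep a bare root ("/") intact instead of returning an empty string
    return stripped or path[:1]
```

The returned string can then be passed to SEMDataset as images_folder_path.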

Folder structure

The dataset expects the following structure:

data/wc_co/
├── Ultra_Co11/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── Ultra_Co25/
│   ├── image1.jpg
│   └── ...
└── Ultra_Co8/
    └── ...
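
Before creating the dataset, it may be useful to confirm that the folder matches this layout. The following sketch lists each class subfolder with its image count. The function name `summarize_classes` and the set of accepted file extensions are assumptions made for illustration; the extensions SEMDataset actually accepts are not stated here.

```python
from pathlib import Path

# Assumed set of image extensions, for illustration only
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def summarize_classes(root):
    """Return {class_folder_name: image_count} for each subfolder of root."""
    summary = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # files directly under the root are not classes
        images = [p for p in class_dir.iterdir()
                  if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES]
        summary[class_dir.name] = len(images)
    return summary
```

A class with zero images, or a stray file directly under the root, is usually a sign that the layout differs from the expected one.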

Data access

Getting image

# Get image and path by class and image indices
image, path = dataset.__getitem__(class_idx=0, idx=0)

print(f"Image path: {path}")
print(f"Image shape: {image.shape}")

Number of classes

# Number of classes in dataset
num_classes = len(dataset)
print(f"Classes in dataset: {num_classes}")

Iterating over dataset

# Iterate over all classes and images
for class_idx in range(len(dataset)):
    num_images = dataset.images_paths.shape[1]
    for img_idx in range(num_images):
        image, path = dataset.__getitem__(class_idx, img_idx)
        # Process image
        print(f"Class {class_idx}, Image {img_idx}: {path}")
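
The nested loop above can be wrapped in a generator so that calling code deals with a single flat loop. This is a sketch that relies only on the members used above (`len(dataset)`, `images_paths.shape[1]` and `__getitem__`); `iter_images` is a hypothetical helper, not part of combra. Like the loop it wraps, it reads one image count for all classes from `images_paths.shape[1]`.

```python
def iter_images(dataset):
    """Yield (class_idx, img_idx, image, path) for every image in the dataset."""
    images_per_class = dataset.images_paths.shape[1]
    for class_idx in range(len(dataset)):
        for img_idx in range(images_per_class):
            image, path = dataset.__getitem__(class_idx, img_idx)
            yield class_idx, img_idx, image, path
```

With this helper, the iteration becomes `for class_idx, img_idx, image, path in iter_images(dataset): ...`.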

Image preprocessing

The class automatically applies preprocessing to each image:

Conversion to grayscale (if necessary)
Median filtering
Otsu binarization
Gradient computation
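
To make the four steps concrete, here is a NumPy-only sketch of such a pipeline. It is not the combra implementation: it uses a square median window in place of a disk-shaped one, takes the gradient magnitude of the binarized image as the gradient step, and all function names and defaults are assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def to_grayscale(image):
    """Average the colour channels unless the image is already 2-D."""
    image = np.asarray(image, dtype=float)
    return image if image.ndim == 2 else image[..., :3].mean(axis=-1)


def median_filter(gray, size=3):
    """Median over a size x size square window (size odd); edges are replicated."""
    pad = size // 2
    padded = np.pad(gray, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))


def otsu_threshold(gray, bins=256):
    """Pick the threshold that maximises the between-class variance."""
    hist, edges = np.histogram(gray, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    weight_low = np.cumsum(hist)
    weight_high = np.cumsum(hist[::-1])[::-1]
    mean_low = np.cumsum(hist * centers) / np.maximum(weight_low, 1)
    mean_high = (np.cumsum((hist * centers)[::-1])[::-1]
                 / np.maximum(weight_high, 1))
    variance = (weight_low[:-1] * weight_high[1:]
                * (mean_low[:-1] - mean_high[1:]) ** 2)
    return centers[np.argmax(variance)]


def preprocess_sketch(image, size=3):
    """Grayscale -> median filter -> Otsu binarization -> gradient magnitude."""
    gray = to_grayscale(image)
    smoothed = median_filter(gray, size)
    binary = (smoothed > otsu_threshold(smoothed)).astype(float)
    grad_rows, grad_cols = np.gradient(binary)
    return np.hypot(grad_rows, grad_cols)
```

On an image with two flat regions, the result is nonzero only along the boundary between them, which is the kind of edge map the later angle analysis works from.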

Using preprocessing method separately

from combra.data.dataset import SEMDataset
from skimage import io

# Load image
image = io.imread('path/to/image.jpg')

# Preprocessing
processed = SEMDataset.preprocess_image(
    image,
    pad=False,  # Whether to add a border
    border=30,  # Border size in pixels
    disk=3      # Disk size for median filter
)

Caching

The dataset automatically caches preprocessed images in /tmp/<dataset_name>.
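
As an illustration of how the cache location relates to the dataset path, the sketch below derives /tmp/wc_co from data/wc_co. It assumes that <dataset_name> is the last component of images_folder_path; `cache_dir_for` is a hypothetical helper, not part of combra.

```python
import os


def cache_dir_for(images_folder_path, cache_root="/tmp"):
    """Guess the cache folder for a dataset path (assumed naming scheme)."""
    # Assumption: the dataset name is the last component of the dataset path
    dataset_name = os.path.basename(os.path.normpath(images_folder_path))
    return os.path.join(cache_root, dataset_name)
```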

Cache checking

When creating the dataset, the following checks are performed:

Cache existence
Folder structure match
Number of images

If the cache is incomplete or outdated, it is automatically recreated.
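
The three checks can be sketched roughly as follows. This is an assumption about how such a validation might look, not the library's actual code: `cache_looks_valid` is a hypothetical function, every file in a class folder is counted as an image, and the expected count is capped at max_images_num_per_class when a limit is set.

```python
from pathlib import Path


def cache_looks_valid(source_root, cache_root, max_images_num_per_class=None):
    """Apply the three checks: existence, folder structure, image counts."""
    source_root, cache_root = Path(source_root), Path(cache_root)
    if not cache_root.is_dir():
        return False  # check 1: the cache does not exist

    source_classes = sorted(p.name for p in source_root.iterdir() if p.is_dir())
    cached_classes = sorted(p.name for p in cache_root.iterdir() if p.is_dir())
    if source_classes != cached_classes:
        return False  # check 2: class folders differ

    for name in source_classes:
        expected = sum(1 for p in (source_root / name).iterdir() if p.is_file())
        if max_images_num_per_class is not None:
            expected = min(expected, max_images_num_per_class)
        cached = sum(1 for p in (cache_root / name).iterdir() if p.is_file())
        if cached != expected:
            return False  # check 3: image counts differ
    return True
```

A check of this kind compares names and counts only, so it would not notice a source image that was edited in place; clearing the cache manually covers that case.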

Clearing cache

import shutil
import os

cache_dir = '/tmp/wc_co'  # Replace with your dataset name
if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)
    print("Cache deleted")

Usage examples

Basic example

from combra.data.dataset import SEMDataset

# Create dataset
dataset = SEMDataset('data/wc_co', max_images_num_per_class=50)

# Get first image of first class
image, path = dataset.__getitem__(0, 0)
print(f"Loaded: {path}")
print(f"Size: {image.shape}")

Processing all images

from combra.data.dataset import SEMDataset
from combra import angles

dataset = SEMDataset('data/wc_co', max_images_num_per_class=100)

all_angles = []

for class_idx in range(len(dataset)):
    for img_idx in range(dataset.images_paths.shape[1]):
        image, path = dataset.__getitem__(class_idx, img_idx)
        
        # Compute angles
        angles_array, _ = angles.get_angles(image, border_eps=5, tol=3)
        all_angles.extend(angles_array)
        
        print(f"Processed: {path}")

print(f"Total angles: {len(all_angles)}")

Using with other modules

from combra.data.dataset import SEMDataset
from combra import angles, stats, approx
import matplotlib.pyplot as plt

# Create dataset
dataset = SEMDataset('data/wc_co', max_images_num_per_class=50)

# Collect angles from all images
all_angles = []
for class_idx in range(len(dataset)):
    for img_idx in range(dataset.images_paths.shape[1]):
        image, _ = dataset.__getitem__(class_idx, img_idx)
        angles_array, _ = angles.get_angles(image)
        all_angles.extend(angles_array)

# Statistical analysis
x, y = stats.stats_preprocess(all_angles, step=5)

# Approximation
(x_gauss, y_gauss), mus, sigmas, amps = approx.bimodal_gauss_approx(x, y)

# Visualization
plt.plot(x, y, 'o', label='Data')
plt.plot(x_gauss, y_gauss, '-', label='Approximation')
plt.legend()
plt.show()

Notes

Dataset path should not end with /
Cache is created once and reused on subsequent runs
Preprocessed images are saved in PNG format
All images should have the same size (this is not checked automatically)
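
Since image sizes are not checked automatically, a small check before creating the dataset can catch odd files early. The sketch below takes a mapping from path to image shape and reports the entries that differ from the most common shape; `find_size_mismatches` is a hypothetical helper, not part of combra.

```python
from collections import Counter


def find_size_mismatches(shapes_by_path):
    """Return {path: shape} for images whose shape differs from the most common one."""
    if not shapes_by_path:
        return {}
    counts = Counter(tuple(shape) for shape in shapes_by_path.values())
    reference, _ = counts.most_common(1)[0]
    return {path: tuple(shape) for path, shape in shapes_by_path.items()
            if tuple(shape) != reference}
```

The shapes can be collected, for example, by reading each file with `skimage.io.imread(path)` and taking its `.shape`.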

Class methods

`__init__(images_folder_path, max_images_num_per_class=100, workers=None)`

Initializes the dataset and prepares the cache of preprocessed images.

`__getitem__(class_idx, idx)`

Returns an image and its path by class index and image index.

Parameters:

class_idx (int): Class index
idx (int): Image index in class

Returns:

tuple: (image, path), where image is a NumPy array and path is the image path as a string

`__len__()`

Returns the number of classes in the dataset.

`preprocess_image(image, pad=False, border=30, disk=3)` (classmethod)

Preprocesses a single image. It is called on the class itself, so no dataset instance is needed.

Parameters:

image (ndarray): Input image
pad (bool): Whether to add a border
border (int): Border size in pixels
disk (int): Disk size for median filter

Returns:

ndarray: Preprocessed image
