DatasetManager and Dataset API Reference¶
Overview¶
The datasets module manages music datasets and provides quality scoring, license validation, and metadata analysis. The DatasetManager class maintains a registry of datasets, checking each dataset's license and assigning it a quality score when it is added. The Dataset class represents a single music collection, with track management and summary analytics. The DatasetLicense class describes a dataset's license terms and checks them against a list of allowed license types.
Table of Contents¶
- DatasetManager Class
  - Constructor
  - Dataset Operations
  - Quality Scoring
  - License Management
  - Statistics
- Dataset Class
  - Constructor
  - Track Management
  - Metadata Retrieval
  - Validation
- DatasetLicense Class
  - Constructor
  - License Validation
- Code Examples
- Quality Scoring Algorithm Details
DatasetManager Class¶
Constructor¶
def __init__(self, allowed_licenses: Optional[List[str]] = None) -> None
Initialize the Dataset Manager with license configuration.
Parameters:
allowed_licenses (Optional[List[str]]): List of allowed license types. Default: ['CC-BY', 'CC-BY-SA', 'CC0']
Supported licenses:
- "CC0": Public domain
- "CC-BY": Attribution required
- "CC-BY-SA": Attribution and Share-Alike
- "CC-BY-NC": Non-commercial
- "CC-BY-NC-SA": Non-commercial and Share-Alike
- "MIT": MIT License
- "Apache-2.0": Apache License
- "GPL-3.0": GNU General Public License
- "Public-Domain": Public domain
Returns: None
Side Effects:
- Initializes empty datasets dictionary
- Sets allowed licenses
- Logs initialization
Example:
from qfzz.datasets.manager import DatasetManager
# Use default licenses
manager1 = DatasetManager()
# Use custom allowed licenses
manager2 = DatasetManager(allowed_licenses=[
    "CC-BY",
    "CC-BY-SA",
    "MIT",
    "Apache-2.0"
])
Dataset Operations¶
add_dataset()¶
def add_dataset(self, dataset: Dataset) -> bool
Add a dataset to the manager with license and quality validation.
Parameters:
dataset (Dataset): Dataset object to add
Returns: bool - True if added successfully, False if the license is not compatible
Raises:
- ValueError: If dataset validation fails
Side Effects:
- Validates dataset license compatibility
- Calculates and assigns quality score
- Stores dataset in internal registry
- Logs dataset addition
Example:
from qfzz.datasets.manager import DatasetManager
from qfzz.datasets.models import Dataset, DatasetLicense
# Manager with the default allowed licenses
manager = DatasetManager()
license = DatasetLicense(
    license_type="CC-BY",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    attribution_required=True,
    commercial_use=True,
    derivative_works=True,
    share_alike=False
)
dataset = Dataset(
    dataset_id="dataset_001",
    name="Jazz Classics",
    description="A collection of classic jazz recordings",
    version="1.0.0",
    license=license,
    creator_id="creator_jazz"
)
# Add tracks to dataset
dataset.add_track({
    "track_id": "track_001",
    "title": "Autumn Leaves",
    "artist": "Bill Evans",
    "album": "Sunday at the Village Vanguard",
    "genre": "jazz",
    "mood": "mellow",
    "energy": 0.5,
    "tempo": "slow",
    "duration": 480
})
# Add to manager
success = manager.add_dataset(dataset)
if success:
    print(f"Dataset added with quality score: {dataset.quality_score:.3f}")
else:
    print("License not compatible with allowed licenses")
remove_dataset()¶
def remove_dataset(self, dataset_id: str) -> bool
Remove a dataset from the manager.
Parameters:
dataset_id (str): Dataset identifier
Returns: bool - True if removed, False if not found
Side Effects:
- Removes dataset from internal registry
- Logs removal
Example:
removed = manager.remove_dataset("dataset_001")
if removed:
    print("Dataset removed successfully")
else:
    print("Dataset not found")
get_dataset()¶
def get_dataset(self, dataset_id: str) -> Optional[Dataset]
Retrieve a dataset by ID.
Parameters:
dataset_id (str): Dataset identifier
Returns: Dataset if found, None otherwise
Example:
dataset = manager.get_dataset("dataset_001")
if dataset:
    print(f"Dataset: {dataset.name}")
    print(f"Tracks: {dataset.get_track_count()}")
    print(f"Quality: {dataset.quality_score:.3f}")
else:
    print("Dataset not found")
list_datasets()¶
def list_datasets(self, min_quality: Optional[float] = None) -> List[Dataset]
List all datasets with optional quality filtering.
Parameters:
min_quality (Optional[float]): Minimum quality score filter (0.0-1.0)
Returns: List[Dataset] - Datasets sorted by quality (highest first)
Example:
# Get all datasets
all_datasets = manager.list_datasets()
print(f"Total datasets: {len(all_datasets)}")
# Get high-quality datasets only
high_quality = manager.list_datasets(min_quality=0.8)
for dataset in high_quality:
    print(f" {dataset.name}: {dataset.quality_score:.3f}")
# Get medium-quality datasets
medium_quality = manager.list_datasets(min_quality=0.5)
print(f"Medium+ quality: {len(medium_quality)} datasets")
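The filter-then-sort behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `ScoredDataset` and `filter_and_sort` are hypothetical names, and treating `min_quality` as an inclusive threshold is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredDataset:
    """Hypothetical stand-in for Dataset; only the fields needed here."""
    name: str
    quality_score: float


def filter_and_sort(
    datasets: List[ScoredDataset], min_quality: Optional[float] = None
) -> List[ScoredDataset]:
    """Keep datasets at or above min_quality (assumed inclusive), best first."""
    selected = [
        d for d in datasets
        if min_quality is None or d.quality_score >= min_quality
    ]
    return sorted(selected, key=lambda d: d.quality_score, reverse=True)


pool = [
    ScoredDataset("Blues Basics", 0.55),
    ScoredDataset("Jazz Classics", 0.91),
    ScoredDataset("Demo Set", 0.30),
]
print([d.name for d in filter_and_sort(pool, min_quality=0.5)])
```

Passing `min_quality=None` returns every dataset, still ordered from highest to lowest score.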
Quality Scoring¶
calculate_quality_score()¶
def calculate_quality_score(self, dataset: Dataset) -> float
Calculate a weighted quality score for a dataset from the factors listed below.
Parameters:
dataset (Dataset): Dataset to score
Returns: float - Quality score (0.0-1.0)
Scoring Factors:
| Factor | Weight | Criteria |
|---|---|---|
| Metadata Completeness | 30% | Required/optional fields present |
| Data Consistency | 25% | Field consistency across tracks, valid values |
| Dataset Size | 20% | Number of tracks (tiered, roughly logarithmic scale) |
| Diversity | 15% | Genre and artist variety |
| License Permissiveness | 10% | Commercial use, derivatives allowed |
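The weights in the table add up to 100%, so the overall score can be read as a weighted sum of the five factor scores. A minimal sketch, assuming each factor has already been scored in the range 0.0 to 1.0; `WEIGHTS` and `combine_scores` are hypothetical names, not part of the documented API:

```python
from typing import Dict

# Weights copied from the Scoring Factors table; the key names are hypothetical.
WEIGHTS: Dict[str, float] = {
    "metadata_completeness": 0.30,
    "data_consistency": 0.25,
    "dataset_size": 0.20,
    "diversity": 0.15,
    "license_permissiveness": 0.10,
}


def combine_scores(factors: Dict[str, float]) -> float:
    """Weighted sum of per-factor scores, clamped to [0.0, 1.0].

    Missing factors count as 0.0 (an assumption of this sketch).
    """
    total = sum(weight * factors.get(name, 0.0) for name, weight in WEIGHTS.items())
    return max(0.0, min(1.0, total))


example = {
    "metadata_completeness": 0.94,
    "data_consistency": 1.0,
    "dataset_size": 1.0,
    "diversity": 0.3,
    "license_permissiveness": 1.0,
}
print(f"{combine_scores(example):.3f}")  # prints 0.877
```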
Metadata Completeness Scoring:
- Required fields (70% of score): title, artist, genre, duration
- Optional fields (30% of score): album, year, mood, energy, tempo
Example:
# Create a high-quality dataset
license = DatasetLicense(
    license_type="CC-BY",
    license_url="https://example.com/license",
    commercial_use=True,
    derivative_works=True
)
dataset = Dataset(
    dataset_id="high_quality",
    name="Complete Jazz Collection",
    description="500 tracks with near-complete metadata",
    version="2.0",
    license=license,
    creator_id="jazz_curator"
)
# Add 500 tracks with all required fields and most optional fields (no year)
for i in range(500):
    dataset.add_track({
        "track_id": f"track_{i:04d}",
        "title": f"Track {i}",
        "artist": f"Artist {i % 20}",
        "album": f"Album {i // 50}",
        "genre": ["jazz", "blues", "soul", "funk"][i % 4],
        "mood": ["calm", "upbeat", "energetic", "mellow"][i % 4],
        "energy": (i % 10) / 10.0,
        "tempo": ["slow", "medium", "fast"][i % 3],
        "duration": 200 + (i % 200)
    })
quality = manager.calculate_quality_score(dataset)
print(f"Quality score: {quality:.3f}") # ~0.85-0.95
License Management¶
validate_license()¶
def validate_license(self, license: DatasetLicense) -> bool
Check whether a license is acceptable to this manager.
Parameters:
license (DatasetLicense): License to validate
Returns: bool - True if valid, False otherwise
Example:
# Valid license
valid_license = DatasetLicense(
    license_type="CC-BY",
    license_url="https://example.com/cc-by"
)
if manager.validate_license(valid_license):
    print("License accepted")
# Invalid license
invalid_license = DatasetLicense(
    license_type="Proprietary",
    license_url="https://example.com/proprietary"
)
if not manager.validate_license(invalid_license):
    print("License not in allowed list")
Statistics¶
get_statistics()¶
def get_statistics(self) -> Dict[str, Any]
Get comprehensive statistics about managed datasets.
Parameters: None
Returns: Dict containing:
- total_datasets (int): Number of managed datasets
- total_tracks (int): Total tracks across all datasets
- average_quality_score (float): Mean quality score
- unique_genres (int): Count of unique genres
- unique_artists (int): Count of unique artists
- allowed_licenses (List[str]): Configured allowed licenses
Example:
stats = manager.get_statistics()
print("=== Dataset Statistics ===")
print(f"Datasets: {stats['total_datasets']}")
print(f"Total tracks: {stats['total_tracks']}")
print(f"Average quality: {stats['average_quality_score']:.3f}")
print(f"Unique genres: {stats['unique_genres']}")
print(f"Unique artists: {stats['unique_artists']}")
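The aggregate fields can be sketched over plain dictionaries. This is an illustration only: the `summarize` helper and the dictionary shape are hypothetical, and counting genres and artists as unique across all datasets (rather than per dataset) is an assumption.

```python
from typing import Any, Dict, List


def summarize(datasets: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate counts over datasets given as plain dicts (hypothetical shape)."""
    genres, artists = set(), set()
    total_tracks = 0
    for ds in datasets:
        total_tracks += len(ds["tracks"])
        for track in ds["tracks"]:
            if track.get("genre"):
                genres.add(track["genre"])
            if track.get("artist"):
                artists.add(track["artist"])
    scores = [ds["quality_score"] for ds in datasets]
    return {
        "total_datasets": len(datasets),
        "total_tracks": total_tracks,
        # Avoid dividing by zero when no datasets are registered.
        "average_quality_score": sum(scores) / len(scores) if scores else 0.0,
        "unique_genres": len(genres),
        "unique_artists": len(artists),
    }


sample = [
    {"quality_score": 0.8, "tracks": [{"genre": "jazz", "artist": "A"},
                                      {"genre": "rock", "artist": "B"}]},
    {"quality_score": 0.6, "tracks": [{"genre": "jazz", "artist": "C"}]},
]
summary = summarize(sample)
print(summary["total_tracks"], summary["unique_genres"], summary["unique_artists"])  # prints 3 2 3
```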
Dataset Class¶
Constructor¶
@dataclass
class Dataset:
dataset_id: str
name: str
description: str
version: str
license: DatasetLicense
creator_id: str
tracks: List[Dict[str, Any]] = field(default_factory=list)
metadata: Dict[str, Any] = field(default_factory=dict)
quality_score: float = 0.0
created_at: str = field(default_factory=lambda: datetime.now().isoformat())
updated_at: str = field(default_factory=lambda: datetime.now().isoformat())
Create a music dataset with tracks and metadata.
Parameters:
- dataset_id (str): Unique dataset identifier
- name (str): Human-readable dataset name
- description (str): Dataset description
- version (str): Version string (e.g., "1.0.0")
- license (DatasetLicense): License information
- creator_id (str): Creator identifier
- tracks (List): Track list, default empty
- metadata (Dict): Additional metadata, default empty
- quality_score (float): Quality score (0.0-1.0), default 0.0
- created_at (str): ISO-8601 creation timestamp
- updated_at (str): ISO-8601 update timestamp
Raises: ValueError if validation fails
Example:
from qfzz.datasets.models import Dataset, DatasetLicense
license = DatasetLicense(
    license_type="CC-BY-SA",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
    attribution_required=True,
    commercial_use=True,
    derivative_works=True,
    share_alike=True
)
dataset = Dataset(
    dataset_id="rock_anthems",
    name="Rock Anthems Collection",
    description="Essential rock music tracks",
    version="1.5.0",
    license=license,
    creator_id="curator_rock",
    metadata={
        "category": "rock",
        "year_range": "1970-2023",
        "language": "en"
    }
)
Track Management¶
add_track()¶
def add_track(self, track: Dict[str, Any]) -> None
Add a track to the dataset.
Parameters:
track (Dict[str, Any]): Track dictionary with fields:
- track_id (str): Unique track ID
- title (str): Track title
- artist (str): Artist name
- genre (str): Music genre
- Plus any optional fields (album, year, mood, energy, tempo, etc.)
Returns: None
Side Effects:
- Appends track to tracks list
- Updates updated_at timestamp
Example:
dataset.add_track({
    "track_id": "track_001",
    "title": "Stairway to Heaven",
    "artist": "Led Zeppelin",
    "album": "Led Zeppelin IV",
    "genre": "rock",
    "year": 1971,
    "mood": "epic",
    "energy": 0.8,
    "tempo": "varied",
    "duration": 482
})
remove_track()¶
def remove_track(self, track_id: str) -> bool
Remove a track from the dataset.
Parameters:
track_id (str): Track identifier
Returns: bool - True if removed, False if not found
Side Effects:
- Removes track from tracks list
- Updates updated_at timestamp if successful
Example:
if dataset.remove_track("track_001"):
    print("Track removed")
else:
    print("Track not found")
Metadata Retrieval¶
get_track_count()¶
def get_track_count(self) -> int
Get the number of tracks in the dataset.
Parameters: None
Returns: int - Number of tracks
Example:
count = dataset.get_track_count()
print(f"Dataset contains {count} tracks")
get_total_duration()¶
def get_total_duration(self) -> int
Get total duration of all tracks in seconds.
Parameters: None
Returns: int - Total duration in seconds
Example:
total_seconds = dataset.get_total_duration()
hours = total_seconds / 3600
print(f"Total duration: {hours:.1f} hours")
get_genres()¶
def get_genres(self) -> List[str]
Get unique genres in dataset.
Parameters: None
Returns: List[str] - Sorted list of unique genre names
Example:
genres = dataset.get_genres()
print(f"Genres ({len(genres)}): {', '.join(genres)}")
get_artists()¶
def get_artists(self) -> List[str]
Get unique artists in dataset.
Parameters: None
Returns: List[str] - Sorted list of unique artist names
Example:
artists = dataset.get_artists()
print(f"Artists ({len(artists)}): {', '.join(artists[:10])}...")
to_dict()¶
def to_dict(self) -> Dict[str, Any]
Convert dataset to dictionary representation.
Parameters: None
Returns: Dict with all dataset attributes plus computed fields:
- track_count (int): Number of tracks
- total_duration (int): Total duration in seconds
Example:
dataset_dict = dataset.to_dict()
print(dataset_dict)
# {
# 'dataset_id': 'rock_anthems',
# 'name': 'Rock Anthems Collection',
# 'track_count': 250,
# 'total_duration': 54000,
# 'quality_score': 0.82,
# ...
# }
Validation¶
validate()¶
def validate(self) -> None
Validate dataset parameters.
Parameters: None
Returns: None
Raises: ValueError if any parameter invalid
Validation Rules:
- dataset_id: Must be non-empty
- name: Must be non-empty
- version: Must be non-empty
- creator_id: Must be non-empty
- quality_score: Must be 0.0-1.0
Example:
# `license` is a DatasetLicense instance, as in the Constructor example
try:
    dataset = Dataset(
        dataset_id="", # Invalid!
        name="Test",
        description="Test",
        version="1.0",
        license=license,
        creator_id="test"
    )
except ValueError as e:
    print(f"Validation error: {e}")
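One common way for a dataclass to raise at construction time is to call its validation method from `__post_init__`. The sketch below applies the rules listed above to a reduced, hypothetical `MiniDataset`; whether `Dataset` itself is wired this way, and the exact error messages, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MiniDataset:
    """Hypothetical, reduced stand-in for Dataset (validated fields only)."""
    dataset_id: str
    name: str
    version: str
    creator_id: str
    quality_score: float = 0.0

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so invalid objects never exist.
        self.validate()

    def validate(self) -> None:
        for field_name in ("dataset_id", "name", "version", "creator_id"):
            if not getattr(self, field_name):
                raise ValueError(f"{field_name} must be non-empty")
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError("quality_score must be between 0.0 and 1.0")


try:
    MiniDataset(dataset_id="", name="Test", version="1.0", creator_id="test")
except ValueError as exc:
    print(f"Validation error: {exc}")  # prints: Validation error: dataset_id must be non-empty
```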
DatasetLicense Class¶
Constructor¶
@dataclass
class DatasetLicense:
license_type: str
license_url: str
attribution_required: bool = True
commercial_use: bool = True
derivative_works: bool = True
share_alike: bool = False
additional_terms: str = ""
Create license information for a dataset.
Parameters:
- license_type (str): License type identifier
- license_url (str): URL to license text
- attribution_required (bool): Attribution requirement, default True
- commercial_use (bool): Commercial use allowed, default True
- derivative_works (bool): Derivative works allowed, default True
- share_alike (bool): Share-alike requirement, default False
- additional_terms (str): Additional terms text, default ""
Example:
from qfzz.datasets.models import DatasetLicense
# Creative Commons BY license
cc_by = DatasetLicense(
    license_type="CC-BY",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    attribution_required=True,
    commercial_use=True,
    derivative_works=True,
    share_alike=False
)
# Creative Commons BY-SA license
cc_by_sa = DatasetLicense(
    license_type="CC-BY-SA",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
    attribution_required=True,
    commercial_use=True,
    derivative_works=True,
    share_alike=True,
    additional_terms="Derivative works must use CC-BY-SA"
)
# MIT License
mit = DatasetLicense(
    license_type="MIT",
    license_url="https://opensource.org/licenses/MIT",
    attribution_required=False,
    commercial_use=True,
    derivative_works=True,
    share_alike=False
)
License Validation¶
is_compatible_with()¶
def is_compatible_with(self, allowed_licenses: List[str]) -> bool
Check whether this license is compatible with a list of allowed licenses.
Parameters:
allowed_licenses (List[str]): List of allowed license type strings
Returns: bool - True if compatible, False otherwise
Example:
cc_by = DatasetLicense(
    license_type="CC-BY",
    license_url="https://example.com/cc-by"
)
allowed = ["CC-BY", "CC-BY-SA", "CC0"]
if cc_by.is_compatible_with(allowed):
    print("License is compatible")
proprietary = DatasetLicense(
    license_type="Proprietary",
    license_url="https://example.com/proprietary"
)
if not proprietary.is_compatible_with(allowed):
    print("License not in allowed list")
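The example is consistent with a plain membership test on the license type string. A minimal sketch under that assumption; `is_type_allowed` is a hypothetical helper, and the real method may normalize case or apply additional rules:

```python
from typing import List


def is_type_allowed(license_type: str, allowed_licenses: List[str]) -> bool:
    """Exact, case-sensitive membership test (an assumption of this sketch)."""
    return license_type in allowed_licenses


allowed_types = ["CC-BY", "CC-BY-SA", "CC0"]
print(is_type_allowed("CC-BY", allowed_types))        # prints True
print(is_type_allowed("Proprietary", allowed_types))  # prints False
print(is_type_allowed("cc-by", allowed_types))        # prints False (case differs)
```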
to_dict()¶
def to_dict(self) -> Dict[str, Any]
Convert license to dictionary representation.
Parameters: None
Returns: Dict with all license attributes
Code Examples¶
Complete Dataset Management Workflow¶
from qfzz.datasets.manager import DatasetManager
from qfzz.datasets.models import Dataset, DatasetLicense
# Initialize manager with allowed licenses
manager = DatasetManager(allowed_licenses=[
    "CC-BY",
    "CC-BY-SA",
    "CC0",
    "MIT"
])
# Create three datasets
datasets = []
for i in range(3):
    license_type = ["CC-BY", "CC-BY-SA", "CC0"][i]
    license = DatasetLicense(
        license_type=license_type,
        license_url=f"https://example.com/{license_type}",
        commercial_use=True,
        derivative_works=True
    )
    dataset = Dataset(
        dataset_id=f"dataset_{i:03d}",
        name=f"Collection {i+1}",
        description=f"Music collection {i+1}",
        version="1.0.0",
        license=license,
        creator_id=f"curator_{i}"
    )
    # Add tracks
    for j in range(100):
        dataset.add_track({
            "track_id": f"track_{i:03d}_{j:04d}",
            "title": f"Track {j}",
            "artist": f"Artist {j % 10}",
            "genre": ["rock", "pop", "jazz"][j % 3],
            "mood": ["upbeat", "calm", "energetic"][j % 3],
            "energy": (j % 10) / 10.0,
            "tempo": ["slow", "medium", "fast"][j % 3],
            "duration": 200 + (j % 200)
        })
    datasets.append(dataset)
# Add datasets to manager
for dataset in datasets:
    if manager.add_dataset(dataset):
        print(f"✓ {dataset.name}: {dataset.quality_score:.3f}")
    else:
        print(f"✗ {dataset.name}: License not compatible")
# Get statistics
stats = manager.get_statistics()
print("\n=== Statistics ===")
print(f"Total datasets: {stats['total_datasets']}")
print(f"Total tracks: {stats['total_tracks']}")
print(f"Average quality: {stats['average_quality_score']:.3f}")
print(f"Unique genres: {stats['unique_genres']}")
print(f"Unique artists: {stats['unique_artists']}")
# List high-quality datasets
print("\n=== High Quality Datasets ===")
high_quality = manager.list_datasets(min_quality=0.75)
for dataset in high_quality:
    print(f" {dataset.name}")
    print(f" Tracks: {dataset.get_track_count()}")
    print(f" Genres: {', '.join(dataset.get_genres()[:3])}")
    print(f" Quality: {dataset.quality_score:.3f}")
Quality Scoring Algorithm Details¶
Metadata Completeness (30%)¶
Score = (Required Fields × 0.7) + (Optional Fields × 0.3)
Required: title, artist, genre, duration
Optional: album, year, mood, energy, tempo
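The formula above can be sketched as follows. This assumes each term is the fraction of fields present, averaged over all tracks, and that `None` or an empty string counts as missing; the helper names are hypothetical and the library may aggregate differently.

```python
from typing import Any, Dict, List, Sequence

REQUIRED_FIELDS = ("title", "artist", "genre", "duration")
OPTIONAL_FIELDS = ("album", "year", "mood", "energy", "tempo")


def present_fraction(track: Dict[str, Any], fields: Sequence[str]) -> float:
    """Fraction of the given fields that are present and non-empty in a track."""
    present = sum(1 for name in fields if track.get(name) not in (None, ""))
    return present / len(fields)


def metadata_completeness(tracks: List[Dict[str, Any]]) -> float:
    """Required fraction x 0.7 plus optional fraction x 0.3, averaged over tracks."""
    if not tracks:
        return 0.0
    required = sum(present_fraction(t, REQUIRED_FIELDS) for t in tracks) / len(tracks)
    optional = sum(present_fraction(t, OPTIONAL_FIELDS) for t in tracks) / len(tracks)
    return required * 0.7 + optional * 0.3


# All required fields, four of five optional fields (no year).
track = {
    "title": "Autumn Leaves", "artist": "Bill Evans", "genre": "jazz",
    "duration": 480, "album": "Sunday at the Village Vanguard",
    "mood": "mellow", "energy": 0.5, "tempo": "slow",
}
print(f"{metadata_completeness([track]):.2f}")  # prints 0.94
```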
Data Consistency (25%)¶
Score = (Field Consistency × 0.5) + (Value Validity × 0.5)
Field Consistency: How consistently tracks share the same set of fields
Value Validity: Whether values pass checks such as duration > 0 and energy in [0.0, 1.0]
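One possible reading of the two terms is sketched below. The source does not say how either term is measured, so both definitions here are assumptions: field consistency is taken as the share of tracks whose field set matches the most common one, and value validity as the share of tracks passing the two checks named above. All function names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def field_consistency(tracks: List[Dict[str, Any]]) -> float:
    """Share of tracks whose set of keys equals the most common key set."""
    if not tracks:
        return 0.0
    key_sets = Counter(frozenset(track) for track in tracks)
    most_common_count = key_sets.most_common(1)[0][1]
    return most_common_count / len(tracks)


def has_valid_values(track: Dict[str, Any]) -> bool:
    """Check duration > 0 and 0.0 <= energy <= 1.0 when those fields are present."""
    duration = track.get("duration")
    if duration is not None and not (isinstance(duration, (int, float)) and duration > 0):
        return False
    energy = track.get("energy")
    if energy is not None and not (isinstance(energy, (int, float)) and 0.0 <= energy <= 1.0):
        return False
    return True


def data_consistency(tracks: List[Dict[str, Any]]) -> float:
    """Field consistency x 0.5 plus value validity x 0.5."""
    if not tracks:
        return 0.0
    validity = sum(has_valid_values(t) for t in tracks) / len(tracks)
    return field_consistency(tracks) * 0.5 + validity * 0.5


tracks = [
    {"title": "A", "duration": 210, "energy": 0.5},
    {"title": "B", "duration": -1, "energy": 0.5},  # invalid duration
]
print(data_consistency(tracks))  # prints 0.75
```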
Dataset Size (20%)¶
Tiered scoring that grows roughly logarithmically with track count:
- 0 tracks: 0.0
- <10 tracks: 0.2
- <50 tracks: 0.4
- <100 tracks: 0.6
- <500 tracks: 0.8
- 500+ tracks: 1.0
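The size thresholds can be written as a small lookup. A minimal sketch with a hypothetical `size_score` function; the thresholds and scores come from the list above, and treating each bound as exclusive follows the "<" notation.

```python
# (exclusive upper bound, score) pairs taken from the thresholds listed above
SIZE_TIERS = ((10, 0.2), (50, 0.4), (100, 0.6), (500, 0.8))


def size_score(track_count: int) -> float:
    """Map a track count to its tier score."""
    if track_count <= 0:
        return 0.0
    for upper_bound, score in SIZE_TIERS:
        if track_count < upper_bound:
            return score
    return 1.0  # 500 tracks or more


for count in (0, 9, 10, 99, 100, 499, 500):
    print(count, size_score(count))
```

A count that sits exactly on a threshold falls into the next tier, so 10 tracks score 0.4 and 500 tracks score 1.0.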
Diversity (15%)¶
Score = (Genre Diversity × 0.5) + (Artist Diversity × 0.5)
Genre: min(1.0, unique_genres / 10)
Artist: min(1.0, unique_artists / (tracks / 5))
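The two diversity terms can be sketched directly from the formulas above. The `diversity_score` name is hypothetical, and returning 0.0 for an empty dataset is an assumption made here to avoid dividing by zero in the artist term.

```python
from typing import Any, Dict, List


def diversity_score(tracks: List[Dict[str, Any]]) -> float:
    """Genre diversity x 0.5 plus artist diversity x 0.5."""
    if not tracks:
        return 0.0  # assumption: empty datasets have no diversity
    genres = {t["genre"] for t in tracks if t.get("genre")}
    artists = {t["artist"] for t in tracks if t.get("artist")}
    genre_diversity = min(1.0, len(genres) / 10)
    artist_diversity = min(1.0, len(artists) / (len(tracks) / 5))
    return genre_diversity * 0.5 + artist_diversity * 0.5


# 500 tracks, 4 genres, 20 artists: 0.5 * 0.4 + 0.5 * 0.2
tracks = [
    {"genre": ["jazz", "blues", "soul", "funk"][i % 4], "artist": f"Artist {i % 20}"}
    for i in range(500)
]
print(f"{diversity_score(tracks):.2f}")  # prints 0.30
```

Under these formulas, full artist diversity requires at least one distinct artist per five tracks, and full genre diversity requires ten or more genres.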
License Permissiveness (10%)¶
Base Score: 0.5
+ 0.2 if commercial_use allowed
+ 0.2 if derivative_works allowed
+ 0.1 if share_alike NOT required
= 0.5 to 1.0 (the base score is always included)
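The increments can be sketched as a short function. `license_permissiveness` is a hypothetical name; with the base score and increments as listed, the result falls between 0.5 and 1.0.

```python
def license_permissiveness(
    commercial_use: bool, derivative_works: bool, share_alike: bool
) -> float:
    """Base 0.5 plus the increments listed above, capped at 1.0."""
    score = 0.5
    if commercial_use:
        score += 0.2
    if derivative_works:
        score += 0.2
    if not share_alike:
        score += 0.1
    return min(1.0, score)


# CC-BY style terms: commercial use and derivatives allowed, no share-alike
print(f"{license_permissiveness(True, True, False):.1f}")   # prints 1.0
# Most restrictive combination covered by the formula
print(f"{license_permissiveness(False, False, True):.1f}")  # prints 0.5
```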
Version Information¶
- Module: qfzz.datasets.manager, qfzz.datasets.models
- Classes: DatasetManager, Dataset, DatasetLicense
- Python: 3.8+
- Dependencies: dataclasses, typing, datetime, enum