Training agentic models that can effectively use tools remains one of the harder problems in applied ML. Models trained on purely synthetic data - where tool calls and their responses are both generated by an LLM - consistently underperform when deployed against real systems. They struggle with error recovery, mishandle state dependencies, and often exhibit what we call "time travel" errors: acting on information they haven't actually received yet.
This post introduces DeepFabric's execution-based tool tracing system, which replaces simulated tool outputs with real execution inside WebAssembly sandboxes. The result is training data grounded in actual system behavior, including the messy parts that make real-world tool use challenging.
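To make the distinction concrete, here is a minimal sketch in Python. It is not DeepFabric's actual API: the tool, the function names, and the locally-run stand-in for the sandbox are all hypothetical. The point is where the tool result comes from: in a purely synthetic pipeline an LLM invents it, while in an execution-grounded pipeline the call is actually run (in DeepFabric's case inside a WebAssembly sandbox) and whatever comes back, including errors, is what lands in the trace.

```python
import json

def get_weather(city: str) -> dict:
    # Stand-in tool. In an execution-grounded pipeline this would run inside
    # a WebAssembly sandbox rather than directly in the host process.
    return {"city": city, "temp_c": 14, "conditions": "overcast"}

TOOLS = {"get_weather": get_weather}

def simulated_tool_result(tool_call: dict) -> dict:
    # Purely synthetic pipeline: an LLM invents the output, so errors, latency,
    # and state never reflect a real system. (Hard-coded placeholder here.)
    return {"city": tool_call["arguments"]["city"], "temp_c": 21, "conditions": "sunny"}

def executed_tool_result(tool_call: dict) -> dict:
    # Execution-grounded pipeline: actually run the tool and record what it
    # returns, including the failures a model must learn to recover from.
    fn = TOOLS[tool_call["name"]]
    try:
        return fn(**tool_call["arguments"])
    except Exception as exc:
        return {"error": type(exc).__name__, "message": str(exc)}

call = {"name": "get_weather", "arguments": {"city": "Berlin"}}
print("simulated:", json.dumps(simulated_tool_result(call)))
print("executed: ", json.dumps(executed_tool_result(call)))

# The executed result, not the invented one, is what gets written into the
# training trace as the tool message following the assistant's tool call.
trace = [
    {"role": "assistant", "tool_calls": [call]},
    {"role": "tool", "content": json.dumps(executed_tool_result(call))},
]
print(json.dumps(trace, indent=2))
```

Because the tool message is recorded only after the call actually runs, the trace cannot contain a "time travel" error: the assistant never sees, or acts on, a result that does not yet exist.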
Consider a typical synthetic data generation pipeline for tool-using agents. An LLM generates a user request, then generates an assistant response with tool calls,