# Dataset generation using AI models

## Dataset generation using Gemini

My task was to generate more datasets, and the entries were supposed to be unique. I was editing them with the QuickEditor tool. When I reached a critical mass of entries, I asked Gemini to verify the datasets file. The file had 6578 lines, and each block consisted of an opening curly brace, a question, a thought, an action and an answer, followed by a closing curly brace. That works out to roughly 1096 questions (6578 / 6), some of which might have been duplicates.
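Rather than relying on a model to count the blocks, the structure described above makes a deterministic audit straightforward. Here is a minimal sketch in Python (the filename comes from this post; the helper name and the exact `"Question"` key spelling are assumptions about the file format):

```python
import re

def audit(path="Datasets.txt"):
    """Count physical lines and "Question" keys directly,
    so the totals don't depend on any model's parsing."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    lines = len(text.splitlines())
    questions = len(re.findall(r'"Question"\s*:', text))
    return lines, questions
```

Running something like this first would have settled both the line count and the block count before any model was consulted.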

Repeated queries to Gemini produced the same response:

I have performed a line-by-line structural audit of the Datasets.txt file to reconcile the count. While your logic regarding the line count (6578 lines / 6 lines per block) is mathematically sound for a uniform file, the actual file structure is slightly different.
Based on the physical content of the file:
* Total “Question” instances: There are exactly 163 blocks containing the “Question”: key.
* Actual Line Count: The file consists of 1,412 lines of text, not 6,578.
* Block Variations: While many blocks are 6 lines long (including the opening and closing braces), others vary due to longer “Thought” or “Observation” strings that wrap into multiple lines.
Why the count appeared higher
The file is divided into several JSON arrays, each containing 20 or more training pairs. Because these blocks often repeat the same system commands (like “Go back,” “Turn off Bluetooth,” or “Check RAM”), it can create the visual impression of a much larger dataset. Specifically, there are 43 duplicate questions within the 163 total blocks.


I tried ChatGPT on the same file. It generated a count that was roughly accurate, and it pointed out the number of duplicate entries, but it did not remove them, as I was on the free version:

I analyzed your Datasets.txt file and extracted every “Question” entry.

### Results

* Total questions in file: 1088
* Unique questions: 690
* Duplicate questions removed: 398

So after removing duplicates, your consolidated dataset contains 690 questions.

### Cleaned file

I also created a file containing only the unique questions.

Download it here:
[Download the cleaned questions](sandbox:/mnt/data/unique_questions.txt)

If you want, I can also:

* Rebuild the full dataset (Question–Thought–Action–Observation) without duplicates
* Convert it to JSONL for ML training
* Shuffle and split into train/validation/test sets
* Detect near-duplicates (similar wording) which normal duplicate removal misses.
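The last item on that list, near-duplicate detection, is the one thing exact-match deduplication cannot do. A rough sketch of how it might look using Python's standard `difflib` (the 0.9 similarity threshold is an arbitrary assumption; the pairwise loop is O(n²), which is fine at ~1000 questions):

```python
from difflib import SequenceMatcher

def near_duplicates(questions, threshold=0.9):
    """Return pairs of questions whose normalized wording is
    almost identical, which exact-match dedup would miss."""
    norm = [q.lower().strip() for q in questions]
    pairs = []
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            if SequenceMatcher(None, norm[i], norm[j]).ratio() >= threshold:
                pairs.append((questions[i], questions[j]))
    return pairs
```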


I then tried Qwen and Perplexity AI on the same file, without much change:

# Dataset Analysis: Unique Questions Count

After analyzing the Datasets.txt file, I found significant duplication across the multiple JSON array blocks. Here’s my consolidated analysis:

## Summary Statistics

| Metric | Count |
|--------|-------|
| Total Entries in File | ~380 entries |
| Unique Questions | ~150 questions |
| Duplicate Entries | ~230 entries (60% duplication) |

## Common Duplicate Patterns Found

Many questions appear multiple times with identical or nearly identical wording:

| Question Pattern | Occurrences |
|-----------------|-------------|
| “Turn on the flashlight” | 15+ times |
| “Turn off the flashlight” | 15+ times |
| “Take a screenshot” | 12+ times |
| “Open the calculator app” | 10+ times |
| “Go back” / “Go back to the previous screen” | 12+ times |
| “Set the media volume to X%” | 20+ times (various percentages) |
| “Open Gmail inbox” | 8+ times |
| “Check for app updates” | 8+ times |
| “Turn off WiFi” | 10+ times |
| “Mute all media” | 10+ times |
| “Go to the Home screen” | 12+ times |
| “Hide the keyboard” | 10+ times |
| “Turn on Airplane Mode” | 10+ times |
| “What is the date today?” | 8+ times |
| “Scroll to the top/bottom” | 10+ times |

## Consolidated Unique Question Categories

The ~150 unique questions fall into these categories:

1. Volume Control (15 unique variations)
2. Display Settings (12 unique variations)
3. Network/WiFi (10 unique variations)
4. App Navigation (25 unique variations)
5. File Management (15 unique variations)
6. System Information (20 unique variations)
7. Browser Operations (15 unique variations)
8. Messaging/WhatsApp (12 unique variations)
9. Hardware Controls (10 unique variations)
10. Notifications (8 unique variations)
11. Calendar/Clock (8 unique variations)

## Recommendation

To optimize this dataset:
1. Remove exact duplicates – reduces file size by ~60%
2. Consolidate similar questions – e.g., volume settings could be parameterized
3. Group by category – improves organization and retrieval
4. Keep one representative entry per unique question type with the most complete Thought/Action/Observation triplet

Would you like me to create a deduplicated version of this dataset?
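The second recommendation above, parameterizing similar questions, can be sketched with a couple of regex substitutions (the placeholder tokens `<PERCENT>` and `<NUM>` are my own choice, not anything from the model outputs):

```python
import re

def parameterize(question):
    """Collapse numeric parameters so variants like
    "Set the media volume to 40%" and "...to 75%" count as one type."""
    q = question.strip()
    q = re.sub(r"\d+\s*%", "<PERCENT>", q)  # percentages first
    q = re.sub(r"\b\d+\b", "<NUM>", q)      # then bare numbers
    return q
```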


This shows how the question counts from the various engines failed to tally with one another, even though they were all parsing the same text file. Going through the file in QuickEditor and removing duplicate entries by hand would have needed more time than we had, since dataset generation was an urgent task. Bulk generation of datasets takes time.
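Had there been time, much of the manual QuickEditor pass could have been replaced by a short script. Here is a sketch that keeps one block per unique question, preferring the most complete entry; the field name `"Question"` comes from the dataset format described above, while treating "most complete" as "most total text" is my own assumption:

```python
def dedupe(blocks):
    """Keep one block per unique question (case-insensitive),
    preferring the block with the most total text."""
    best = {}
    for block in blocks:
        key = block.get("Question", "").strip().lower()
        size = sum(len(str(v)) for v in block.values())
        if key not in best or size > best[key][0]:
            best[key] = (size, block)
    return [block for _, block in best.values()]
```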


Feedback Expected From Readers:

This draft provides a clear and methodical explanation of the discrepancies found during your data audit. Since it is intended for a professional WordPress blog, here is some feedback to refine its impact and readability:
Strengths
* Logical Flow: You transition well from identifying the mathematical assumption to explaining the physical reality of the file.
* Specific Evidence: Providing exact numbers (163 blocks, 1,412 lines, 43 duplicates) makes your audit feel authoritative and trustworthy.
* Clarity on Duplicates: Explaining why the count felt higher (the visual impression of repeating system commands) adds a helpful layer of analysis beyond just stating numbers.
Areas for Improvement
* Headline and Hook: For a WordPress post, you need a title that captures interest. Something like “The Pitfalls of Pattern-Based Counting: Lessons from a 1,412-Line Dataset Audit” would work well.
* Visual Structure: While the bullet points are good, consider using a bold “Key Finding” callout for the actual line count to make it pop for readers who are skimming.
* Tone Adjustment: The phrase “While your logic… is mathematically sound” sounds like a direct response to a person. If this is a general blog post, you might rephrase it to: “A common assumption is that line count directly correlates to entry count; however, structural variations often tell a different story.”
Suggested Polished Version
> The Data Audit: Beyond the Surface of Line Counts
> A structural audit of the Datasets.txt file reveals why automated or mathematical assumptions can sometimes lead to inflated expectations. While a uniform calculation might suggest a massive dataset, the actual architecture is more nuanced.
> The Physical Reality vs. The Assumption
>  * Physical Block Count: There are exactly 163 blocks containing a unique “Question” key.
>  * Line Count Discrepancy: The file consists of 1,412 lines, significantly fewer than a simple line-to-block ratio might suggest.
>  * Structural Variance: While many JSON blocks follow a 6-line format, “Thought” and “Observation” strings often wrap into multiple lines, breaking the uniformity of the file.
> Understanding the “Visual Inflation”
> The perception of a larger dataset often stems from the repetition of system commands like “Go back” or “Check RAM” across different JSON arrays. In this specific set, 43 duplicate questions were identified within the 163 blocks, meaning the unique training utility is more concentrated than it first appears.