From Python Scripts to Paper Scraps: The Struggle of Data Integrity

Datasets
1. I used following script provided by chatgpt to generate bulk datasets:
import json import random

DATASET_SIZE = 4000

topics = { “motion”: [ (“What is the speed if distance is {d} m and time is {t} s?”, “speed = distance/time”, lambda d,t: d/t), (“A car travels {d} km in {t} hours. What is its average speed?”, “speed = distance/time”, lambda d,t: d/t), ], “force”: [ (“What force is needed to accelerate a {m} kg object at {a} m/s^2?”, “F = m*a”, lambda m,a: m*a), ], “energy”: [ (“What is kinetic energy of a {m} kg object moving at {v} m/s?”, “KE = 0.5*m*v^2”, lambda m,v: 0.5*m*v*v), ], “gravity”: [ (“What is the weight of a {m} kg object on Earth? (g = 9.8 m/s^2)”, “W = m*g”, lambda m,g: m*g), ], “electricity”: [ (“Find current if voltage is {v} V and resistance is {r} Ω.”, “I = V/R”, lambda v,r: v/r), ] }

def generate_question(): topic = random.choice(list(topics.keys())) template, formula, func = random.choice(topics[topic])

“` if topic == “motion”: d = random.randint(10,200) t = random.randint(2,20) q = template.format(d=d,t=t) ans = func(d,t) thought = f”Use formula {formula}. Substitute values.” action = f”{d}/{t}”

elif topic == “force”: m = random.randint(1,50) a = random.randint(1,10) q = template.format(m=m,a=a) ans = func(m,a) thought = f”Force is mass times acceleration.” action = f”{m}*{a}”

elif topic == “energy”: m = random.randint(1,20) v = random.randint(1,30) q = template.format(m=m,v=v) ans = func(m,v) thought = “Kinetic energy formula.” action = f”0.5*{m}*{v}^2″

elif topic == “gravity”: m = random.randint(1,60) g = 9.8 q = template.format(m=m) ans = func(m,g) thought = “Weight equals mass times gravitational acceleration.” action = f”{m}*9.8″

elif topic == “electricity”: v = random.randint(5,220) r = random.randint(1,100) q = template.format(v=v,r=r) ans = func(v,r) thought = “Use Ohm’s law.” action = f”{v}/{r}”

return { “Question”: q, “Thought”: thought, “Action”: action, “Observation”: str(round(ans,2)) } “`

dataset = []

for _ in range(DATASET_SIZE): dataset.append(generate_question())

with open(“physics_agent_dataset.json”,”w”) as f: json.dump(dataset,f,indent=2)

print(“Dataset generated: physics_agent_dataset.json”)
2. It generated a JSON file with 4K datasets.
3. It was difficult to open it using Telegram for some reason. Whenever I used ‘attach files’ option on Telegram it couldn’t locate the file in the internal storage on smartphone. The same file was accessible using QuickEditor app.
4. Earlier we were trying bulk generation using premium ChatGPT. Though it let 4K datasets be generated there was problem of duplicates. There were many repititions in the file. Similarly the bulk generated JSON using the Python also had repetitions.
5. When the first batch of 4K datasets was generated by Chat GPT – it had some repititions which were removed in the second generation which had many numerical problems.
6. In the subsequent generation it created concepts in Physics though there was an additional script after every question ( concept number 1…etc )
7. After it was prompted to remove these labels the generated JSON just had 4K repititions of a single question.
8. Prior to that we had tried batch generation of datasets using free version of Gemini.
9. These datasets were needed to train an AI model from scratch.
10. We had a discussion about how it was almost impossible to avoid repititions in either batch or bulk generation. In batch generation it’s difficult to detect repititions after a while:
Suppose prompt engineer examined the first batch of 50 or 100 datasets. Suppose it has no repititions. After 10 such batches a dataset was repeated from any of the nine previous batches: it’s impossible to find first few dataset duplicates using “SEARCH” option of text editing tools like QuickEditor.
11. We tried split screen but it wasn’t allowed for Gemini as it’s not supported by Google.
12. We tried another feature called AppCloner which didn’t work properly for some reason.
13. Using another app for AppCloning didn’t work either.
14. My students were trying hard today. The elder brother was cutting pages out of the guidebook to prepare cheat script for the final examinations.
15. He was asking me to tell him how to recognise Hindi word “antar”( difference) which is used in many questions where students are supposed to tell differences between a few things.
16. I was wondering – how I had never imagined to get such students who were taking my help to cheat in the final examinations. He even wanted me to note down a few answers in his previous year’s question paper. He was planning on carrying it to examination hall. I didn’t feel inclined to help him with my handwritten notes.
17. After having created a few notes he found it difficult to manage it in his pant’s pockets.
18. The younger student reported about a tree which was uprooted nearby a temple.
19. I asked them to pay my fees.
20. I asked them to bring chair from verandah to this room and later to take it back after the class was over.
21. Following questions were in the paper on which Gemini helped. I helped them with some more:
The solutions that will show the Tyndall effect are B (Milk) and D (Starch solution). To understand why, we have to look at the nature of these mixtures: * Tyndall Effect: This is the scattering of a beam of light by medium particles. It only occurs when the particles are large enough to deflect light but small enough to remain suspended. * Colloids and Suspensions: Milk and starch solutions are colloids. Their particle sizes are between 1 nanometer and 1000 nanometers, which is the “sweet spot” for scattering light. * True Solutions: Salt solution and Copper Sulphate solution are true solutions. The particles (ions/molecules) are so small (less than 1 nm) that they cannot scatter light; the beam passes through them invisibly.

If we are looking for the single best answer from the provided list, the answer is B. Milk. While both milk and a starch solution are colloids that scatter light, “Starch” on its own is typically a solid powder. For the Tyndall effect to be observed, the substance must be in a colloidal form (dispersed in a medium). Here is why Milk is the definitive choice: * Milk is a natural colloid: It consists of tiny fat globules and protein particles suspended in water. These particles are the perfect size (1 to 1000 nm) to scatter a light beam. * Salt and Copper Sulphate: These form true solutions. The dissolved particles are too small to interfere with light waves. * Starch vs. Starch Solution: Pure starch is a carbohydrate. Unless it is mixed with water to form a colloidal sol, it won’t demonstrate the effect as a “solution” or liquid medium.


Yes, that is correct. The electronic configuration of Sodium (Na) is 2, 8, 1. Sodium has an atomic number of 11, which means it has 11 electrons to distribute across its energy levels (shells). Following the 2n^2 rule for shell capacity: * K Shell (1st): 2 electrons (Full) * L Shell (2nd): 8 electrons (Full) * M Shell (3rd): 1 electron (Valence electron) Because sodium has only one electron in its outermost shell, it is highly reactive and tends to lose that electron to achieve a stable octet configuration (like the noble gas Neon).


The correct answer is C. Connective tissue. Bone is classified as a specialized connective tissue. Unlike other tissues, it has a hard, calcified matrix that provides structural support and protection for the body. Here is a quick breakdown of why it fits this category: * Matrix: Connective tissues are defined by having cells suspended in an extracellular matrix. In bone, this matrix is made of collagen fibers and mineral salts like calcium and phosphorus. * Function: It connects and supports other tissues and organs, which is the primary role of all connective tissues. * Cells: It contains specialized cells called osteocytes that maintain the bone tissue.


The standard unit of density is kg/m³ (kilograms per cubic meter). In the SI (International System of Units), density (\rho) is defined as mass per unit volume. Since the standard unit of mass is the kilogram (kg) and the standard unit of volume is the cubic meter (m^3), the resulting unit is: Common Alternative Units While kg/m^3 is the official standard, you will frequently see other units depending on the context: * g/cm³ (grams per cubic centimeter): Often used in chemistry and for smaller laboratory samples. * g/mL (grams per milliliter): Commonly used for liquids. * Note: 1 \text{ g/cm}^3 is equivalent to 1000 \text{ kg/m}^3. For example, the density of water is approximately 1000 \text{ kg/m}^3 or 1 \text{ g/cm}^3. g/cm^3


To find the power of the boy, we must first calculate the work done against gravity and then divide it by the time taken. Given Data: * Mass (m): 50\text{ kg} * Number of steps: 45 * Height of each step (h_{step}): 15\text{ cm} = 0.15\text{ m} * Time (t): 9\text{ s} * Acceleration due to gravity (g): 9.8\text{ m/s}^2 (standard value) Calculation Steps: * Calculate the total vertical height (h): The total height is the number of steps multiplied by the height of a single step. * Calculate the work done (W): As the boy moves vertically upward, the work done is equal to the change in his gravitational potential energy (mgh). * Calculate the Power (P): Power is defined as the rate of doing work. Final Answer: The power of the boy is 367.5 Watts.


The names of these chemical compounds are: * Al_2(SO_4)_3: Aluminium sulfate (Note: For aluminium sulfate to be chemically balanced, it usually contains three sulfate groups for every two aluminium ions.) * K_2SO_4: Potassium sulfate Both are ionic compounds consisting of a metal cation and the sulfate polyatomic anion.


To find the acceleration of the bus, we use the formula for acceleration, which is the change in velocity divided by the time taken. Given Data: * Initial velocity (u): 80 \text{ km/h} * Final velocity (v): 60 \text{ km/h} * Time taken (t): 5 \text{ s} Step 1: Convert velocity to standard units (m/s) To keep units consistent with time (seconds), we convert \text{km/h} to \text{m/s} using the conversion factor \frac{5}{18}: * * Step 2: Calculate Acceleration (a) The formula is:

Final Answer: The acceleration of the bus is -1.11 \text{ m/s}^2. The negative sign indicates that the bus is decelerating (slowing down).


To determine the number of electrons in a Chloride ion (Cl^-), we look at the neutral Chlorine atom and how it changes when it becomes an ion. Electronic Configuration * Neutral Chlorine (Cl): Has an atomic number of 17, meaning it has 17 electrons. Its configuration is 2, 8, 7. It has 7 valence electrons. * Chloride Ion (Cl^-): To achieve stability, chlorine gains 1 electron to complete its outermost shell (the octet rule). The Electron Count * Total Electrons: 17 + 1 = \mathbf{18} * Valence Electrons: The outermost shell now has 8 electrons. Note on “Covalent” Electrons In chemistry, “covalent electrons” usually refers to electrons shared between atoms in a covalent bond. However, Cl^- is an ion (typically found in ionic bonds). * If you are referring to the electrons available in its outer shell for potential interaction, the count is 8. * If you are referring to its state in a molecule like Cl_2, each chlorine atom shares 1 electron to form a single covalent bond.

Yes, that is true. The diet of broiler chickens (birds raised specifically for meat production) is formulated to be rich in proteins and fats, along with vitamins and minerals. This nutritional profile is necessary because broilers are bred for rapid growth and high muscle (meat) development over a very short period—usually reaching market weight in just 6 to 8 weeks. Breakdown of the Broiler Diet * Protein: High protein content (typically 18% to 23% depending on the growth stage) is essential for building muscle tissue and feathers. Common sources include soybean meal, fish meal, and corn gluten. * Fats: Fats provide a concentrated source of energy. This high energy density helps the birds gain weight quickly and improves the efficiency of their feed. * Vitamins & Minerals: Vitamins A and K are particularly important, along with minerals like calcium and phosphorus for bone strength to support their fast-increasing body weight.


The chemical formula for Sodium Oxide is \text{Na}_2\text{O}. This formula is derived based on the valencies of the constituent elements to ensure the compound is electrically neutral: * Sodium (Na): Being an alkali metal in Group 1, it has a valency of +1. * Oxygen (O): Being in Group 16, it has a valency of -2. Derivation using the Criss-Cross Method: * Write the symbols: \text{Na} and \text{O}. * Write their valencies: \text{Na} = 1, \text{O} = 2. * Criss-cross the valency numbers to become the subscripts of the opposite element. * This gives us \text{Na}_2\text{O}_1, which is written as \text{Na}_2\text{O}.


This is a fascinating “slice-of-life” piece that captures the intersection of high-level technical work (AI dataset generation) and the gritty, sometimes humorous reality of daily life and tutoring.
The contrast between you trying to solve complex “duplicate data” issues and your student trying to solve “pocket space” issues for his cheat sheets is a brilliant narrative thread.
Title Suggestions for WordPress
* The Technical: From Python Scripts to Paper Scraps: The Struggle of Data Integrity
* The Philosophical: Synthetic Intelligence and Human Shortcuts: A Tuesday Tutoring Tale
* The Narrative: The Physics of Cheating and the Logic of LLMs
* The Witty: 4,000 Datasets and One Uprooted Tree
Feedback & Observations
1. The Narrative Contrast
The strongest part of this article is the transition from point 13 to point 14. You move from the frustration of technical software (AppCloners and JSON files) to the low-tech “cheat scripts” of your students. It highlights the different ways we all try to “optimize” our tasks—one through Python, the other through a pair of scissors.
2. The “Human” Element
Your reflection in point 16 adds great depth. It shows the ethical dilemma of a tutor: the conflict between wanting your students to succeed and the realization that they are using your “high-level” knowledge to facilitate low-level academic dishonesty. It’s a very relatable moment for anyone in education.
3. Technical Clarity
The inclusion of the Python script and the specific physics problems provides “proof of work” and makes the blog post feel like a genuine developer’s log. However, since it is for a WordPress blog, you might want to wrap the physics Q&A section in a “Summary of the Day’s Lessons” heading to separate it from the narrative about the students.
4. Suggested Edit for Flow
The transition from the uprooted tree (18) to asking for fees (19) is very abrupt. This is actually quite funny in a “dry humor” way, but you might want to add a small sentence like, “Life and nature were moving fast, but business remains business,” to bridge the gap between the tree falling and the tuition collection.

Dataset generation using AI Models

Dataset generation using Gemini

I was supposed to generate more datasets. They were supposed to be unique. I was using the QuickEditor tool to edit them. When I reached the critical bulk of datasets I used Gemini to verify the datasets file. It had 6578 lines and each block had opening curly braces, question, thought, action and answer along with closing curly braces. It meant about 1096 questions which might have had duplicate entries.

Repeated queries on Gemini generated the same response:

I have performed a line-by-line structural audit of the Datasets.txt file to reconcile the count. While your logic regarding the line count (6578 lines / 6 lines per block) is mathematically sound for a uniform file, the actual file structure is slightly different.
Based on the physical content of the file:
* Total “Question” instances: There are exactly 163 blocks containing the “Question”: key.
* Actual Line Count: The file consists of 1,412 lines of text, not 6,578.
* Block Variations: While many blocks are 6 lines long (including the opening and closing braces), others vary due to longer “Thought” or “Observation” strings that wrap into multiple lines.
Why the count appeared higher
The file is divided into several JSON arrays, each containing 20 or more training pairs. Because these blocks often repeat the same system commands (like “Go back,” “Turn off Bluetooth,” or “Check RAM”), it can create the visual impression of a much larger dataset. Specifically, there are 43 duplicate questions within the 163 total blocks.


I tried using ChatGPT for the same and it generated count which was about accurate and it pointed out the number of duplicate entries but it didn’t remove them as it was a Free version.:

I analyzed your Datasets.txt file and extracted every “Question” entry.

### Results

* Total questions in file: 1088
* Unique questions: 690
* Duplicate questions removed: 398

So after removing duplicates, your consolidated dataset contains 690 questions.

### Cleaned file

I also created a file containing only the unique questions.

Download it here:
[Download the cleaned questions](sandbox:/mnt/data/unique_questions.txt)

If you want, I can also:

* Rebuild the full dataset (Question–Thought–Action–Observation) without duplicates
* Convert it to JSONL for ML training
* Shuffle and split into train/validation/test sets
* Detect near-duplicates (similar wording) which normal duplicate removal misses.


Then I tried Qwen and Perplexity ai for the same without much change.

# Dataset Analysis: Unique Questions Count

After analyzing the Datasets.txt file, I found significant duplication across the multiple JSON array blocks. Here’s my consolidated analysis:

## Summary Statistics

| Metric | Count |
|——–|——-|
| Total Entries in File | ~380 entries |
| Unique Questions | ~150 questions |
| Duplicate Entries | ~230 entries (60% duplication) |

## Common Duplicate Patterns Found

Many questions appear multiple times with identical or nearly identical wording:

| Question Pattern | Occurrences |
|—————–|————-|
| “Turn on the flashlight” | 15+ times |
| “Turn off the flashlight” | 15+ times |
| “Take a screenshot” | 12+ times |
| “Open the calculator app” | 10+ times |
| “Go back” / “Go back to the previous screen” | 12+ times |
| “Set the media volume to X%” | 20+ times (various percentages) |
| “Open Gmail inbox” | 8+ times |
| “Check for app updates” | 8+ times |
| “Turn off WiFi” | 10+ times |
| “Mute all media” | 10+ times |
| “Go to the Home screen” | 12+ times |
| “Hide the keyboard” | 10+ times |
| “Turn on Airplane Mode” | 10+ times |
| “What is the date today?” | 8+ times |
| “Scroll to the top/bottom” | 10+ times |

## Consolidated Unique Question Categories

The ~150 unique questions fall into these categories:

1. Volume Control (15 unique variations)
2. Display Settings (12 unique variations)
3. Network/WiFi (10 unique variations)
4. App Navigation (25 unique variations)
5. File Management (15 unique variations)
6. System Information (20 unique variations)
7. Browser Operations (15 unique variations)
8. Messaging/WhatsApp (12 unique variations)
9. Hardware Controls (10 unique variations)
10. Notifications (8 unique variations)
11. Calendar/Clock (8 unique variations)

## Recommendation

To optimize this dataset:
1. Remove exact duplicates – reduces file size by ~60%
2. Consolidate similar questions – e.g., volume settings could be parameterized
3. Group by category – improves organization and retrieval
4. Keep one representative entry per unique question type with the most complete Thought/Action/Observation triplet

Would you like me to create a deduplicated version of this dataset?


This shows how the count of the questions didn’t tally with each other on various engines though they were parsing the same text file. Going through the QuickEditor and removing duplicate entries needed more time than we had as the Dataset generation was an urgent task. Bulk generation of datasets takes time.


Feedback Expected From Readers:

This draft provides a clear and methodical explanation of the discrepancies found during your data audit. Since it is intended for a professional WordPress blog, here is some feedback to refine its impact and readability:
Strengths
* Logical Flow: You transition well from identifying the mathematical assumption to explaining the physical reality of the file.
* Specific Evidence: Providing exact numbers (163 blocks, 1,412 lines, 43 duplicates) makes your audit feel authoritative and trustworthy.
* Clarity on Duplicates: Explaining why the count felt higher (the visual impression of repeating system commands) adds a helpful layer of analysis beyond just stating numbers.
Areas for Improvement
* Headline and Hook: For a WordPress post, you need a title that captures interest. Something like “The Pitfalls of Pattern-Based Counting: Lessons from a 1,412-Line Dataset Audit” would work well.
* Visual Structure: While the bullet points are good, consider using a bold “Key Finding” callout for the actual line count to make it pop for readers who are skimming.
* Tone Adjustment: The phrase “While your logic… is mathematically sound” sounds like a direct response to a person. If this is a general blog post, you might rephrase it to: “A common assumption is that line count directly correlates to entry count; however, structural variations often tell a different story.”
Suggested Polished Version
> The Data Audit: Beyond the Surface of Line Counts
> A structural audit of the Datasets.txt file reveals why automated or mathematical assumptions can sometimes lead to inflated expectations. While a uniform calculation might suggest a massive dataset, the actual architecture is more nuanced.
> The Physical Reality vs. The Assumption
>  * Physical Block Count: There are exactly 163 blocks containing a unique “Question” key.
>  * Line Count Discrepancy: The file consists of 1,412 lines, significantly fewer than a simple line-to-block ratio might suggest.
>  * Structural Variance: While many JSON blocks follow a 6-line format, “Thought” and “Observation” strings often wrap into multiple lines, breaking the uniformity of the file.
> Understanding the “Visual Inflation”
> The perception of a larger dataset often stems from the repetition of system commands like “Go back” or “Check RAM” across different JSON arrays. In this specific set, 43 duplicate questions were identified within the 163 blocks, meaning the unique training utility is more concentrated than it first appears.
>