Building a Synthetic Data Generator with Python

As data professionals, we’ve all been there: you need data for testing, development, or prototyping, but it’s out of reach because of privacy concerns, access restrictions, or because it simply isn’t available. At Tasman Analytics, we faced this challenge repeatedly when implementing tracking for clients who needed to understand user behaviour on their websites.

The Problem: When Real Data Isn’t Available

When modelling the output of these website tracking systems, we need realistic user journey data. This data can be complex, combining a number of data points collected from user devices and profiles. Think of a typical user flow: a visitor lands on your homepage, browses to a services page, checks out the blog, and eventually fills out a contact form. These journeys aren’t linear: users drop off at different stages, and the data needs to reflect these real-world patterns.

But what happens when you can’t access the actual tracking data because the tracking isn’t implemented yet, or because you’re working through data quality issues? In the past, this would have been a significant blocker, but with the help of some internal tooling, we can now create synthetic source data to unblock ourselves.

First Attempt: Can’t AI Just Do This?

Like many data teams, our first instinct was to turn to LLMs. “Just generate some fake user data,” we thought. And initially, it worked beautifully:

  • Small scale: Creating 5 user profiles? Perfect. Sensible usernames, consistent emails, realistic device information.
  • Simple journeys: Asking for a single user journey with proper schema formatting? No problem.

To specify the type of data we wanted the LLM to generate, we used the following schema specification and some additional prompting.


{
    "$schema": "<http://json-schema.org/draft-07/schema#>",
    "type": "object",
    "properties": {
        "anonymousId": { "type": "string" },
        "channel": { "type": "string" },
        "context": {
            "type": "object",
            "properties": {
                "page": {
                    "type": "object",
                    "properties": {
                        "initial_referrer": { "type": "string", "format": "uri" },
                        "initial_referring_domain": { "type": "string" },
                        "path": { "type": "string" },
                        "referrer": { "type": "string", "format": "uri" },
                        "referring_domain": { "type": "string" },
                        "tab_url": { "type": "string", "format": "uri" },
                        "title": { "type": "string" },
                        "url": { "type": "string", "format": "uri" }
                    },
                    "required": ["path", "title", "url"]
                },
                "sessionId": { "type": "integer" },
                "timezone": { "type": "string" },
                "userAgent": { "type": "string" }
            },
            "required": ["page", "sessionId", "timezone", "userAgent"]
        },
        "event": { "const": "sign_up" },
        "messageId": { "type": "string" },
        "originalTimestamp": { "type": "string", "format": "date-time" },
        "properties": {
            "type": "object",
            "properties": {
                "user_id": { "type": "string" },
                "email": { "type": "string", "format": "email" },
                "referrer": { "type": "string", "format": "uri" }
            },
            "required": ["user_id", "email"]
        },
        "receivedAt": { "type": "string", "format": "date-time" },
        "request_ip": { "type": "string", "format": "ipv4" },
        "userId": { "type": ["string", "null"] }
    },
    "required": [
        "anonymousId",
        "channel",
        "context",
        "event",
        "originalTimestamp",
        "properties",
        "receivedAt",
        "request_ip"
    ]
}
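
Even with the schema in the prompt, there’s no guarantee that what the LLM returns actually conforms to it, so each generated record still needs to be checked. Here is a minimal sketch of that check using the jsonschema library (the file name and validation helper are our illustration, not part of the original prompt setup):

import json

from jsonschema import ValidationError, validate

# the schema shown above, saved to disk (hypothetical file name)
with open("sign_up_schema.json") as f:
    schema = json.load(f)

def is_valid_record(record: dict) -> bool:
    """Return True if a generated record conforms to the tracking schema."""
    try:
        validate(instance=record, schema=schema)
        return True
    except ValidationError as err:
        print(f"Record rejected: {err.message}")
        return False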

But as we scaled up, the cracks began to show:

  • Inconsistency: When generating 100 users with 1,000 journeys, we started seeing smart TVs and smartwatches as devices (who’s browsing websites on those?)
  • Performance: What should take seconds stretched to minutes
  • Repetition: Every user ended up with identical journey patterns
  • Maintenance: The solution wasn’t reusable across different clients

After three hours of prompt engineering, the verdict was clear: time to code it ourselves.

Enter Faker: The Python Solution

Faker is a lightweight Python library designed specifically for generating fake data. Unlike LLMs, it’s fast, consistent, and predictable. The power of this library comes from its simplicity:

The Basics

from faker import Faker
fake = Faker()

# Generate consistent fake data
name = fake.name()
email = fake.email()

Faker works by maintaining large sets of realistic data that it randomly samples from. This means you get realistic-looking results without the unpredictability of generative AI.
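
Because Faker is simply sampling from these datasets with a pseudo-random generator, runs can also be made fully reproducible by seeding its shared random source, which sidesteps the reproducibility problems we hit with the LLM approach (a small illustration of the library’s seeding API):

from faker import Faker

Faker.seed(42)  # seed the shared random source
fake = Faker()

# the same seed produces the same sequence of values on every run
print(fake.name())
print(fake.email())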

Providers

Providers group data into logical collections. These can be imported into your project as required and provide the (fake) information you need. For example, if you wanted to generate some fake passport profiles, you’d use the Passport Provider:

from faker import Faker
from faker.providers import passport

f = Faker()
f.add_provider(passport)
new_identity = f.passport_full()

print(new_identity)

This gives the following random profile:

>>> Benjamin # First Name
>>> Howell # Family Name
>>> M # Gender 
>>> 11 Aug 1940 # Date of Birth
>>> 17 Dec 2022 # Date of issue
>>> 17 Dec 2032 # Date of expiration
>>> H09112191 # Passport Number 

Building Consistency Through Custom Providers

The real power came when we built custom providers to ensure logical consistency. For example, our user provider ensures that Sarah Johnson from “Mitchell, Rodriguez and Sons” gets an email like sarah.johnson@mitchell-rodriguez-sons.com – not some random Gmail address.

import random
from faker.providers import BaseProvider

class UserProvider(BaseProvider):
    def _create_company_email(self, name: str, company: str) -> str:
        # lower-case the name and join it with dots
        email_name = name.lower().replace(" ", ".")
        # lower-case the company, hyphenate spaces and strip commas for the domain
        email_domain = company.lower().replace(" ", "-").replace(",", "")
        email_ending = random.choice(["com", "io", "ai", "co.uk"])
        return f"{email_name}@{email_domain}.{email_ending}"

We created providers that collect information about Users, their Devices, and common Internet Sources of visitors. This allows us to specify closely what is required, as sketched below.
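
As a rough, self-contained sketch of that idea (provider and method names here are illustrative rather than the internal tool’s API), a public method can draw the name and company once and derive the email from them, so the three fields always agree:

import random

from faker import Faker
from faker.providers import BaseProvider


class ConsistentUserProvider(BaseProvider):
    """Illustrative provider: keeps name, company and email in agreement."""

    def consistent_user(self) -> dict:
        name = self.generator.name()
        company = self.generator.company()
        email_name = name.lower().replace(" ", ".")
        email_domain = company.lower().replace(",", "").replace(" ", "-")
        tld = random.choice(["com", "io", "ai", "co.uk"])
        return {
            "name": name,
            "company": company,
            "email": f"{email_name}@{email_domain}.{tld}",
        }


fake = Faker()
fake.add_provider(ConsistentUserProvider)
print(fake.consistent_user())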

The YAML-Driven Approach

To make this tool usable by our entire team, we built a YAML configuration system that defines:

  • Global parameters: Website URL, number of users, session counts
  • Event properties: Canonical properties that events can use
  • Journey pathways: The flow between different website events, including dropout rates. Each journey makes up a percentage of all journeys, and per-event continuation thresholds determine how many of those journeys make it to each subsequent stage.
  • Platform wrappers: Different tracking platforms capture different metadata

This approach meant that non-technical team members could generate realistic datasets by simply editing a configuration file.

The tool parses this configuration file and uses it to generate journeys in a logically consistent way, producing the raw data needed for downstream modelling.

name: Fake Event Stream
start_date: "2025-08-01"
end_date: "2025-08-31"
url_root: "https://www.acme.ai/"
number_of_users: 100
number_of_journeys: 1000
number_of_products: 4
number_of_articles: 3
destination: "rudderstack"
properties: # Properties used in events
  user_email: 'user.email'
  user_company: 'user.company'
  user_name: 'user.name'
events: # Unique events with associated properties
  login:
    user_email: '$user.email'
    user_company: '$user.company'
    user_name: '$user.name'
    page_url_stub: '/login'
  home_page_view:
    user_email: '$user.email'
    user_company: '$user.company'
    user_name: '$user.name'
    page_title: 'Home'
    page_url_stub: '/'
journeys: # Journeys which are a flow of connected events
  casual_visitor:
    journey_percentage: 0.4
    event_order:
    - id: login
      percentage: 1.0
    - id: home_page_view
      percentage: 0.8
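
To give a flavour of what that parsing involves, here is a rough sketch (hypothetical file name and simplified logic; the real tool also handles timestamps, event properties and the platform wrappers) of how a journey could be sampled from this configuration using PyYAML:

import random

import yaml  # PyYAML

# the configuration shown above, saved to disk (hypothetical file name)
with open("fake_event_stream.yml") as f:
    config = yaml.safe_load(f)

def sample_journey(config: dict) -> list[str]:
    """Pick a journey weighted by journey_percentage, then walk its events."""
    journeys = config["journeys"]
    names = list(journeys)
    weights = [journeys[name]["journey_percentage"] for name in names]
    chosen = journeys[random.choices(names, weights=weights, k=1)[0]]

    events = []
    for step in chosen["event_order"]:
        # each event only fires with its configured continuation percentage
        if random.random() > step["percentage"]:
            break
        events.append(step["id"])
    return events

for _ in range(3):
    print(sample_journey(config))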

The Results

Our final solution generates 3,000 realistic user journey records in under 2 seconds, complete with:

  • Consistent user identities across multiple sessions
  • Realistic journey pathways with appropriate dropout rates
  • Platform-specific metadata and formatting
  • Logical relationships between user attributes

Compare this to our LLM approach, which took minutes and produced inconsistent, unreproducible and repetitive results.

When to Use Each Approach

We’re bullish on the use of LLMs to help build internal tools, but we also recognise the need to identify where they are best used. Here is our recommendation if you’d like to generate synthetic data for your own use cases:

Use LLMs when:

  • The dataset is small (< 1,000 records)
  • Logical consistency isn’t critical
  • You need it once and don’t plan to repeat the data generation process
  • The data structure is simple and easy to describe (or you have a schema to describe it)
  • You don’t need large amounts of variability in your synthetic data.

Use Faker when:

  • You need complex, highly consistent data
  • Performance matters
  • You’ll be generating data repeatedly
  • Multiple team members need to use the tool
  • You need precise control over data relationships.

Key Learnings

  1. Natural language has limits: When the desired outcome is hard to describe precisely in natural language, code becomes a better specification tool.
  2. Consistency is hard: LLMs struggle with maintaining logical relationships across large datasets.
  3. Reusability matters: Building a configurable system pays dividends when you face the same problem repeatedly.
  4. Performance scales differently: Faker’s deterministic approach scales linearly with the number of records, while LLM generation becomes dramatically slower (and less consistent) as the dataset grows.

This experience taught us that while AI tools are powerful, sometimes the right tool for the job is a well-designed library that’s been solving the specific problem for years. The key is knowing when to use each approach and not getting caught up in the hype of using AI for everything.
