Sampling Beyond The Real

Uncorking The Synthetic Data Conversation

For those who are new around here: this is a weekly newsletter where I highlight new and innovative AI products worth exploring.

Hey hey!

Happy Friday! We’re back for another 2024 issue.

This week's issue takes a deeper dive into one aspect of a fundamental pillar of building AI-powered products. Beyond the business-level use cases, the user experience design, the development work, and so much more, lies the fuel of the AI engine: its training data.

This release will mostly focus on synthetic data: a smaller slice of the enormous topic that is AI training data.

In this week’s issue:

  • Product of the Week

  • A huge thank you

  • Other AI Things Happened

  • What I’m Reading

For those of you feeling thrown into the deep end, I’ve added in a short recap, so buckle up!

Synthetic data is like the ultimate backstage pass for AI, letting developers simulate real-world scenarios without compromising privacy or waiting for the perfect dataset.

Imagine creating a digital twin of a bustling city to test self-driving cars, where every traffic jam, pedestrian, and weather condition is generated through code, ensuring the car's AI can handle urban driving. Or picture a healthcare AI system trained on synthetic patient records, rich with diverse medical histories and symptoms, guaranteeing it's well-prepared to diagnose diseases across the globe without ever risking real patient data. Synthetic data offers a sandbox for limitless testing and learning, ensuring AI-powered products are THAT MUCH smarter when the real world throws them a curveball.

For a well-rounded primer on synthetic data, have a read of this piece by Shelby Hiter.

To all you business owners, technical architects, and others shopping for AI-powered products: don’t hesitate to confidently ask vendor sales reps:

  • What kind of data was used for your [Vendor] AI training? How much of it was synthetic?

  • Are you, [Vendor], planning on using synthetic data in future versions of your AI engines?

These questions are small indicators of the growth pace, scope, and technical mastery of the vendor’s AI in question.

PRODUCT OF THE WEEK

Only products of the most exquisite vintage for this newsletter 🍷 curated with the discernment of a product sommelier 😂 After testing dozens of AI products this week, here’s my top pick.

Much of this week's experimentation was around the various synthetic data solutions available, whether demos or open-sourced tech.

Hence, this week’s product pick is Mostly.AI!

Let’s jump into it!

Explaining this week's testing

Knowing that there are many solutions out there when it comes to data synthesis, I was looking for a consistent point of comparison. That meant running a comparative test across every solution I could find and letting the best one rise to the top on its own merits: the same dataset, the same configs wherever possible, and the same quality-of-output metrics.

The use case I developed was data-driven (hilarious, I know): I attempted to enhance a dataset and see whether the results would be adequately usable.

Staying not too far from home, I chose the City of Toronto’s Open Data portal for a base dataset. More specifically, I put myself in the shoes of someone looking to model the traffic volumes of a major city. With a usable dataset in hand, I moved on to brass tacks.
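Before comparing tools, it helps to pin down what “quality of output” actually means. One common yardstick is the two-sample Kolmogorov–Smirnov statistic: how far apart the real and synthetic distributions sit, per column. Here’s a minimal stdlib-Python sketch of that metric — the traffic numbers below are made-up stand-ins, not Toronto’s actual data:

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs. 0 means indistinguishable, 1 means disjoint."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The supremum of the gap between two step functions occurs at a
    # sample point, so checking all observed values is sufficient.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

random.seed(42)
real = [random.gauss(1200, 300) for _ in range(500)]        # stand-in hourly traffic counts
good_synth = [random.gauss(1200, 300) for _ in range(500)]  # drawn from the same distribution
bad_synth = [random.gauss(2000, 50) for _ in range(500)]    # badly mis-modelled generator
```

A low score against the real data is necessary but not sufficient — it says nothing about cross-column correlations — but it makes an honest first hurdle for any generator.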

Notes On The Overall MOSTLY.AI SaaS Experience

Exploring Mostly AI is a good lesson in user-friendly design that prioritizes simplicity and intuitiveness. The information layouts and overall user experience are straightforward, making it easy for nearly anyone to navigate its offerings. What stands out is the clarity of user actions: it’s always obvious what step needs to be taken next, which greatly reduces potential confusion. The platform also deserves credit for its transparency throughout the data creation pipeline. The execution logs, which are both clear and well integrated with the output and results, provide valuable insight into the process. This approach, while not without room for improvement, makes for a more informed and smoother experience.

I animated a quick recap of my Mostly.AI sampling for your viewing enjoyment.

Setting Up Configurations & Running Tests

Getting into the configurations for my tests, my lasting impression was that no one needs to be fluent in technical jargon to grasp Mostly’s functionality. With a basic understanding of AI and data from a business perspective, anyone can navigate the platform’s features. Of course, this doesn’t mean it’s all plain sailing; a bit of diligence and careful reading is still required.

The key actions are presented very simply, and follow the common pattern for much of AI development & data synth:

  1. Import source dataset

  2. Establish key relationships & annotate key data types for particularities

  3. Hit the go button, and sit tight

  4. Marvel (or cry) at the results after much babysitting of a progress bar.
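The four steps above can be sketched end to end in a few dozen lines. This is a toy stdlib-Python stand-in for the pattern — independent per-column samplers substitute for the real generative modelling, and none of these names are Mostly.AI’s actual API:

```python
import csv
import io
import random
import statistics

def fit_column_samplers(rows, schema):
    """Step 2: annotate column types and fit a per-column sampler.
    (Sampling each column independently ignores correlations; real
    tools model the joint structure. This is just the skeleton.)"""
    samplers = {}
    for col, kind in schema.items():
        values = [r[col] for r in rows]
        if kind == "numeric":
            nums = [float(v) for v in values]
            mu, sigma = statistics.mean(nums), statistics.pstdev(nums)
            samplers[col] = lambda mu=mu, sigma=sigma: round(random.gauss(mu, sigma), 1)
        else:  # categorical: resample observed values
            samplers[col] = lambda values=values: random.choice(values)
    return samplers

def generate(samplers, n):
    """Steps 3-4: hit go, then inspect the generated rows."""
    return [{col: fn() for col, fn in samplers.items()} for _ in range(n)]

# Step 1: "import" a tiny stand-in for a traffic-volume CSV.
raw = "location,volume\nKing St,1250\nQueen St,980\nBloor St,1710\n"
rows = list(csv.DictReader(io.StringIO(raw)))
samplers = fit_column_samplers(rows, {"location": "categorical", "volume": "numeric"})
synthetic = generate(samplers, 100)
```

The commercial platforms earn their keep in step 2 — learning relationships between columns and across tables — which is exactly the part this toy version waves away.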

The team behind this platform clearly knows their audience. The tests were very smooth, standing truly apart once compared to some others that were far less friendly. However, there was one parameter I configured in my pursuit of pushing the limits of the platform, wanting to see where the technology had gotten to: the accuracy of the results.

And let me tell you, I paid the price for my curiosity in time: I had time for a coffee and then some chores while letting the service run. At maximum push, default high fidelity, and the recommended number of epochs (the training cycles used to impress knowledge onto a neural network; about 100 here), I sipped tea, stretched, and caught up on much of my housework while the model built an understanding of the dataset. Mind you, this was to be expected when a system uses advanced tech to produce a huge jump in data volume (oops, I did blast past the recommended limit with my request) for making more AI (an interesting circular lifecycle there 😉).

The level of transparency showed everything needed to assess the state of preprocessing and the running outputs of the AI training steps. Almost everything, short of the secret sauce in their AI, lands in the exportable logs, of course. (For the ML sleuths reading: I suspect a hybrid approach of a fine-tuned GAN plus some other preprocessing magic.)

Results

Keeping with the wine theme from our thumbnail: 2.5 hours of fermentation later, the harvest of data points I had set to cask was ready for sampling and decanting. You can see in the GIF above the fantastic clarity and visual communication of the dataset’s profile, outlining the variations in the resulting generated data. There’s even clear identification of the amount of chaos (technical term: noise) being injected into the possible results.
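For the curious, “noise injection” usually means perturbing values with draws from a known distribution, trading a little accuracy for privacy and variety. Here’s a hedged stdlib-Python sketch using Laplace noise, the mechanism behind many differential-privacy schemes — I don’t know which distribution Mostly.AI actually uses, so treat this as illustrative:

```python
import random
import statistics

def add_laplace_noise(value, scale):
    """Inject Laplace-distributed noise: the bigger the scale, the more
    'chaos' in the output. A Laplace draw is an exponential-distributed
    magnitude with a random sign."""
    if scale == 0:
        return value
    return value + random.choice([-1, 1]) * random.expovariate(1 / scale)

# Perturb a column of (made-up) traffic counts around 1000.
random.seed(0)
noisy_counts = [add_laplace_noise(1000, 10) for _ in range(2000)]
```

The nice property: the noise is zero-mean, so aggregates stay honest while individual rows stop being traceable back to any one real record.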

The icing on the proverbial cake? The free tier, openly available after a few clicks, is a gigantic service to the creator community. Talk about a volume-enhancement hack for any AI project I may have in the future!

THANKS TO ALL OF YOU

We launched our first ever community survey 2 weeks ago!

Thank you so much for all of your feedback, your ideas and frankly even your opinions on where the AI Product Report should be headed.

I'll be looking to publish a few key insights from the newsletter’s survey in the next few weeks so you can get a sense of what the community thinks!

The year’s first community poll is essentially closed, but I’ll leave the link up for one last week if you want to jump in late and still have a say in where we’re headed! The idea was to ask you a few questions (seven at most) about what sparks your interest and how I can improve the newsletter.

Thank you for being the best part of this journey. 💌

OTHER AI THINGS HAPPENED

Some other notable news and product launches from this week

WHAT I'M READING

"Data is the new oil" - Clive Humby, 2006

Keeping with this week's theme, my reading was along similar lines. My goals with this were to better understand:

  • What the standard is becoming when it comes to “synth data”.

  • Are there best practices & resources for dataset enhancement using “synth data” that could help give small teams a leg up?

Here’s what I found:

  • What the standard is becoming when it comes to “synth data”

Yes, people across many industries love this stuff; some teams even dedicate entire ventures to making better synth data so that others’ production teams have more to build with.

  • Are there best practices & resources for dataset enhancement using “synth data” that could help give small teams a leg up?

Garbage in, garbage out still holds true for all things data modeling. The more time you spend getting to know your data’s distributions, shape, and characteristics, the more reliably you can generate high-quality synth data.

With enough foundational real data, it’s possible to create a referee model that assesses whether the synth data you’re cooking up with various methods is actually good enough, without needing a human to granularly review it. (The classic architecture built around this idea is the Generative Adversarial Network; truly fascinating stuff if you ask me, and this adversarial trick echoes throughout the AI soup behind today’s generative models.)
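A GAN’s discriminator plays exactly that referee role during training. As a much simpler stand-in, here’s a stdlib-Python nearest-neighbour referee on one-dimensional data — purely illustrative, not how any production evaluator is built:

```python
import random

def referee_score(real, synth):
    """Toy 'referee': for each value, check whether its nearest
    neighbour (excluding itself) comes from the same set. A score
    near 0.5 means the referee can't tell real from synthetic (good);
    near 1.0 means the two sets are trivially separable (bad synth)."""
    labelled = [(x, "real") for x in real] + [(x, "synth") for x in synth]
    hits = 0
    for i, (x, label) in enumerate(labelled):
        nearest = min(
            (j for j in range(len(labelled)) if j != i),
            key=lambda j: abs(labelled[j][0] - x),
        )
        hits += labelled[nearest][1] == label
    return hits / len(labelled)

random.seed(7)
real = [random.gauss(0, 1) for _ in range(200)]
good_synth = [random.gauss(0, 1) for _ in range(200)]   # same distribution
bad_synth = [random.gauss(10, 0.1) for _ in range(200)]  # obviously different
```

In a real GAN the referee (discriminator) and the generator are trained against each other, so the generator keeps improving until the referee is reduced to coin-flipping.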

LET'S TALK OTHER RESOURCES!

Stay well, and until next week.

-✌🏽 Sam

P.S. Interested in having me give you private feedback about a product that you are building? Send me an email: [email protected]
