Thoughtworks Technology Podcast

AI testing, benchmarks and evals

Thoughtworks

Careers, Business, Technology

🗓️ 23 January 2025

⏱️ 36 minutes

Summary

Generative AI's popularity has led to a renewed interest in quality assurance, which is perhaps unsurprising given the inherent unpredictability of the technology. This is why, over the last year, the field has seen a number of techniques and approaches emerge, including evals, benchmarking and guardrails. Although these terms refer to different things, together they aim to improve the reliability and accuracy of generative AI.

To discuss these techniques and the renewed enthusiasm for testing across the industry, host Lilly Ryan is joined by Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager for Thoughtworks' AI Lab. They discuss the differences between evals, benchmarking and testing and explore both what they mean for businesses venturing into generative AI and how they can be implemented effectively.

Learn more about evals, benchmarks and testing in this blog post by Shayan and John (written with Parag Mahajani): https://www.thoughtworks.com/insights/blog/generative-ai/LLM-benchmarks,-evals,-and-tests

Transcript

0:00.0

Welcome to the Thoughtworks Technology Podcast.

0:09.9

I'm your host, Lilly Ryan, and I'm speaking to you from Wurundjeri Country in Australia.

0:15.3

Today we'll discuss benchmarks, evals, tests, and what it really comes down to: the renewed

0:20.6

interest and investment in quality

0:22.1

assurance and testing that's been sparked by businesses' attempts to put generative AI-backed

0:26.2

solutions into production.

0:27.7

To guide us in that discussion, we are talking to Shayan Mohanty and John Singleton.

0:32.8

Shayan is head of AI research, and John is program manager at the Thoughtworks AI Lab.

0:38.5

Both are former co-founders of Watchful. Welcome. Hey. Thank you so much for having us. Looking forward to it.

0:44.3

Could you tell our listeners a bit about who you folks are, your background in the industry,

0:48.0

and what you're working on at Thoughtworks? Yeah, I'll kick it off. So, Shayan and I came into

0:53.6

Thoughtworks through the acquisition of Watchful, the company that we founded, co-founded, which originally set out to help automate the process of labeling data. We operated it for about five and a half, six years and have now been part of Thoughtworks for almost exactly, actually to the day, eight months. So,

1:11.3

huzzah, this is our eight-month anniversary. It's been super exciting. Prior to Watchful,

1:16.4

I've done a number of startups and sales, marketing and operations roles, and even worked with the

1:21.8

inventor of selective laser sintering 3D printing, Dr. Carl Deckard, for a period of time,

1:29.3

making ink for 3D printers.

1:35.0

And now I am principal program manager here at Thoughtworks, helping spread and manage all the amazing work that Shayan and the research team are heading up.

1:39.6

Yep. And I'm Shayan. I'm the former CEO and co-founder of Watchful, along with John.

1:46.8

Did all the stuff there that you'd expect, you know, tried to build a company, did the thing, built a product, got the company sold.

1:54.9

So, woo. And now we work on really cool stuff at Thoughtworks, obviously.

1:59.5

Before all of that, I used to work at

2:02.3

Facebook. So I led the stream processing team that ended up building all the ads network infrastructure

...

Disclaimer: The podcast and artwork embedded on this page are from Thoughtworks, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Thoughtworks and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.

Copyright © Tapesearch 2025.