When ChatGPT was released publicly in November 2022, less than a year and a half ago, few could have predicted how quickly it would reshape the world. Since then, ChatGPT and similar generative AI (GenAI) tools have become a staple of everyday life, both for regular people and at the highest levels of business. Major software and hardware vendors have capitalized on GenAI in numerous ways, some more successfully than others. Most data-driven organizations have dipped their toes into AI in some form, but few have integrated AI into their data stack, and almost none have integrated GenAI there.
For this reason, I was excited to finally get my hands on Snowflake’s GenAI tools this year. The two most exciting ones are Snowflake Document AI, which trains on unstructured data like images and PDF files, and Snowflake Cortex, their easy button for GenAI. I spent several months experimenting with these features: building functions, learning the tools, and creating demos. By the end, I had several use cases that could immediately benefit companies’ data operations. Today, I’ll share two of them that demonstrate the concepts. But first, some technical basics on how to get started.
Preparing for GenAI Success
Before we discuss the two GenAI use cases, let’s talk about how to set yourself up for success. Both use cases rely on Streamlit, a lightweight Python-based UI framework, for data input and for output such as charts and tables. Streamlit comes in two versions: one runs outside of Snowflake on a web server, and the other runs inside Snowflake and is governed by it.
In the past, when I worked with SQL Server, I begrudgingly found myself reaching for Microsoft Access for data entry and lightweight reporting. Now, I simply create Streamlit apps. Even more practical, developers can create reusable Streamlit apps and share them with their teams instead of just sharing code snippets.
How a Streamlit App Works
I’ll share a quick solution that arose while I was trying to generate sample data that included Personally Identifiable Information (PII) for testing. I often need a large data set of values like names and social security numbers to test data systems. I can’t use real data sets for obvious reasons, but the data needs to look real, following the patterns used by names, phone numbers, and social security numbers.