Large language models

4 min read

Reading Time: 4 minutes

Refract: An Overview

Refract is a DSML platform that helps multiple personas like Data Scientist, ML Engineer, Model Quality Controller, MLOPs Engineer, and Model Governance Officer work seamlessly together on any AI use case. Refract accelerates each stage of the ML lifecycle, including data preparation, model development, model deployment, scoring, and monitoring on snowflake data.

Large Language Models — LLMs

Large Language Models are advanced artificial intelligence models that have been trained on vast amounts of text data to understand and generate human-like text. These models are designed to process natural language input, enabling them to understand and generate response text in a way that is contextually relevant and coherent.

LLMs, such as GPT-3 (Generative Pre-trained Transformer 3) developed by OpenAI, have achieved significant advancements in natural language processing and have been applied to a wide range of tasks, including text completion, language translation, question- answering, and more. They can understand and generate text in various languages and can be fine-tuned for specific applications and domains.

The OpenAI API is a cloud-based service provided by OpenAI, where you make requests to OpenAI’s servers to access the language models and receive responses — you do not host the API infrastructure or models on your own servers. But the APIs can be integrated into your applications or services by making HTTP requests to the OpenAI servers.

Limitations of Cloud-hosted private LLMs

Let us discuss a couple of limitations of such Cloud- hosted LLMs.

Data privacy and security: The user might need to send input data (that might be sensitive) to cloud servers (where the LLM is hosted) for processing, and this may be in direct violation of enterprise compliance with relevant data protection regulations.
Lack of transparency: Due to the intricate nature of such models, comprehending them can prove difficult for individuals. The lack of transparency further compounds the issue, impeding one’s understanding of how these models handle data and utilize information for decision-making. Consequently, this could result in limited transparency, and potential issues related to safety and privacy.

Solution: Self-hosted, open-source LLMs

If you look at the timeline (see image above), you will find that several of these LLMs are a part of the open-source community. Usually, these open-source models are released with some pre-trained weights, after getting trained on generic data such as Wikipedia’s crowd-sourced content. These models can either be directly deployed for consumption or they can be fine-tuned on custom data before deployment.

As the title of this blog suggests, the open-sourced models can be fine-tuned/trained on Refract and deployed in Snowflake.

In this blog, I will provide a clear guide on how to fine-tune a T5 model on Refract for text summarization. In this example, we make use of PyTorch and HuggingFace transformers for fine-tuning the model on a News dataset. After fine-tuning, the model was able to generate a summary of long news content.

In this blog we will also illustrate how to deploy the model as a Python UDF (User Defined Function) on Snowflake.

Step-by-Step Guide:

Step 1: Create a project on Refract and attach your GitHub repository to it. This will help in managing all your code.

Step 2: Open the project and launch a template for writing the code. Refract offers multiple pre-designed templates, including one specifically for Snowflake. This template is designed with all the necessary dependencies for Snowflake and flexible to run on various compute options, including GPU, CPU, or Memory Optimized instances. Once you’ve made your choice, you can launch the corresponding template and get started with your Jupyter Notebook right away. In our case of fine-tuning an LLM, we will opt for the GPU enabled Snowflake template.

Step 3: Import libraries and set PyTorch device to GPU. Fine-tuning is a compute-intensive operation, this step enables Refracts GPU to accelerate fine-tuning by parallelizing tasks.

Step 4: Use the following class to initialize the dataset. This will help in tokenizing and preparing the data for fine-tuning. The data should have two columns: one is the document of arbitrary length, and the other is its shorter version or summary.

Step 5: Add the following two functions for training and evaluation.

Step 6: Load the T5 model. T5, an encoder-decoder model, operates by transforming all NLP problems into a text-to-text format. During training, it utilizes a technique called teacher forcing, which necessitates an input sequence and a corresponding target sequence. The input sequence is provided to the model using input_ids, while the target sequence is modified by shifting it to the right. This involves adding a start-sequence token at the beginning and feeding it to the decoder via decoder_input_ids. In the teacher-forcing approach, the target sequence is extended with an EOS token and serves as the label. The start-sequence token is represented by the PAD token. T5 offers the flexibility of being trained or fine-tuned in both supervised and unsupervised manners. In our case we will fine-tune the model in a supervised manner.

Step 7: Prepare training and validation data loaders.

Step 8: Start Training for N number of epochs. You can try starting with N=2 or N=3.

Step 9: Add model to Refracts registry to be able to track it when it moves from one stage to another in the ML lifecycle. Refract comes with a Python SDK (Software Development Kit) which can help with registering the model. User needs to provide a score function, and the model to the register_model function. The score function will have the logic to make predictions using the model. Importing the refractml library, writing the score function, and registering the model are covered in the following three cells.

Save the fine-tuned LLM to Refract persistent storage. The saved model can be downloaded any time from Refract and used for scoring. It has two components: tokenizer and model — we will save both.

Step 10: Deploy the fine-tuned LLM on snowflake. We will use Snowpark Python UDF (User Defined Function) for deploying the model. The UDF will have the logic to load the model and tokenizer into memory, and then generate the response by feeding the input. At this time, the fine-tuned model and tokenizer which was saved to Refract’s persistent storage in step 9, will be uploaded to a stage location. The response will be returned from the UDF.

First, a Snowflake session must be created.

Step 11: Create a stage location on Snowflake and write the model and tokenizer files and configurations to it. In this example, we have created a stage location named T5_P_TOKENIZER for storing the tokenizer and associated files, and T5_P_MODEL for storing the model and the configs of the tokenizer.

Below is an example of adding one of the config files to the Snowflake stage. Similarly, all other files should be put in their respective stages.

Step 12: Write the UDF logic. For the UDF to be able to access the model and tokenizer, we need to bind the files with the UDF. Session.add_imports will help in registering the staged file as an import of the UDF. In our case, we will use add_import for all the files staged in the previous step.

Step 13: Let us create a stage location called LLM and initialize the UDF with that. The Python packages on which the UDF is dependent can be defined using the package param, which takes the list of packages with version as input. In our case we need the parameters, ‘sentencepiece==0.1.95’, ‘snowflake-snowpark-python==1.0.0’, and ‘transformers==4.24.0’.

With this the T5 language model is deployed successfully in Snowflake and is ready for consumption.

Step 14: Now that the model is deployed, we can consume the same using the following code.

Here, the input text would be the document for which the summary must be generated, and the output would be the summary given by the LLM after processing the document.

Conclusion

In conclusion, cloud-hosted, private Large Language Models (LLMs) have certain limitations that need to be considered. These limitations include potential concerns regarding data privacy and security when sending data to cloud servers for processing, as well as the lack of transparency in understanding how these models handle data and make decisions. This lack of transparency can lead to safety and privacy issues, especially when companies choose not to reveal their proprietary code and data. Considering self-hosted, open-source LLMs as a possible solution can help overcome these limitations. These self-hosted, open-source models can be fine-tuned for a custom use case using Refract, and then hosted within a controlled environment like the Snowflake Data Cloud. This way, the deployed LLMs can be used to discover and answer prompts securely. In this blog, we have explored a step-by-step guide on how to fine-tune a T5 model on Refract for text summarization. Additionally, we have explored deploying the model as a Python-based User Defined Function (UDF) on Snowflake.

By leveraging self-hosted, open-source LLMs, organizations can have greater control over their data, ensure compliance with privacy regulations, and enhance transparency in how models handle information, enabling more secure and customizable language processing capabilities.

Click here to learn more about how the combination of Refract and Snowflake can help you get more value from your data with less effort.

Author

Tushar Madheshia

Data Scientist, Fosfor

Tushar Madheshia works as a Data Scientist in the product engineering team at Fosfor. He is passionate about developing AI/ML solutions for complex business problems. Tushar designed and pioneered the MLOps module for Refract in the Fosfor suite, aiding efficient deployment and monitoring of models at scale. Currently, he is integrating Refract with Snowflake for seamless ML experimentation, deployment, and monitoring within Snowflake, without data egress.

More on the topic

Read more thought leadership from our team of experts

Bias in AI: A primer

While Artificial Intelligence (AI) systems can be highly accurate, they are imperfect. As such, they may make incorrect decisions or predictions. Several challenges need to be solved for the development and adoption of technology. One major challenge is the bias in AI systems. Bias in AI refers to the systematic differences between a model's predicted and true output. These deviations can lead to incorrect or unfair outcomes, which can seriously affect critical fields like healthcare, finance, and criminal justice.

Generative AI - Accelerate ML operations using GPT

As Data Science and Machine Learning practitioners, we often face the challenge of finding solutions to complex problems. One powerful artificial intelligence platform that can help speed up the process is the use of Generative Pretrained Transformer 3 (GPT-3) language model.

Choosing the best AI/ML platform from a multimodel vendor

Artificial intelligence (AI) and machine learning (ML) technologies are expanding rapidly as organizations seek to capitalize on the value of their data. Half of the companies surveyed in a 2020 Mckinsey study have already adopted AI in at least one business function.

Privacy & Cookie policy

Privacy & Cookies policy

Cookie name	Active

What is a cookie?

A cookie is a small piece of data that a website asks your browser to store on your computer or mobile device. The cookie allows the website to “remember” your actions or preferences over time. On future visits, this data is then returned to that website to help identify you and your site preferences. Our websites and mobile sites use cookies to give you the best online experience. Most Internet browsers support cookies; however, users can set their browsers to decline certain types of cookies or specific cookies. Further, users can delete cookies at any time.

Why do we use cookies?

We use cookies to learn how you interact with our content and to improve your experience when visiting our website(s). For example, some cookies remember your language or preferences so that you do not have to repeatedly make these choices when you visit one of our websites.

What kind of cookies do we use?

We use the following categories of cookie:

Category 1: Strictly Necessary Cookies

Strictly necessary cookies are those that are essential for our sites to work in the way you have requested. Although many of our sites are open, that is, they do not require registration; we may use strictly necessary cookies to control access to some of our community sites, whitepapers or online events such as webinars; as well as to maintain your session during a single visit. These cookies will need to reset on your browser each time you register or log in to a gated area. If you block these cookies entirely, you may not be able to access gated areas. We may also offer you the choice of a persistent cookie to recognize you as you return to one of our gated sites. If you choose not to use this “remember me” function, you will simply need to log in each time you return.

Cookie Name	Domain / Associated Domain / Third-Party Service	Description	Retention period
__cfduid	Cloudflare	Cookie associated with sites using CloudFlare, used to speed up page load times	1 Year
lidc	linkedin.com	his is a Microsoft MSN 1^st party cookie that ensures the proper functioning of this website.	1 Day
PHPSESSID	ltimindtree.com	Cookies named PHPSESSID only contain a reference to a session stored on the web server	When the browsing session ends
catAccCookies	ltimindtree.com	Cookie set by the UK cookie consent plugin to record that you accept the fact that the site uses cookies.	29 Days
AWSELB		Used to distribute traffic to the website on several servers in order to optimise response times.	2437 Days
JSESSIONID	linkedin.com	Preserves users states across page requests.	334,416 Days
checkForPermission	bidr.io	Determines whether the visitor has accepted the cookie consent box.	1 Day
VISITOR_INFO1_LIVE		Tries to estimate users bandwidth on the pages with integrated YouTube videos.	179 Days

Category 2: Performance Cookies

Performance cookies, often called analytics cookies, collect data from visitors to our sites on a unique, but anonymous basis. The results are reported to us as aggregate numbers and trends. LTI allows third-parties to set performance cookies. We rely on reports to understand our audiences, and improve how our websites work. We use Google Analytics, a web analytics service provided by Google, Inc. (“Google”), which in turn uses performance cookies. Information generated by the cookies about your use of our website will be transmitted to and stored by Google on servers Worldwide. The IP-address, which your browser conveys within the scope of Google Analytics, will not be associated with any other data held by Google. You may refuse the use of cookies by selecting the appropriate settings on your browser. However, you have to note that if you do this, you may not be able to use the full functionality of our website. You can also opt-out from being tracked by Google Analytics from any future instances, by downloading and installing Google Analytics Opt-out Browser Add-on for your current web browser: https://tools.google.com/dlpage/gaoptout & cookiechoices.org and privacy.google.com/businesses

Cookie Name	Domain / Associated Domain / Third-Party Service	Description	Retention period
_ga	ltimindtree.com	Used to identify unique users. Registers a unique ID that is used to generate statistical data on how the visitor uses the web site.	2 years
_gid	ltimindtree.com	This cookie name is asssociated with Google Universal Analytics. This appears to be a new cookie and as of Spring 2017 no information is available from Google. It appears to store and update a unique value for each page visited.	1 day
_gat	ltimindtree.com	Used by Google Analytics to throttle request rate	1 Day

Category 3: Functionality Cookies

We may use site performance cookies to remember your preferences for operational settings on our websites, so as to save you the trouble to reset the preferences every time you visit. For example, the cookie may recognize optimum video streaming speeds, or volume settings, or the order in which you look at comments to a posting on one of our forums. These cookies do not identify you as an individual and we don’t associate the resulting information with a cookie that does.

Cookie Name	Domain / Associated Domain / Third-Party Service	Description	Retention period
lang	ads.linkedin.com	Set by LinkedIn when a webpage contains an embedded “Follow us” panel. Preference cookies enable a website to remember information that changes the way the website behaves or looks, like your preferred language or the region that you are in.	When the browsing session ends
lang	linkedin.com	In most cases it will likely be used to store language preferences, potentially to serve up content in the stored language.	When the browsing session ends
YSC		Registers a unique ID to keep statistics of what videos from Youtube the user has seen.	2,488,902 Days

Category 4: Social Media Cookies

If you use social media or other third-party credentials to log in to our sites, then that other organization may set a cookie that allows that company to recognize you. The social media organization may use that cookie for its own purposes. The Social Media Organization may also show you ads and content from us when you visit its websites.

Ref links:

LinkedIn – https://www.linkedin.com/legal/privacy-policy Twitter – https://gdpr.twitter.com/en.html & https://twitter.com/en/privacy & https://help.twitter.com/en/rules-and-policies/twitter-cookies Facebook – https://www.facebook.com/business/gdpr Also, if you use a social media-sharing button or widget on one of our sites, the social network that created the button will record your action for its own purposes. Please read through each social media organization’s privacy and data protection policy to understand its use of its cookies and the tracking from our sites, and also how to control such cookies and buttons.

Category 5: Targeting/Advertising Cookies

We use tracking and targeting cookies, or ask other companies to do so on our behalf, to send you emails and show you online advertising, which meet your business and professional interests. If you have registered on our websites, we may send you emails, tailored to reflect the interests you have shown during your visits. We ask third-party advertising platforms and technology companies to show you our ads after you leave our sites (retargeting technology). This technology allows us to make our website services more interesting for you. Retargeting cookies are used to record anonymized movement patterns on a website. These patterns are used to tailor banner advertisements to your interests. The data used for retargeting is completely anonymous, and is only used for statistical analysis. No personal data is stored, and the use of the retargeting technology is subject to the applicable statutory data protection regulations. We also work with companies to reach people who have not visited our sites. These companies do not identify you as an individual, instead rely on a variety of other data to show you advertisements, for example, behavior across websites, information about individual devices, and, in some cases, IP addresses. Please refer below table to understand how these third-party websites collect and use information on our behalf and read more about their opt out options.

Cookie Name	Domain / Associated Domain / Third-Party Service	Description	Retention period
BizoID	ads.linkedin.com	These cookies are used to deliver adverts more relevant to you and your interests	183 days
iuuid	demandbase.com	Used to measure the performance and optimization of Demandbase data and reporting	2 years
IDE	doubleclick.net	This cookie carries out information about how the end user uses the website and any advertising that the end user may have seen before visiting the said website.	2,903,481 Days
UserMatchHistory	linkedin.com	This cookie is used to track visitors so that more relevant ads can be presented based on the visitor’s preferences.	60,345 Days
bcookie	linkedin.com	This is a Microsoft MSN 1st party cookie for sharing the content of the website via social media.	2 years
__asc	ltimindtree.com	This cookie is used to collect information on consumer behavior, which is sent to Alexa Analytics.	1 Day
__auc	ltimindtree.com	This cookie is used to collect information on consumer behavior, which is sent to Alexa Analytics.	1 Year
_gcl_au	ltimindtree.com	Used by Google AdSense for experimenting with advertisement efficiency across websites using their services.	3 Months
bscookie	linkedin.com	Used by the social networking service, LinkedIn, for tracking the use of embedded services.	2 years
tempToken	app.mirabelsmarketingmanager.com		When the browsing session ends
ELOQUA	eloqua.com	Registers a unique ID that identifies the user’s device upon return visits. Used for auto -populating forms and to validate if a certain contact is registered to an email group .	2 Years
ELQSTATUS	eloqua.com	Used to auto -populate forms and validate if a given contact has subscribed to an email group. The cookies only set if the user allows tracking .	2 Years
IDE	doubleclick.net	Used by Google Double Click to register and report the website user’s actions after viewing clicking one of the advertiser’s ads with the purpose of measuring the efficiency of an ad and to present targeted ads to the user.	1 Year
NID	google.com	Registers a unique ID that identifies a returning user’s device. The ID is used for targeted ads.	6 Months
PREF	youtube.com	Registers a unique ID that is used by Google to keep statistics of how the visitor uses YouTube videos across different web sites.	8 months
test_cookie	doubleclick.net	This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor’s browser supports cookies.	1,073,201 Days
UserMatchHistory	linkedin.com	Used to track visitors on multiple websites, in order to present relevant advertisement based on the visitor’s preferences.	29 days
VISITOR_INFO1_LIVE	youtube.com		179 days

Third party companies	Purpose	Applicable Privacy/Cookie Policy Link
Alexa	Show targeted, relevant advertisements	https://www.oracle.com/legal/privacy/marketing-cloud-data-cloud-privacy-policy.html To opt out: http://www.bluekai.com/consumers.php#optout
Eloqua	Personalized email based interactions	https://www.oracle.com/legal/privacy/marketing-cloud-data-cloud-privacy-policy.html To opt out: https://www.oracle.com/marketingcloud/opt-status.html
CrazyEgg	CrazyEgg provides visualization of visits to website.	https://help.crazyegg.com/article/165-crazy-eggs-gdpr-readiness Opt Out: DAA: https://www.crazyegg.com/opt-out
DemandBase	Show targeted, relevant advertisements	https://www.demandbase.com/privacy-policy/ Opt out: DAA: http://www.aboutads.info/choices/
LinkedIn	Show targeted, relevant advertisements and re-targeted advertisements to visitors of LTI websites	https://www.linkedin.com/legal/privacy-policy Opt-out: https://www.linkedin.com/help/linkedin/answer/62931/manage-advertising-preferences
Google	Show targeted, relevant advertisements and re-targeted advertisements to visitors of LTI websites	https://policies.google.com/privacy Opt Out: https://adssettings.google.com/ NAI: http://optout.networkadvertising.org/ DAA: http://optout.aboutads.info/
Facebook	Show targeted, relevant advertisements	https://www.facebook.com/privacy/explanation Opt Out: https://www.facebook.com/help/568137493302217
Youtube	Show targeted, relevant advertisements. Show embedded videos on LTI websites	https://policies.google.com/privacy Opt Out: https://adssettings.google.com/ NAI: http://optout.networkadvertising.org/ DAA: http://optout.aboutads.info/
Twitter	Show targeted, relevant advertisements and re-targeted advertisements to visitors of LTI websites	https://twitter.com/en/privacy Opt out: https://twitter.com/personalization DAA: http://optout.aboutads.info/

Save settings

Overview

Partners

What’s hot

Industries

Roles

Knowledge hub

About Fosfor

The Fosfor Decision Cloud

What’s hot

An ecosystem geared for value

Industries

Roles

Knowledge hub

About Fosfor

Refract: An Overview

Large Language Models — LLMs

Limitations of Cloud-hosted private LLMs

Solution: Self-hosted, open-source LLMs

Step-by-Step Guide:

Conclusion

Author

Tushar Madheshia

More on the topic

Bias in AI: A primer

Generative AI - Accelerate ML operations using GPT

Choosing the best AI/ML platform from a multimodel vendor

What is a cookie?

Why do we use cookies?

What kind of cookies do we use?

Category 1: Strictly Necessary Cookies

Category 2: Performance Cookies

Category 3: Functionality Cookies

Category 4: Social Media Cookies

Ref links:

Category 5: Targeting/Advertising Cookies

Overview

Partners

What’s hot

Industries

Roles

Knowledge hub

About Fosfor

The Fosfor Decision Cloud

What’s hot

An ecosystem geared for value

Industries

Roles

Knowledge hub

About Fosfor

Large language models

Refract: An Overview

Large Language Models — LLMs

Limitations of Cloud-hosted private LLMs

Solution: Self-hosted, open-source LLMs

Step-by-Step Guide:

Conclusion

Subscribe to get more insights

Author

Tushar Madheshia

More on the topic

Bias in AI: A primer

Generative AI - Accelerate ML operations using GPT

Choosing the best AI/ML platform from a multimodel vendor

What is a cookie?

Why do we use cookies?

What kind of cookies do we use?

Category 1: Strictly Necessary Cookies

Category 2: Performance Cookies

Category 3: Functionality Cookies

Category 4: Social Media Cookies

Ref links:

Category 5: Targeting/Advertising Cookies