Get started with A.I. in Life Sciences: Document Summary with Amazon Bedrock

It’s tantalizing to talk about generative AI, but how can you make it practical for your enterprise? While working with life sciences companies whose cross-functional teams include wet- and dry-lab scientists, we have found that document summary is a compelling first use case for generative AI. Generative AI models write actionable, mostly accurate summaries of the following types of documents:

  1. Scientific papers
  2. Quality documents
  3. Instrument manuals

Naturally, the above documents may contain proprietary or sensitive information, so we recommend that you use Generative AI models that run exclusively in your virtual private cloud, as opposed to public websites and models. See our previous article on Security and Governance for AI for more on this topic.

In the following sections we’ll show you how to use Amazon Bedrock and Quilt Qurator to summarize scientific documents.

Background

We’re going to use pre-trained foundation models for our tutorial. We recommend Claude 3 Sonnet as a starting point since it supports a wider variety of document attachments—including PDF, Word, and HTML—than the admittedly more capable Claude 3.5 Sonnet.

Foundation models are turnkey models that possess a general understanding of the world. They apply this general understanding to specific problems at hand via a context window that contains your conversation with the model plus any attachments. The larger the context window, the more information the model can consider when responding to your prompts. Each human word corresponds to a little more than one token, so a 20-page PDF with 500 words per page contains about 10,000 words (or about 13,000 tokens) and easily fits into the 200,000-token context of Anthropic models like Claude 3 Sonnet.
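The sizing arithmetic above is easy to sketch in code. The 1.3 tokens-per-word ratio below is the rough approximation used in this article, not an exact figure; real counts depend on the model's tokenizer:

```python
# Back-of-the-envelope sizing of a document against a model's context
# window. The tokens-per-word ratio is an approximation; actual token
# counts depend on the model's tokenizer.

TOKENS_PER_WORD = 1.3

def estimate_tokens(word_count: int) -> int:
    """Approximate token count for a given number of words."""
    return round(word_count * TOKENS_PER_WORD)

def fits_in_context(word_count: int, context_tokens: int = 200_000) -> bool:
    """Check whether a document plausibly fits in the context window."""
    return estimate_tokens(word_count) <= context_tokens

# A 20-page PDF at ~500 words per page:
words = 20 * 500                  # 10,000 words
tokens = estimate_tokens(words)   # ~13,000 tokens
```

For single-document summary the context window is rarely the bottleneck; it matters more when you start combining many documents in one conversation.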

Foundation models are distinct from fine-tuned models. Fine-tuned models require custom training and are beyond the scope of this article.

Summarizing scientific documents with AWS Bedrock

Enable Claude 3 Sonnet

  1. Sign in to the Amazon Console, select a Bedrock-compatible region, and navigate to Bedrock.
  2. Scroll down to Bedrock configurations > Model access and request access to Claude 3 Sonnet. After a few minutes you should see “Access granted.”
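If you prefer to verify access from code rather than the console, the boto3 `bedrock` client can list the foundation models visible in your region. The filtering helper below is our own convenience, and the region is an assumption; model IDs vary by region and over time:

```python
# Sketch: list Bedrock foundation models matching a name fragment.
# The helper is pure; the __main__ block requires AWS credentials
# and a Bedrock-enabled region (region name here is an assumption).

def find_models(model_summaries: list, name_fragment: str) -> list:
    """Return modelIds whose modelName contains the given fragment."""
    return [
        m["modelId"]
        for m in model_summaries
        if name_fragment.lower() in m.get("modelName", "").lower()
    ]

if __name__ == "__main__":
    import boto3  # only needed for the live call, not the helper
    bedrock = boto3.client("bedrock", region_name="us-east-1")
    summaries = bedrock.list_foundation_models()["modelSummaries"]
    print(find_models(summaries, "Claude 3 Sonnet"))
```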

Summarize the first document

  1. Navigate to Playgrounds > Chat and select Claude 3 Sonnet as your model.
  2. Attach the paper Nonheritable Cellular Variability Accelerates the Evolutionary Processes of Cancer.

You’ll find that Claude 3 Sonnet has a 4.5MB file size limit. On a Mac you can shrink an oversized PDF in Preview by exporting it as a PDF and choosing the Reduce File Size Quartz filter.

We can now prompt the model:

Repeat the abstract verbatim then give your own summary focusing on things not in the abstract.

You’ll probably notice that the model finds the author summary on page 2 but doesn’t quite grab the abstract from the first page. You can think of generative AI models as bright but sometimes callow and inexperienced assistants. That means you’ll need to iteratively prompt them to refine their output. The more targeted and specific your prompts, the better the response quality. Since generative models create statistically probable but imperfect sentences, you should independently verify important conclusions.

I asked you to quote the abstract verbatim and what you responded with was nowhere in the paper. Look at the summary section.

Now you’ll see the model produces the correct abstract. We proceed as follows:

Summarize the Discussion and give me the first five references in a numbered list.
Amazon Bedrock summarizes a scientific paper with Claude 3 Sonnet.
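The same chat can be scripted against the Bedrock Converse API, which accepts document attachments alongside text prompts. A minimal sketch, assuming AWS credentials, a local `paper.pdf`, and an illustrative model ID:

```python
# Sketch: send a PDF plus a prompt to Claude 3 Sonnet via the Bedrock
# Converse API. The model ID and file name below are illustrative
# assumptions; check your region's model catalog for exact IDs.

def build_message(prompt: str, pdf_bytes: bytes, doc_name: str = "paper") -> dict:
    """Build a Converse-API user message with a PDF document block."""
    return {
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": doc_name,
                          "source": {"bytes": pdf_bytes}}},
            {"text": prompt},
        ],
    }

if __name__ == "__main__":
    import boto3  # live call requires AWS credentials
    runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
    with open("paper.pdf", "rb") as f:
        message = build_message(
            "Repeat the abstract verbatim then give your own summary "
            "focusing on things not in the abstract.", f.read())
    response = runtime.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[message])
    print(response["output"]["message"]["content"][0]["text"])
```

Because `converse` takes the full `messages` list each call, follow-up prompts like the correction above are just additional entries appended to that list.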

Configuration

You can adjust the behavior of the model with system prompts (general guidelines that apply to the entire conversation) or with the temperature. Lower temperatures produce more deterministic responses; higher temperatures produce more creative ones. For scientific paper summary we recommend starting with a temperature of zero and adjusting upward according to your needs.
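In the Converse API those two knobs map to the `system` and `inferenceConfig` parameters. A sketch, where the request-building helper and the sample system prompt are our own illustrations:

```python
# Sketch: system prompt and temperature as Converse-API parameters.
# The helper assembles keyword arguments for bedrock-runtime's
# converse(); the model ID and prompt text are illustrative.

def build_request(model_id: str, user_text: str,
                  system_prompt: str, temperature: float = 0.0) -> dict:
    """Assemble converse() kwargs with a system prompt and temperature."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user",
                      "content": [{"text": user_text}]}],
        "inferenceConfig": {"temperature": temperature},
    }

if __name__ == "__main__":
    import boto3  # live call requires AWS credentials
    runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
    kwargs = build_request(
        "anthropic.claude-3-sonnet-20240229-v1:0",
        "Summarize the Discussion section.",
        "You are a careful scientific editor. Quote sources verbatim "
        "when asked, and never invent citations.",
        temperature=0.0)
    print(runtime.converse(**kwargs)["output"]["message"]["content"][0]["text"])
```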

If you’re starting to think there is art to this science, you’re right. Large language models (LLMs) emit statistically plausible responses as a function of their training and the context you provide to them.

Cost

We’ll be using Bedrock on-demand throughput, which charges by the number of tokens in and the number of tokens out. For Claude 3 Sonnet, Amazon charges $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens. So if you ask 20 questions about a 20-page paper and receive 2,000 total tokens in response, you’ll pay about 33 cents.
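The pricing arithmetic is easy to script. Note that in a chat session the document's tokens are typically re-sent with each turn, so input cost scales with the number of questions; the 33-cent figure corresponds to roughly 100,000 cumulative input tokens across the conversation. Rates below are the on-demand prices quoted above and may change:

```python
# Sketch: on-demand Bedrock cost for Claude 3 Sonnet at the rates
# quoted above ($0.003 per 1K input tokens, $0.015 per 1K output).
# Prices change; check the current Bedrock pricing page.

INPUT_RATE = 0.003 / 1000    # dollars per input token
OUTPUT_RATE = 0.015 / 1000   # dollars per output token

def cost(input_tokens: int, output_tokens: int) -> float:
    """Total on-demand cost in dollars for one conversation."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# One pass over a ~13,000-token paper with 2,000 tokens of answers:
single_pass = cost(13_000, 2_000)       # ≈ $0.07
# A longer chat accumulating ~100,000 input tokens:
full_chat = cost(100_000, 2_000)        # ≈ $0.33
```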

Summarizing quality docs and instrument manuals with Quilt Qurator

Teams that neither wish to send scientists to the Amazon Console nor want to roll their own AI infrastructure can use Quilt Qurator. Qurator is an out-of-the-box solution that sets up and administers Amazon Bedrock for you, provides a user interface for non-developers, and automates retrieval-augmented generation (RAG) so that your model has access to the right information to answer your questions. Qurator is part of the Quilt Platform.

We start by uploading the FDA’s Quality System Regulation Process Validation and Illumina’s MiSeq System Guide to a Quilt package.

Drag and drop a PDF into S3 with the Quilt Catalog

We then click on the first document and Summarize and Chat with AI. We can now see how scientists would gain an understanding of IQ, OQ, and PQ from FDA materials with the help of Qurator:

Define IQ, OQ, and PQ for scientists in a bulleted list.

We receive the following response:

Quilt Qurator interprets FDA quality docs on IQ, OQ, PQ—powered by Amazon Bedrock.

Switching to the MiSeq system guide we now ask Qurator “How do I change a flow cell?” Qurator responds as follows:

The guide provides the following steps to change/load a new flow cell on the MiSeq:

    1. Put on a new pair of powder-free gloves.
    2. Using plastic forceps, remove the old flow cell from the flow cell container.
    3. Rinse the new flow cell with laboratory-grade water to remove any excess salts. Dry it thoroughly with a lint-free lens tissue, being careful around the gasket area.
    4. Clean the flow cell glass with an alcohol wipe and dry it to remove any streaks or lint.
    5. In the software, go to the Load Flow Cell screen during run setup.
    6. Open the flow cell compartment door and press the release button to open the flow cell clamp.
    7. Place the new flow cell on the flow cell stage.
    8. Gently press down on the flow cell clamp to close and secure the new flow cell in place.
    9. Close the flow cell compartment door.
    10. Proceed with loading the reagents and starting the run.

The guide emphasizes properly cleaning and drying the new flow cell before loading to prevent any salts or debris from affecting imaging and sequencing.

Conclusion

Consider models as bright but sometimes hasty and inexperienced assistants. Prompt them to correct their mistakes and refine their conclusions. The more targeted and specific your prompts, the better the response quality.

You can now summarize scientific papers, quality documents, and instrument manuals with Amazon Bedrock and Quilt Qurator. In future articles we’ll tackle the more ambitious problem of document corpus summary, where we’ll use LLMs to summarize and interpret findings across hundreds of documents.

As of this writing generative AI is an effective tool for document summary in the life sciences. As with the early days of any software technology—remember the early Apple Maps?—there will be bumps along the road. We invite you to temper your expectations of machine intelligence with a bit of healthy skepticism. With time and experience your intuition for when to trust the machine will improve, as will the intelligence of the machine.
