Introduction
Amazon Bedrock is an AWS service that enables developers to leverage Generative AI models without the complexity of training, hosting, or scaling them. With Bedrock, users can seamlessly integrate AI models into their applications and workflows. One of its powerful features is model evaluation, which allows users to assess and compare different AI models before deployment.
In this blog, we will walk through a hands-on guide to configuring an Amazon S3 bucket for use with Amazon Bedrock and running a model evaluation job to compare the performance of a Generative AI model.
What We Are Going to Do
This tutorial will cover:
- Creating Amazon S3 buckets to store the dataset and results.
- Configuring permissions and CORS settings.
- Setting up an evaluation job in Amazon Bedrock.
- Interpreting evaluation results for model selection.
Use Cases
Amazon Bedrock’s model evaluation feature is useful for:
- Machine Learning Engineers: Testing and comparing different Generative AI models.
- Cloud Architects: Integrating AI models into cloud-based applications.
- DevOps Engineers: Managing AI-based workflows in cloud environments.
- Software Engineers: Embedding AI capabilities into web and mobile applications.
Prerequisites
Familiarity with the following AWS services will be beneficial:
- Amazon S3 (for storage management)
- Amazon Bedrock (for Generative AI services)
If you're unfamiliar with these, consider reading AWS documentation before proceeding.
Step 1: Creating Amazon S3 Buckets for Dataset and Results
Amazon Bedrock requires two S3 buckets:
- Dataset Bucket – Stores the input dataset (prompt-response pairs) for model evaluation.
- Results Bucket – Stores the evaluation results after Amazon Bedrock processes the dataset.
1.1 Creating the Dataset Bucket
1. Log in to the AWS Management Console.
2. In the search bar, type S3 and click on Amazon S3.
3. Click the Create Bucket button.
4. Enter a unique bucket name (e.g., bedrock-dataset-bucket).
5. Select the AWS Region where you want to store the dataset.
6. Leave other settings as default and click Create Bucket.
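If you prefer to script this step, the bucket can also be created with the AWS SDK for Python (boto3). This is a minimal sketch: the bucket name and Region are the example values from this tutorial, and bucket names must be globally unique, so substitute your own.

```python
# Sketch: create the dataset bucket with boto3 (example name/Region; adjust to your account).
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# For us-east-1 the CreateBucketConfiguration must be omitted;
# for any other Region, pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket="bedrock-dataset-bucket")
```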
1.2 Uploading a Prompt Dataset
1. Open the bedrock-dataset-bucket.
2. Click Upload and add a file named prompt_dataset.json.
3. The dataset should contain example prompts like:
[
{ "prompt": "The chemical symbol for gold is", "category": "Chemistry", "referenceResponse": "Au" },
{ "prompt": "The tallest mountain in the world is", "category": "Geography", "referenceResponse": "Mount Everest" },
{ "prompt": "The author of 'Great Expectations' is", "category": "Literature", "referenceResponse": "Charles Dickens" }
]
4. Click Upload to store the dataset in S3.
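The upload can also be scripted. Here is a small sketch using boto3, assuming prompt_dataset.json is in your working directory and reusing the example bucket name from above:

```python
# Sketch: upload the prompt dataset to the dataset bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="prompt_dataset.json",   # local file shown above
    Bucket="bedrock-dataset-bucket",  # example bucket name
    Key="prompt_dataset.json",        # object key referenced later by the evaluation job
)
```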
1.3 Creating the Results Bucket
1. Go back to Amazon S3.
2. Click the Create Bucket button again.
3. Enter a unique bucket name (e.g., bedrock-results-bucket).
4. Select the same AWS Region as the dataset bucket.
5. Leave other settings as default and click Create Bucket.
Now, Amazon Bedrock will use:
- bedrock-dataset-bucket to read the dataset.
- bedrock-results-bucket to store evaluation results.
Step 2: Configuring Bucket Permissions and CORS
Amazon Bedrock needs permission to read the dataset from one S3 bucket and store results in another. Besides the IAM role you will select later, this requires a Cross-Origin Resource Sharing (CORS) configuration on the buckets.
2.1 Understanding CORS and Why It’s Needed
Cross-Origin Resource Sharing (CORS) is a security mechanism that controls which origins are allowed to send requests to a resource such as an S3 bucket. Because Amazon Bedrock accesses the buckets from outside their own origin, they need an explicit CORS configuration.
Without CORS, Amazon Bedrock will be unable to read the dataset and store the results in S3, and the evaluation job fails with permission errors.
2.2 Setting Up CORS Policy
To allow Amazon Bedrock to access the dataset and store results, follow these steps:
1. Open the bedrock-dataset-bucket in Amazon S3.
2. Go to the Permissions tab.
3. Scroll down to Cross-origin resource sharing (CORS) and click Edit.
4. Add the following CORS configuration:
[
{
"AllowedHeaders": ["*"],
"AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
"AllowedOrigins": ["*"],
"ExposeHeaders": ["Access-Control-Allow-Origin"]
}
]
5. Click Save Changes. Apply the same CORS configuration to the bedrock-results-bucket as well.
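If you manage the buckets from code, the same rules can be applied with boto3's put_bucket_cors. This sketch assumes the example bucket names used in this tutorial:

```python
# Sketch: apply the CORS configuration above to both example buckets.
import boto3

s3 = boto3.client("s3")
cors_config = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"],
        }
    ]
}

for bucket in ("bedrock-dataset-bucket", "bedrock-results-bucket"):
    s3.put_bucket_cors(Bucket=bucket, CORSConfiguration=cors_config)
```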
Step 3: Running a Model Evaluation Job in Amazon Bedrock
Now, we will set up an evaluation job in Amazon Bedrock to compare an AI model’s responses with expected answers.
3.1 Setting Up the Evaluation
1. Open the AWS Management Console.
2. In the search bar, type Bedrock and select Amazon Bedrock.
3. Click on Evaluations in the left menu.
4. Click Create and choose Automatic: Programmatic.
5. Configure the evaluation job with the following details:
- Evaluation Name: bedrock-eval-job-1
- Model Provider: Amazon
- Model: Titan Text G1 - Express
- Task Type: Question and Answer
- Metrics: Accuracy
- Prompt Dataset Location: s3://bedrock-dataset-bucket/prompt_dataset.json
- Evaluation Results Location: s3://bedrock-results-bucket/evaluation-results/
- IAM Role: Select an appropriate IAM Role with S3 access.
Click Create to start the evaluation. The job will take approximately 10 minutes to complete.
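The same job can also be started through the Bedrock API (CreateEvaluationJob). The sketch below mirrors the console settings above; the role ARN is a placeholder, and the exact request shape (including the model-specific inferenceParams) should be verified against the current boto3 documentation before use.

```python
# Sketch: create the evaluation job via the Bedrock API (verify field names against current boto3 docs).
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="bedrock-eval-job-1",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder role ARN
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "prompt_dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://bedrock-dataset-bucket/prompt_dataset.json"
                        },
                    },
                    "metricNames": ["Builtin.Accuracy"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "amazon.titan-text-express-v1",
                    # Model-specific inference parameters as a JSON string; check the model docs.
                    "inferenceParams": '{"temperature": 0, "topP": 1, "maxTokenCount": 512}',
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://bedrock-results-bucket/evaluation-results/"},
)
print(response["jobArn"])
```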
Step 4: Viewing and Analyzing Results
Once the evaluation is complete, follow these steps to view the results:
1. Navigate to the Model Evaluation Jobs page in Amazon Bedrock.
2. Click on the completed job.
3. Open the Evaluation Results Location to access the JSONL file stored in S3.
4. Download and open the file.
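You can also pull the results straight from S3 and inspect them programmatically. Here is a sketch using boto3, with the example bucket and prefix chosen earlier:

```python
# Sketch: download and parse the JSON Lines output written by the evaluation job.
import json
import boto3

s3 = boto3.client("s3")
bucket = "bedrock-results-bucket"   # example results bucket
prefix = "evaluation-results/"      # output prefix configured above

for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    if not obj["Key"].endswith(".jsonl"):
        continue
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
    for line in body.splitlines():
        record = json.loads(line)
        prompt = record["inputRecord"]["prompt"]
        score = record["automatedEvaluationResult"]["scores"][0]["result"]
        print(f"{prompt!r} -> accuracy {score}")
```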
4.1 Example Evaluation Results
Once the evaluation job is completed, Amazon Bedrock generates a JSONL file containing the results. Below is an example output from an evaluation job:
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Accuracy",
"result": 0
}
]
},
"inputRecord": {
"prompt": "The chemical symbol for gold is",
"referenceResponse": "Au"
},
"modelResponses": [
{
"response": " “Au”.",
"modelIdentifier": "amazon.titan-text-express-v1"
}
]
}
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Accuracy",
"result": 1
}
]
},
"inputRecord": {
"prompt": "The tallest mountain in the world is",
"referenceResponse": "Mount Everest"
},
"modelResponses": [
{
"response": " Mount Everest.",
"modelIdentifier": "amazon.titan-text-express-v1"
}
]
}
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Accuracy",
"result": 0
}
]
},
"inputRecord": {
"prompt": "The author of 'Great Expectations' is",
"referenceResponse": "Charles Dickens"
},
"modelResponses": [
{
"response": "Sorry - this model is unable to respond to this request.",
"modelIdentifier": "amazon.titan-text-express-v1"
}
]
}
Explanation of Evaluation Results
Each result contains three key parts:
- inputRecord: The original question (prompt) and expected answer (referenceResponse).
- modelResponses: The AI model’s generated response.
- automatedEvaluationResult: The score assessing how accurate the response is.
How Accuracy is Calculated
The Accuracy metric for question-and-answer tasks is based on the F1 score, the harmonic mean of precision (how much of the model’s response matches the reference answer) and recall (how much of the reference answer is covered by the model’s response).
The F1 score ranges from 0 to 1, where 1 indicates a response that matches the reference perfectly and 0 indicates no usable match.
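To build intuition for why extra punctuation can drag the score down, here is a toy token-level F1 calculation. It is an illustration only, not Bedrock's actual scoring code, and the normalization rules (lowercasing, stripping ASCII punctuation) are assumptions:

```python
# Toy token-level F1; the normalization here is an assumption, not Bedrock's implementation.
import string

def f1(reference: str, response: str) -> float:
    def tokens(text: str) -> list[str]:
        # Lowercase and strip ASCII punctuation before splitting into tokens.
        return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

    ref, res = tokens(reference), tokens(response)
    common = sum(min(ref.count(t), res.count(t)) for t in set(res))
    if common == 0:
        return 0.0
    precision, recall = common / len(res), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f1("Mount Everest", " Mount Everest."))  # 1.0 -- the trailing period is normalized away
print(f1("Au", " “Au”."))                      # 0.0 -- curly quotes are not ASCII punctuation, so “au” never matches au
```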
Analysis of the Example Results
| Prompt | Expected Answer | Model Response | Accuracy Score | Observation |
| --- | --- | --- | --- | --- |
| The chemical symbol for gold is | Au | “Au”. | 0 | The response included extra formatting (quotation marks), causing a mismatch. |
| The tallest mountain in the world is | Mount Everest | Mount Everest. | 1 | Correct answer, properly formatted. |
| The author of 'Great Expectations' is | Charles Dickens | Sorry - this model is unable to respond to this request. | 0 | The model failed to answer the question. This may indicate a knowledge gap. |
Key Takeaways from the Evaluation
- Formatting Matters: The first prompt received a score of 0 because the model enclosed the answer in quotation marks (“Au”.).
- Correct Responses Get Full Score: The second prompt scored 1.0, as the response exactly matched the expected answer.
- Model Limitations: The third prompt failed because the model did not return an answer. This could be due to:
  - A temporary issue with the model.
  - The model lacking sufficient training data on literary topics.
How to Improve Model Performance?
- Refine Prompt Engineering: Modify how questions are framed to reduce false negatives, e.g., by specifying that responses should not include extra punctuation (see the sketch after this list).
- Choose the Right AI Model: Different foundation models have varying strengths. Testing with multiple models can help find the best fit.
- Use Custom Training Data: If models frequently fail to answer domain-specific questions, consider fine-tuning a model on a domain-specific dataset.
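As a concrete, hypothetical example of the first point, the snippet below prepends an instruction asking for a bare answer and writes a revised copy of the dataset. The instruction wording is an assumption you should tune for your own prompts:

```python
# Sketch: add an instruction to each prompt so the model returns a bare, unpunctuated answer.
import json

with open("prompt_dataset.json") as f:
    dataset = json.load(f)

instruction = "Complete the sentence with only the exact answer and no extra punctuation. "
refined = [{**row, "prompt": instruction + row["prompt"]} for row in dataset]

with open("prompt_dataset_refined.json", "w") as f:
    json.dump(refined, f, indent=2)
```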
Summary
By following this guide, you have successfully:
✅ Created two Amazon S3 buckets for storing datasets and results.
✅ Configured CORS permissions for Amazon Bedrock.
✅ Set up an Amazon Bedrock model evaluation job.
✅ Analyzed evaluation results to compare AI model performance.