Easily build document processing pipelines on AWS with Lakechain
With the advent of Generative AI, it is becoming more and more important to capitalize on the many assets companies store in their data lakes. Whether text documents, images, videos, or even audio, you may want to leverage this data for your business. But extracting the essence of these assets often requires implementing complex ETL jobs, using computer vision or natural language processing tools, and sometimes even machine learning models. Project Lakechain, a framework based on the AWS Cloud Development Kit (CDK), aims to address these challenges.
In this blog post, I’ll explain how you can benefit from Lakechain to easily build and deploy document processing pipelines on AWS.
Disclaimer: If you usually read the content I publish (mainly on Serverless and Infra as Code), you might wonder why I'm writing about Generative AI and such a project. There are two reasons for this:
- This project is actually about Infra as Code: it is fully based on CDK and provides a large set of constructs you can use in your stacks.
- Just as Serverless helps you focus on delivering value without worrying about the undifferentiated heavy lifting, this project is impressively simple. Thanks to its well-designed constructs, you can process your documents in minutes rather than days. It also leverages Serverless services as much as possible, to deploy scalable and cost-effective solutions.
Project Lakechain
Lakechain is an open source framework based on the AWS Cloud Development Kit (CDK) for building and deploying document processing pipelines on AWS. It emphasizes modularity and extensibility, and provides 60+ ready-to-use constructs.
These constructs are called “middlewares”, and you can compose or “chain” them to build your pipelines. For example, you can transcribe an audio document into text, translate it, summarize it, and then convert that summary back to audio, generate vector embeddings of it, or both…
Lakechain comes with very detailed documentation for all these middlewares, along with 40+ examples to quickly start building your pipeline.
Let’s build something!
Pre-requisites
You need to set up a few components in your environment before you can start using Lakechain; you will find the full list and more information in the documentation.
Use case
A very common use case we see more and more with our customers is the RAG pipeline (Retrieval Augmented Generation). RAG is a technique that extends the knowledge of foundation models with your own specialized information. This gives LLMs the ability to answer questions or generate content related to your specific context.
To do this, you first need to ingest the content you want the model to use, let's say some PDF documents stored in S3. Several steps are required for your model to be able to leverage that data:
- Extract the text from PDFs,
- Split the text into smaller chunks,
- Generate and store vector embeddings.
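The splitting step above can be sketched in a few lines of TypeScript. This is a deliberately simplified, hypothetical chunker for illustration only, not the actual implementation of the recursive character text splitter middleware (which splits on a hierarchy of separators such as paragraphs and sentences):

```typescript
// Simplified character-based text splitter with overlap between chunks.
// Overlap helps preserve context that would otherwise be cut at a boundary.
function splitText(text: string, chunkSize = 1000, overlap = 100): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward, keeping `overlap` characters from the previous chunk.
    start += chunkSize - overlap;
  }
  return chunks;
}
```

With a 250-character input, a chunk size of 100 and an overlap of 20, this yields three chunks, the last two each starting 80 characters after the previous one.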
Implementation
You first need to create a CDK project with TypeScript: mkdir lakechain-rag && cd lakechain-rag && cdk init app --language=typescript.
You then need to add the Lakechain dependencies to your package.json. Lakechain is very modular, so you can import only the middlewares required for your project. All constructs are available on npmjs. For this use case, we'll need the following ones (v0.7 at the time of writing, please update accordingly):
"dependencies": {
"@project-lakechain/bedrock-embedding-processors": "~0.7.0",
"@project-lakechain/bedrock-text-processors": "~0.7.0",
"@project-lakechain/opensearch-domain": "~0.7.0",
"@project-lakechain/opensearch-vector-storage-connector": "~0.7.0",
"@project-lakechain/pdf-text-converter": "~0.7.0",
"@project-lakechain/recursive-character-text-splitter": "~0.7.0",
"@project-lakechain/s3-event-trigger": "~0.7.0"
},
Once you have installed the dependencies (npm install), you can start using Lakechain in your stack.ts. You might also need to add "moduleResolution": "nodenext" (or node16) to your tsconfig.json.
Now let’s look at the code:
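Here is an abbreviated sketch of the stack, based on the Lakechain documentation at the time of writing (v0.7). Exact builder options, index parameters and bucket settings are illustrative and may differ across versions; refer to the complete example in the Lakechain repository for the authoritative code.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';
import { CacheStorage } from '@project-lakechain/core';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { PdfTextConverter } from '@project-lakechain/pdf-text-converter';
import { RecursiveCharacterTextSplitter } from '@project-lakechain/recursive-character-text-splitter';
import { TitanEmbeddingProcessor, TitanEmbeddingModel } from '@project-lakechain/bedrock-embedding-processors';
import { OpenSearchDomain } from '@project-lakechain/opensearch-domain';
import {
  OpenSearchVectorStorageConnector,
  OpenSearchVectorIndexDefinition
} from '@project-lakechain/opensearch-vector-storage-connector';

export class LakechainRagStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // The bucket receiving the source PDF documents.
    const bucket = new s3.Bucket(this, 'Bucket', {
      encryption: s3.BucketEncryption.S3_MANAGED,
      enforceSSL: true
    });

    // The cache storage shared by the middlewares.
    const cache = new CacheStorage(this, 'Cache');

    // 1 - Trigger the pipeline when a document lands in the bucket.
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(bucket)
      .build();

    // 2 - Extract the text from the PDF documents.
    const pdfConverter = new PdfTextConverter.Builder()
      .withScope(this)
      .withIdentifier('PdfConverter')
      .withCacheStorage(cache)
      .withSource(trigger)
      .build();

    // 3 - Split the text into smaller chunks.
    const textSplitter = new RecursiveCharacterTextSplitter.Builder()
      .withScope(this)
      .withIdentifier('TextSplitter')
      .withCacheStorage(cache)
      .withSource(pdfConverter)
      .withChunkSize(4096)
      .build();

    // 4 - Generate vector embeddings with Amazon Titan on Bedrock.
    const embeddings = new TitanEmbeddingProcessor.Builder()
      .withScope(this)
      .withIdentifier('Embeddings')
      .withCacheStorage(cache)
      .withSource(textSplitter)
      .withModel(TitanEmbeddingModel.AMAZON_TITAN_EMBED_TEXT_V1)
      .build();

    // 5 - Store the embeddings in an OpenSearch vector index.
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 1 });
    const openSearch = new OpenSearchDomain(this, 'Domain', { vpc });

    new OpenSearchVectorStorageConnector.Builder()
      .withScope(this)
      .withIdentifier('Storage')
      .withCacheStorage(cache)
      .withSource(embeddings)
      .withEndpoint(openSearch)
      .withIndex(new OpenSearchVectorIndexDefinition.Builder()
        .withIndexName('vector-index')
        .withKnnMethod('hnsw')
        .withKnnEngine('nmslib')
        .withSpaceType('l2')
        .withDimensions(1536)
        .build())
      .build();
  }
}
```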
The first thing to notice is that with just about 100 lines of code, we can deploy a complete RAG pipeline, able to ingest your PDF documents from S3 and store them as embeddings in a vector store (OpenSearch). This is clearly the key benefit of using Lakechain: it simplifies a lot! Under the hood, each middleware is obviously composed of CDK code (level 3 constructs), but also of Lambda functions and whatever else is required to implement the feature.
Diving deeper into the code, we can see that each middleware is linked to the previous one(s) with the withSource or withSources functions, so we can build our “chain”. It's important to understand that not all middlewares can be chained together. Each one has a set of allowed input types (e.g. text/plain, image/png, application/json, …) and each one produces a specific set of outputs. For example, the translation middleware can only process and generate text documents, while the recognition middleware can only process images. More on this here.
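To illustrate the idea with a minimal model (hypothetical names; Lakechain performs this compatibility validation internally), two middlewares can be chained when at least one output type of the first matches an input type of the second:

```typescript
// Illustrative model of middleware input/output compatibility.
interface MiddlewareSpec {
  name: string;
  inputs: string[];   // accepted MIME types
  outputs: string[];  // produced MIME types
}

const pdfConverter: MiddlewareSpec = {
  name: 'pdf-text-converter',
  inputs: ['application/pdf'],
  outputs: ['text/plain'],
};

const textSplitter: MiddlewareSpec = {
  name: 'recursive-character-text-splitter',
  inputs: ['text/plain'],
  outputs: ['text/plain'],
};

// A middleware can consume another's output if at least one of the
// produced types is among the accepted input types.
function canChain(source: MiddlewareSpec, next: MiddlewareSpec): boolean {
  return source.outputs.some((t) => next.inputs.includes(t));
}
```

With this model, chaining the PDF converter into the text splitter is valid, while the reverse is not, since the splitter never produces application/pdf.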
Deployment
You can deploy the above example using the standard cdk deploy command. Once deployed, you can see the different middlewares in CloudFormation. Let's take a closer look at one of them:
The PdfConverter, like all the other middlewares, is based on the same structure:
- There is obviously some compute to perform the action. Here it is a Lambda function that leverages the pypdf and pdfminer Python libraries to extract the text from the PDF. By the way, as it's open source, you can find the code of this function here.
- But the most interesting part is the Queue & Topic we can see here. This is actually the heart of Lakechain, enabling the creation of those pipelines. The queue serves as the input of the middleware and the topic as the output, and the integration between SQS and SNS enables this chaining:
Execution
Before uploading a document to the input S3 bucket, let's open the CloudWatch log groups of the different middlewares to follow the pipeline execution. In CloudWatch, select the log groups related to your middlewares and start tailing:
Then upload a document to S3, and observe the logs:
Let’s review some of them. First the S3 Trigger:
{
"cold_start": false,
"function_arn": "arn:aws:lambda:us-east-1:123456789012:function:LakechainRagStack-TriggerCompute3D5783C1-ABCDEFG1234",
"function_memory_size": 128,
"function_name": "LakechainRagStack-TriggerCompute3D5783C1-ABCDEFG1234",
"function_request_id": "6d136455-744d-5196-a62f-3aa2f25b50df",
"level": "INFO",
"service": "s3-event-trigger",
"timestamp": "2024-06-27T15:23:03.315Z",
"xray_trace_id": "1-667d83d6-55b3156d6addfe8a4145fac7",
"props": {
"specversion": "1.0",
"id": "8a7392ee-88ce-44c7-8d6f-6d46ffdf7c4d",
"type": "document-created",
"time": "2024-06-27T15:23:03.315Z",
"data": {
"chainId": "0e9b13e4-620b-4d86-9a08-e84bef58d245",
"source": {
"url": "s3://lakechainragstack-bucket83908e77-123456abcdef/serverless-stream-processing.pdf",
"type": "application/pdf",
"size": 3035957,
"etag": "1fcfd4acc03284ee537588ad5c590e10"
},
"document": {
"url": "s3://lakechainragstack-bucket83908e77-123456abcdef/serverless-stream-processing.pdf",
"type": "application/pdf",
"size": 3035957,
"etag": "1fcfd4acc03284ee537588ad5c590e10"
},
"metadata": {},
"callStack": []
}
}
}
The interesting part here is the "props" field. It contains all the information needed by the middlewares to do their job. It is called a “Cloud Event” and is based on the CloudEvents specification. And within this event, the very substance is in the "data" field. It contains:
- the source: the initial document that triggered the pipeline,
- the document: the current document being processed,
- possibly some metadata,
- the callStack, which contains the list of middlewares that have been executed in the pipeline so far (here it's empty as it's the first step).
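Based on the log excerpts, the data payload can be typed roughly as follows. These are the field names as observed in the logs, not the official Lakechain type definitions, and the advance helper is purely illustrative:

```typescript
// Shape of a document reference, as seen in the "source" and "document"
// fields of the cloud event.
interface DocumentRef {
  url: string;
  type: string;   // MIME type, e.g. 'application/pdf'
  size: number;   // size in bytes
  etag: string;
}

// Shape of the "data" field of the cloud event.
interface CloudEventData {
  chainId: string;
  source: DocumentRef;    // initial document that triggered the pipeline
  document: DocumentRef;  // current document being processed
  metadata: Record<string, unknown>;
  callStack: string[];    // middlewares executed so far, most recent first
}

// Illustrative helper: each step swaps in its output document and
// prepends its name to the call stack before publishing the event.
function advance(data: CloudEventData, middleware: string, output: DocumentRef): CloudEventData {
  return { ...data, document: output, callStack: [middleware, ...data.callStack] };
}
```

Note that in the logs above, the call stack lists the most recently executed middleware first, which is why the helper prepends rather than appends.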
Here, the document is the same as the source and metadata is empty, but let’s look at the next middleware, the PdfConverter:
{
"level": "INFO",
"location": "publish_event:42",
"message": {
"specversion": "1.0",
"id": "8a7392ee-88ce-44c7-8d6f-6d46ffdf7c4d",
"type": "document-created",
"time": "2024-06-27T15:23:03.315Z",
"data": {
"chainId": "0e9b13e4-620b-4d86-9a08-e84bef58d245",
"source": {
"url": "s3://lakechainragstack-bucket83908e77-123456abcdef/serverless-stream-processing.pdf",
"type": "application/pdf",
"size": 3035957,
"etag": "1fcfd4acc03284ee537588ad5c590e10"
},
"document": {
"url": "s3://lakechainragstack-pdfconverterstorageafb3b9c1-zyxwv98765/1fcfd4acc03284ee537588ad5c590e10.txt",
"type": "text/plain",
"size": 85286,
"etag": "d79264f22b9f2ec832de6644d3c5c051"
},
"metadata": {
"properties": {
"kind": "text",
"attrs": {
"pages": 32
}
},
"createdAt": "2023-08-30T17:50:47",
"updatedAt": "2023-08-30T17:50:47"
},
"callStack": [
"pdf-text-converter",
"s3-event-trigger"
]
}
},
...
}
We can see that the document changed to text/plain and is actually a text file in another bucket. This is the output of the PDF-to-text conversion. We can also observe some metadata, giving information about the PDF such as the number of pages; each middleware provides its own set of metadata. Finally, we notice that the callStack contains the two middlewares executed so far. I'll let you have a look at the TextSplitter and the BedrockEmbeddingsProcessor.
In the end, do not forget to delete the stack with cdk destroy. Also note that a more complete version of this example is available in the Lakechain repo: here.
Going (even?) further
I already gave a bunch of details on how Lakechain works under the hood. I personally find it very well designed. You can find many more details in the documentation.
You might also think that Lakechain is quite opinionated and framed. In a sense it is, and that is why it is so easy to use. It also comes with best practices in terms of security and observability, and more globally follows the Well-Architected Framework.
But I didn't show all the capabilities. Each middleware provides a set of properties you can configure: you can define filters, change the compute type for ECS tasks, adjust the memory allocated to Lambda functions, and much more (see the API). And if you need even more, have a look at funclets or create a pull request.
Conclusion
In this blog post, we saw how Lakechain can really simplify the creation of document processing pipelines. Thanks to its flexible design and its many available middlewares (60+), you can rapidly deploy a pipeline on AWS to process your documents (text, audio, image, video), transform them, and extract meaningful information. With more than 40 examples, very rich documentation, and a quickstart, you can literally build and deploy quite complex pipelines in minutes rather than days.