[Draft] Add Web API for MarkItDown #202

Open
vs4vijay wants to merge 6 commits into microsoft:main from vs4vijay:add-web-api

Conversation

@vs4vijay

Related to #133

@vs4vijay

In Testing

@vs4vijay vs4vijay changed the title from "Add Web API for MarkItDown" to "[Draft] Add Web API for MarkItDown" on Dec 22, 2024
@xingxingcan

Thank you very much. We need API access in order to build automated processes. We are very optimistic about this project; thanks again to the great developers.

@homanp

homanp commented Jan 3, 2025

I deployed a simple REST API; y'all can try it out. Here are the specs:

https://x.com/pelaseyed/status/1872326539208986903

@GerardSmit GerardSmit left a comment

I tried to run the API locally through Docker. I had to change the following things to make it work.

Comment thread Dockerfile
&& rm -rf /var/lib/apt/lists/*

- RUN pip install markitdown
+ RUN pip install markitdown fastapi uvicorn

Suggested change
- RUN pip install markitdown fastapi uvicorn
+ RUN pip install markitdown fastapi uvicorn python-multipart

Without python-multipart the API won't start.

It's still odd that markitdown is installed from the package index here. It means that if the source changes and you want to check whether the Docker image still works, you first need to publish the package to PyPI. It would be better if the current src folder were included in the image instead.

@markthepixel markthepixel Jan 14, 2025

Use pip install "fastapi[standard]"

When you install with pip install "fastapi[standard]" it comes with some default optional standard dependencies.

If you don't want those optional dependencies, you can run pip install fastapi instead.

Comment thread Dockerfile
USER $USERID:$GROUPID

- ENTRYPOINT [ "markitdown" ]
+ ENTRYPOINT ["uvicorn", "src.markitdown.api:app", "--host", "0.0.0.0", "--port", "8000"]

This doesn't work: src/markitdown/api.py doesn't exist in the image, so the API will not start.

You must copy the src directory:

Suggested change
- ENTRYPOINT ["uvicorn", "src.markitdown.api:app", "--host", "0.0.0.0", "--port", "8000"]
+ COPY src /src
+ ENTRYPOINT ["uvicorn", "src.markitdown.api:app", "--host", "0.0.0.0", "--port", "8000"]

Note that .dockerignore currently ignores everything, so the COPY won't work as is.
Modify .dockerignore so that the src/ folder is included:

*
!/src

@ranma42

ranma42 commented Jan 28, 2025

I tried out this branch and found a couple of issues; locally I fixed them and rebased the branch on top of the current main.
You can find my current state here https://github.com/ranma42/markitdown/tree/add-web-api

Comment thread src/markitdown/api.py
temp_file.write(contents)

markitdown = MarkItDown()
result = markitdown.convert(temp_file_path)

When this is a long CPU-bound operation (for example, converting a long PDF), it can block the executor.
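
One possible way to keep the server responsive during a long conversion is to hand the blocking call to a worker thread and await its result. The sketch below uses only the standard library and is not code from this PR: `slow_convert` is a hypothetical stand-in for `markitdown.convert(temp_file_path)`, `convert_off_loop` and `demo` are illustrative names, and `asyncio.to_thread` assumes Python 3.9 or later.

```python
import asyncio
import contextlib
import time


def slow_convert(path: str) -> str:
    """Hypothetical stand-in for a blocking call such as markitdown.convert(path)."""
    time.sleep(0.2)  # simulate a long, blocking conversion
    return f"converted:{path}"


async def convert_off_loop(path: str) -> str:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_convert, path)


async def demo() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets to run while the conversion is in flight.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    result = await convert_off_loop("/tmp/example.pdf")
    beat.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await beat
    return result, ticks
```

Running `asyncio.run(demo())` should show the heartbeat ticking while the conversion runs. A thread keeps the loop free to accept other requests, but for pure-Python CPU-bound work the GIL still limits throughput, so a process pool may be the better fit in that case.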

Comment thread src/markitdown/api.py

try:
contents = await file.read()
temp_file_path = f"/tmp/{file.filename}"

If multiple requests with the same filename are accepted concurrently, they can conflict by writing to the same /tmp path.
(NB: AFAICT this does not currently occur because requests are handled sequentially.)
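
A per-request temporary file would sidestep the name collision regardless of how requests are scheduled. The following is a minimal sketch using the standard `tempfile` module; `save_upload` is a hypothetical helper, not code from this PR. It keeps only the extension of the client-supplied filename, on the assumption that the converter relies on the extension to detect the file type.

```python
import os
import tempfile


def save_upload(contents: bytes, filename: str) -> str:
    """Hypothetical helper: write an upload to a unique temp file and return its path."""
    # Keep only the extension; the client-supplied name is not used in the path itself.
    suffix = os.path.splitext(os.path.basename(filename))[1]
    # delete=False so the file outlives this function; the caller must remove it.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(contents)
        return tmp.name
```

The caller would pass the returned path to the converter and delete the file afterwards, for example with `os.remove` in a `finally` block.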

@Ahoo-Wang

https://github.com/Ahoo-Wang/markitdown/tree/dev/packages/markitdown-api

https://github.com/Ahoo-Wang/markitdown/blob/dev/packages/markitdown-api/delpoy/markitdown-api.yml

Docker Image:

  • ahoowang/markitdown-api:0.1.2-0.0.12
  • ghcr.io/ahoo-wang/markitdown-api:0.1.2-0.0.12
  • registry.cn-shanghai.aliyuncs.com/ahoo/markitdown-api:0.1.2-0.0.12
