The GDP API lets developers integrate the Genta Document Parser into applications that need to parse one or more document types, such as PDF, Docs, Excel, PowerPoint Presentation, or HTML.
This Quickstart helps you get started using GDP in your development environment. It covers:
How to get a Genta Technology API Key
How to send GDP API requests
Genta Technology API Key
Currently, you can get a Genta Technology API Key by contacting us at:
Contact Us
After you contact us, please allow up to 24 hours for us to generate and set up your API key.
Genta Document Parser API Request
Parse PDF File
Curl
curl -X POST "https://gdp.genta.tech/parse-pdf" \
-H "accept: application/json" \
-H "Authorization: Bearer GENTA_API_TOKEN" \
-F "pdf_file=@example.pdf;type=application/pdf" \
-F "output_type=semantic" \
-F "caption_image=true" \
-F "strategy=auto"
Python
import requests

url = "https://gdp.genta.tech/parse-pdf"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN",  # replace with your API token
}
data = {
    "output_type": "semantic",
    "caption_image": "true",
    "strategy": "auto",
}
# Open the file in binary mode; the context manager closes it after the upload
with open("example.pdf", "rb") as pdf:
    files = {"pdf_file": ("example.pdf", pdf, "application/pdf")}
    response = requests.post(url, headers=headers, files=files, params=data)
Parse Docs File
Curl
curl -X POST "https://gdp.genta.tech/parse-docx" \
-H "accept: application/json" \
-H "Authorization: GENTA_API_TOKEN" \
-F "docx_file=@example.docx;type=application/docx" \
-F "output_type=markdown" \
-F "caption_image=true" \
-F "strategy=auto"
Python
import requests

url = "https://gdp.genta.tech/parse-docx"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN",  # replace with your API token
}
data = {
    "output_type": "markdown",
}
# Open the file in binary mode; the context manager closes it after the upload
with open("example.docx", "rb") as docx:
    files = {"docx_file": ("example.docx", docx, "application/docx")}
    response = requests.post(url, headers=headers, files=files, params=data)
Parse Excel File
Curl
curl -X POST "https://gdp.genta.tech/parse-xlsx" \
-H "accept: application/json" \
-H "Authorization: GENTA_API_TOKEN" \
-F "xlsx_file=@example.xlsx;type=application/xlsx" \
-F "output_type=markdown"
Python
import requests

url = "https://gdp.genta.tech/parse-xlsx"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN",  # replace with your API token
}
data = {
    "output_type": "markdown",
}
# Open the file in binary mode; the context manager closes it after the upload
with open("example.xlsx", "rb") as xlsx:
    files = {"xlsx_file": ("example.xlsx", xlsx, "application/xlsx")}
    response = requests.post(url, headers=headers, files=files, params=data)
Parse PowerPoint Presentation File
Curl
curl -X POST "https://gdp.genta.tech/parse-pptx" \
-H "accept: application/json" \
-H "Authorization: GENTA_API_TOKEN" \
-F "pptx_file=@example.pptx;type=application/pptx" \
-F "output_type=markdown"
Python
import requests

url = "https://gdp.genta.tech/parse-pptx"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN",  # replace with your API token
}
data = {
    "output_type": "markdown",
}
# Open the file in binary mode; the context manager closes it after the upload
with open("example.pptx", "rb") as pptx:
    files = {"pptx_file": ("example.pptx", pptx, "application/pptx")}
    response = requests.post(url, headers=headers, files=files, params=data)
Parse HTML
Curl
curl -X POST "https://gdp.genta.tech/parse-html" \
-H "accept: application/json" \
-H "Authorization: GENTA_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
  "output_type": "markdown",
  "html": "The HTML Code"
}'
Python
import requests

url = "https://gdp.genta.tech/parse-html"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN",  # replace with your API token
}
data = {
    "output_type": "markdown",
    "html": "your html code",
}
# json=data sends the body as JSON and sets the Content-Type header
response = requests.post(url, headers=headers, json=data)
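The file-based requests above differ only in the endpoint path, the name of the upload field, and the MIME type. A minimal sketch of a lookup that picks these from the file extension, using only the values shown in the Python examples above. The `ROUTES` table and the `route_for` helper are hypothetical names for illustration, not part of the GDP API:

```python
from pathlib import Path

BASE_URL = "https://gdp.genta.tech"

# (endpoint path, upload field name, MIME type), copied from the examples above
ROUTES = {
    ".pdf": ("/parse-pdf", "pdf_file", "application/pdf"),
    ".docx": ("/parse-docx", "docx_file", "application/docx"),
    ".xlsx": ("/parse-xlsx", "xlsx_file", "application/xlsx"),
    ".pptx": ("/parse-pptx", "pptx_file", "application/pptx"),
}


def route_for(path):
    """Return (url, field_name, mime_type) for a document path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"Unsupported file type: {suffix or path}")
    endpoint, field_name, mime_type = ROUTES[suffix]
    return BASE_URL + endpoint, field_name, mime_type
```

The returned tuple can then be used to build the `url` and `files` arguments in the same way as the per-format examples.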
Parameter Description
GENTA_API_TOKEN: Your API Token from Genta Technology.
pdf_file, xlsx_file, docx_file, pptx_file: The path to your PDF, Excel, Docs, or PPT file. Change the "example.filetype" to your desired path.
output_type: The output format you want. Options are:
    regular: Provides a list of elements extracted from the PDF.
    markdown: Converts the regular output into a list of elements based on the number of pages and the layout (headers type, etc.).
    semantic: Chunks the elements based on the context and semantic meaning.
caption_image: If the PDF document contains images and you want to run image-to-text captioning, set this to "true".
strategy: The strategy for parsing the documents. Options are:
    "auto": Automatically decides how to parse the documents.
    "fast": Directly extracts text from the PDF file.
    "ocr": Uses OCR to extract text.
    "quality": Uses a layout detection model to identify text and images.
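The option values listed above can be checked locally before a request is sent. A minimal sketch, assuming the values listed above are the complete set and that `"false"` is the accepted way to disable captioning (the source only shows `"true"`). `build_options` is a hypothetical helper, not part of the GDP API:

```python
OUTPUT_TYPES = {"regular", "markdown", "semantic"}
STRATEGIES = {"auto", "fast", "ocr", "quality"}


def build_options(output_type="markdown", caption_image=None, strategy=None):
    """Build the request fields, rejecting values not listed in the parameter description."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {sorted(OUTPUT_TYPES)}")
    options = {"output_type": output_type}
    if caption_image is not None:
        # The examples pass this flag as the string "true"; "false" is an assumption
        options["caption_image"] = "true" if caption_image else "false"
    if strategy is not None:
        if strategy not in STRATEGIES:
            raise ValueError(f"strategy must be one of {sorted(STRATEGIES)}")
        options["strategy"] = strategy
    return options
```

For example, `build_options("semantic", caption_image=True, strategy="auto")` produces the same three fields used in the PDF request.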
Output Example
{
"parsed_document": [
{
"text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Dui nunc mattis enim ut tellus.",
"metadata": {
"filename": "/content/ExampleDocs.pdf",
"languages": ["eng", "ind"],
"last_modified": null,
"page_number": 1,
"text_type": "Header"
}
},
{
"text": "Dui ut ornare lectus sit amet est placerat. ",
"metadata": {
"filename": "/content/ExampleDocs.pdf",
"languages": ["eng", "ind"],
"last_modified": null,
"page_number": 1,
"text_type": "NarrativeText"
}
},
{
"text": "Tempus imperdiet nulla malesuada pellentesque elit eget.",
"metadata": {
"filename": "/content/ExampleDocs.pdf",
"languages": ["eng", "ind"],
"last_modified": null,
"page_number": 1,
"image_base64": "encoded image",
"text_type": "Image"
}
},
{
"text": "Condimentum lacinia quis vel eros donec ac odio tempor orci.",
"metadata": {
"filename": "/content/ExampleDocs.pdf",
"languages": ["eng", "ind"],
"last_modified": null,
"page_number": 1,
"text_type": "FigureCaption"
}
},
{
"text": "Turpis cursus in hac habitasse platea dictumst quisque.",
"metadata": {
"filename": "/content/ExampleDocs.pdf",
"languages": ["eng", "ind"],
"last_modified": null,
"page_number": 1,
"text_type": "NarrativeText"
}
}
],
"document_length": 5,
"time_taken": 4.076918601989746
}
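A response shaped like the example above can be processed with plain dictionary access. A minimal sketch that counts elements per `text_type` and collects the narrative text of one page, assuming the field names shown in the example. `summarize` is a hypothetical helper, and `sample` is a shortened stand-in for a real response (in practice it would come from `response.json()`):

```python
from collections import Counter


def summarize(result, page_number=1):
    """Count elements by text_type and collect NarrativeText from one page."""
    elements = result["parsed_document"]
    counts = Counter(e["metadata"]["text_type"] for e in elements)
    narrative = [
        e["text"].strip()
        for e in elements
        if e["metadata"]["text_type"] == "NarrativeText"
        and e["metadata"]["page_number"] == page_number
    ]
    return counts, narrative


# Shortened stand-in for a parsed response, using the fields from the example above
sample = {
    "parsed_document": [
        {"text": "Lorem ipsum", "metadata": {"page_number": 1, "text_type": "Header"}},
        {"text": "Dui ut ornare lectus. ", "metadata": {"page_number": 1, "text_type": "NarrativeText"}},
        {"text": "Turpis cursus.", "metadata": {"page_number": 1, "text_type": "NarrativeText"}},
    ],
    "document_length": 3,
}

counts, narrative = summarize(sample)
```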