Genta Document Parser

Getting Started Using Genta Document Parser

Written by

Alifais Farrel Ramdhani

Published

Jul 26, 2024

The GDP API lets developers integrate Genta Document Parser into applications that need to parse one or more document types, such as PDF, Docs, Excel, PowerPoint presentations, or HTML.

This Quickstart helps you set up and use GDP in your development environment. It covers:

  1. How to get a Genta Technology API key

  2. How to send GDP API requests

Genta Technology API Key

Currently, you get a Genta Technology API key by contacting us at:

Contact Us

After you contact us, please allow up to 24 hours for us to generate and set up your API key.
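
Once you have a key, a common practice is to read it from an environment variable instead of hard-coding it. The sketch below assumes a variable named GENTA_API_TOKEN (a hypothetical name chosen to match the placeholder used in this Quickstart) and builds headers in the same shape as the Python request examples. The helpers load_genta_token and build_headers are illustrative and not part of GDP.

```python
import os


def load_genta_token(env_var: str = "GENTA_API_TOKEN") -> str:
    """Return the API token stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return token


def build_headers(token: str) -> dict:
    """Build headers shaped like the ones in this Quickstart's Python examples."""
    return {"accept": "application/json", "Authorization": token}
```

A call such as build_headers(load_genta_token()) can then replace the hard-coded headers dictionary in the request examples.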

Genta Document Parser API Request

Parse PDF File

Curl

curl -X POST "https://gdp.genta.tech/parse-pdf" \
  -H "accept: application/json" \
  -H "Authorization: Bearer GENTA_API_TOKEN" \
  -F "pdf_file=@example.pdf;type=application/pdf" \
  -F "output_type=semantic" \
  -F "caption_image=true" \
  -F "strategy=auto"

Python

import requests

url = "https://gdp.genta.tech/parse-pdf"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN"
}
data = {
    "output_type": "semantic",
    "caption_image": "true",
    "strategy": "auto"
}

# Open the file in binary mode; the with block closes it after the upload.
with open("example.pdf", "rb") as f:
    files = {"pdf_file": ("example.pdf", f, "application/pdf")}
    response = requests.post(url, headers=headers, files=files, params=data)

Parse Docs File

Curl

curl -X POST "https://gdp.genta.tech/parse-docx" \
  -H "accept: application/json" \
  -H "Authorization: GENTA_API_TOKEN" \
  -F "docx_file=@example.docx;type=application/docx" \
  -F "output_type=markdown" \
  -F "caption_image=true" \
  -F "strategy=auto"

Python

import requests

url = "https://gdp.genta.tech/parse-docx"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN"
}
data = {
    "output_type": "markdown"
}

# Open the file in binary mode; the with block closes it after the upload.
with open("example.docx", "rb") as f:
    files = {"docx_file": ("example.docx", f, "application/docx")}
    response = requests.post(url, headers=headers, files=files, params=data)

Parse Excel File

Curl

curl -X POST "https://gdp.genta.tech/parse-xlsx" \
  -H "accept: application/json" \
  -H "Authorization: GENTA_API_TOKEN" \
  -F "xlsx_file=@example.xlsx;type=application/xlsx" \
  -F "output_type=markdown"

Python

import requests

url = "https://gdp.genta.tech/parse-xlsx"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN"
}
data = {
    "output_type": "markdown"
}

# Open the file in binary mode; the with block closes it after the upload.
with open("example.xlsx", "rb") as f:
    files = {"xlsx_file": ("example.xlsx", f, "application/xlsx")}
    response = requests.post(url, headers=headers, files=files, params=data)

Parse PowerPoint Presentation File

Curl

curl -X POST "https://gdp.genta.tech/parse-pptx" \
  -H "accept: application/json" \
  -H "Authorization: GENTA_API_TOKEN" \
  -F "pptx_file=@example.pptx;type=application/pptx" \
  -F "output_type=markdown"

Python

import requests

url = "https://gdp.genta.tech/parse-pptx"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN"
}
data = {
    "output_type": "markdown"
}

# Open the file in binary mode; the with block closes it after the upload.
with open("example.pptx", "rb") as f:
    files = {"pptx_file": ("example.pptx", f, "application/pptx")}
    response = requests.post(url, headers=headers, files=files, params=data)

Parse HTML

Curl

curl -X POST "https://gdp.genta.tech/parse-html" \
  -H "accept: application/json" \
  -H "Authorization: GENTA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "output_type": "markdown",
  "html": "The HTML Code"
}'

Python

import requests

url = "https://gdp.genta.tech/parse-html"
headers = {
    "accept": "application/json",
    "Authorization": "GENTA_API_TOKEN"
}
data = {
    "output_type": "markdown",
    "html": "your html code"
}

# json=data serializes the body and sets the Content-Type header to application/json.
response = requests.post(url, headers=headers, json=data)
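
In practice the HTML usually comes from a file rather than a string literal. As a minimal sketch, the hypothetical helper build_html_payload below reads a local HTML file and returns a dictionary with the same two keys as the request body shown here. The helper name and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path


def build_html_payload(html_path: str, output_type: str = "markdown") -> dict:
    """Read an HTML file and return a JSON-serializable body for the parse-html request."""
    # Assumes the file is UTF-8 encoded; adjust the encoding for other sources.
    html = Path(html_path).read_text(encoding="utf-8")
    return {"output_type": output_type, "html": html}
```

The returned dictionary can be passed as the json argument of requests.post in place of the hand-written data dictionary.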

Parameter Description

  1. GENTA_API_TOKEN: Your API Token from Genta Technology.

  2. pdf_file, xlsx_file, docx_file, pptx_file: The form field that carries your PDF, Excel, Docs, or PowerPoint file. Replace "example.filetype" with the path to your file.

  3. output_type: The output format you want. Options are:

    1. regular: Provides a list of elements extracted from the PDF.

    2. markdown: Converts the regular output into a list of elements based on the number of pages and the layout (header types, etc.).

    3. semantic: Chunks the elements based on the context and semantic meaning.

  4. caption_image: If the PDF document contains images and you want to run image-to-text captioning, set this to "true".

  5. strategy: The strategy for parsing the documents. Options are:

    1. "auto": Automatically decides how to parse the documents.

    2. "fast": Directly extracts text from the PDF file.

    3. "ocr": Uses OCR to extract text.

    4. "quality": Uses a layout detection model to identify text and images.
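
The parameters above can be checked locally before a request is sent. The following is a minimal sketch, not part of GDP: plan_request, ROUTES, OUTPUT_TYPES, and STRATEGIES are hypothetical names, and the endpoint and field-name pairs are taken from the Python examples in this Quickstart. Whether every endpoint accepts every option is not confirmed here, so the helper only validates the values themselves.

```python
from pathlib import Path

BASE_URL = "https://gdp.genta.tech"

# Endpoint path and upload field name per file extension (from the Python examples).
ROUTES = {
    ".pdf": ("/parse-pdf", "pdf_file"),
    ".docx": ("/parse-docx", "docx_file"),
    ".xlsx": ("/parse-xlsx", "xlsx_file"),
    ".pptx": ("/parse-pptx", "pptx_file"),
}
OUTPUT_TYPES = {"regular", "markdown", "semantic"}
STRATEGIES = {"auto", "fast", "ocr", "quality"}


def plan_request(path, output_type="markdown", caption_image=False, strategy=None):
    """Return (url, field_name, options) for a document after validating the options."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"Unsupported file type: {path}")
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"Unknown output_type: {output_type}")
    if strategy is not None and strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy}")

    endpoint, field_name = ROUTES[suffix]
    options = {"output_type": output_type}
    if caption_image:
        # Only the "true" value is documented, so the key is omitted otherwise.
        options["caption_image"] = "true"
    if strategy is not None:
        options["strategy"] = strategy
    return BASE_URL + endpoint, field_name, options
```

For example, plan_request("example.pdf", "semantic", caption_image=True, strategy="auto") yields the URL, the pdf_file field name, and the three options used in the PDF example.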

Output Example

{
  "parsed_document": [
    {
      "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Dui nunc mattis enim ut tellus.",
      "metadata": {
        "filename": "/content/ExampleDocs.pdf",
        "languages": ["eng", "ind"],
        "last_modified": null,
        "page_number": 1,
        "text_type": "Header"
      }
    },
    {
      "text": "Dui ut ornare lectus sit amet est placerat. ",
      "metadata": {
        "filename": "/content/ExampleDocs.pdf",
        "languages": ["eng", "ind"],
        "last_modified": null,
        "page_number": 1,
        "text_type": "NarrativeText"
      }
    },
    {
      "text": "Tempus imperdiet nulla malesuada pellentesque elit eget.",
      "metadata": {
        "filename": "/content/ExampleDocs.pdf",
        "languages": ["eng", "ind"],
        "last_modified": null,
        "page_number": 1,
        "image_base64": "encoded image",
        "text_type": "Image"
      }
    },
    {
      "text": "Condimentum lacinia quis vel eros donec ac odio tempor orci.",
      "metadata": {
        "filename": "/content/ExampleDocs.pdf",
        "languages": ["eng", "ind"],
        "last_modified": null,
        "page_number": 1,
        "text_type": "FigureCaption"
      }
    },
    {
      "text": "Turpis cursus in hac habitasse platea dictumst quisque.",
      "metadata": {
        "filename": "/content/ExampleDocs.pdf",
        "languages": ["eng", "ind"],
        "last_modified": null,
        "page_number": 1,
        "text_type": "NarrativeText"
      }
    }
  ],
  "document_length": 5,
  "time_taken": 4.076918601989746
}
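
A response shaped like the example above can be post-processed with plain dictionary access. The sketch below is illustrative: group_text_by_type is a hypothetical helper that collects each element's text under its metadata text_type, and the "Unknown" fallback label is an assumption for elements that lack that field.

```python
from collections import defaultdict


def group_text_by_type(response_json: dict) -> dict:
    """Group the text of each parsed element by its metadata text_type."""
    groups = defaultdict(list)
    for element in response_json.get("parsed_document", []):
        # Fall back to "Unknown" when an element has no text_type.
        text_type = element.get("metadata", {}).get("text_type", "Unknown")
        groups[text_type].append(element.get("text", "").strip())
    return dict(groups)
```

With the requests library, the input would typically be response.json(). For the example output, the result has the keys Header, NarrativeText, Image, and FigureCaption, with two entries under NarrativeText.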