Apache Tika

Updated a month ago by Blitline Support

APACHE TIKA

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Blitline supports information retrieval from documents such as PDF and XLS. Not only can Blitline rasterize documents into images, you can now also retrieve the data stored within those documents. This lets you retrieve the text of various documents (such as PDF, Word, or EPUB) along with their thumbnails.

A common use case would be to extract the text from PDF documents while thumbnailing them, then push that text and metadata into an Elasticsearch system for indexing.
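The indexing half of that use case might look roughly like the sketch below, which turns already-extracted text and metadata into a request body for the Elasticsearch bulk API (one action line followed by one document line per record). It makes no claim about how Blitline returns Tika data: the helper name to_bulk_payload, the index name, and the record keys id, text, metadata, and thumbnail_url are all hypothetical and chosen only for illustration.

```python
import json


def to_bulk_payload(index_name, documents):
    """Build a newline-delimited body for the Elasticsearch bulk API.

    Each item in `documents` is a dict with the hypothetical keys
    "id", "text", "metadata", and "thumbnail_url".
    """
    lines = []
    for doc in documents:
        # Action line: index this record under the given id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the fields to make searchable.
        lines.append(json.dumps({
            "text": doc["text"],
            "metadata": doc["metadata"],
            "thumbnail_url": doc["thumbnail_url"],
        }))
    if not lines:
        return ""
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent to the cluster's _bulk endpoint with a Content-Type of application/x-ndjson; the record fields should be adapted to whatever your job results actually contain.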

HOW TO USE IT:

Add the get_tika option, set to true, at the root of your job JSON:


{
    "application_id":"YOUR_APP_ID",
    "src":"https://s3.amazonaws.com/blitdoc/docx/Contoso.xlsx",
    "get_tika" : "true",
    "v" : 1.22,
    "functions":[
        {
            "name":"crop",
            "params":{
                "gravity": "NorthGravity",
                "width":100
            },
            "save":{
                "image_identifier":"MY_CLIENT_ID"
            }
        }
    ]
}
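As a rough illustration, the job above could be assembled and submitted from Python along these lines. This is a sketch under assumptions, not official client code: the helper names build_tika_job and submit_job are hypothetical, and both the endpoint URL and the form field named json are assumptions that should be checked against the Blitline job documentation before use.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm it against the Blitline job documentation.
BLITLINE_JOB_URL = "https://api.blitline.com/job"


def build_tika_job(application_id, src):
    """Return a job dict mirroring the JSON example above."""
    return {
        "application_id": application_id,
        "src": src,
        "get_tika": "true",  # same value as in the JSON example
        "v": 1.22,
        "functions": [
            {
                "name": "crop",
                "params": {"gravity": "NorthGravity", "width": 100},
                "save": {"image_identifier": "MY_CLIENT_ID"},
            }
        ],
    }


def submit_job(job, url=BLITLINE_JOB_URL, timeout=30):
    """POST the job as a form field named "json" (an assumption) and decode the reply."""
    body = urllib.parse.urlencode({"json": json.dumps(job)}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    job = build_tika_job(
        "YOUR_APP_ID",
        "https://s3.amazonaws.com/blitdoc/docx/Contoso.xlsx",
    )
    # Print the payload for inspection; call submit_job(job) to send it.
    print(json.dumps(job, indent=4))
```

Building the payload separately from sending it makes it easy to inspect the JSON before any request goes out.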

See an example here:

Example: Tika Example
