IBM Service Employs Machine Learning Algorithms to Convert PDF Documents

Written By
Mike Vizard
Aug 15, 2018

IBM will preview next week at the KDD 2018 conference in London a forthcoming cloud service that uses machine learning algorithms to ingest PDF documents so that the data in those documents can be used to train artificial intelligence (AI) models.

Costas Bekas, a distinguished research staff member and manager of Foundations of Cognitive Solutions within IBM Research, says that while optical character recognition (OCR) technologies have been available to digitize documents for decades, the IBM Corpus Conversion Service employs machine learning algorithms to make it possible to digitize 100,000 PDF documents a day using a single blade server.
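To put that throughput figure in perspective, the short calculation below converts 100,000 documents a day into a per-document time budget. It is only a back-of-the-envelope sketch: it assumes the server runs around the clock at a steady rate, which the article does not state, and the function names are illustrative rather than part of any IBM API.

```python
# Back-of-the-envelope conversion of a daily throughput figure.
# Assumption (not from the article): the server processes documents
# continuously, 24 hours a day, at a constant rate.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_per_second(docs_per_day: int) -> float:
    """Average number of documents processed each second."""
    return docs_per_day / SECONDS_PER_DAY


def seconds_per_doc(docs_per_day: int) -> float:
    """Average time available to process one document."""
    return SECONDS_PER_DAY / docs_per_day


if __name__ == "__main__":
    daily_volume = 100_000  # figure quoted by Bekas
    print(f"{docs_per_second(daily_volume):.2f} documents per second")
    print(f"{seconds_per_doc(daily_volume):.3f} seconds per document")
```

Under that assumption, a single blade server would need to finish a document roughly every 0.86 seconds on average.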

More importantly, data within those documents can be parsed in a way that makes it possible to query that data directly or via an application programming interface (API) that IBM has crafted for the service, says Bekas.

“The data ingested is consumable by other services,” says Bekas.

Bekas says the IBM Corpus Conversion Service, which is scheduled to be generally available on the IBM Cloud later this year, is designed to enable data scientists to overcome the single biggest challenge with creating AI models. Today, it can take months for data scientists to acquire the corpus of data required to train an AI model.

Rather than relying on inflexible rules to identify data, the IBM Corpus Conversion Service is designed to identify segments of a document, such as the abstract, regardless of where they appear in the document or which font size is employed, says Bekas.

That capability means that rather than having to rely on a team of data scientists to annotate data, organizations will be able to employ office workers who may only have a high school diploma to ingest data at the push of a button, says Bekas.

The rate at which AI models can be trained is being hampered by a lack of access to content that is structured in a way that those AI models can comprehend. By digitizing PDF documents, IBM expects that the number of data sources that can be used to train AI models will increase exponentially. That in turn promises to significantly increase not just the rate at which AI models can be developed and trained, but also the number of organizations that can practically afford to create those AI models.

Michael Vizard is a seasoned IT journalist, with nearly 30 years of experience writing and editing about enterprise IT issues. He is a contributor to publications including ProgrammableWeb, IT Business Edge, CIOinsight and UBM Tech. He formerly was editorial director for Ziff-Davis Enterprise, where he launched the company’s custom content division, and has also served as editor in chief for CRN and InfoWorld. He also has held editorial positions at PC Week, Computerworld and Digital Review.
