IBM Service Employs Machine Learning Algorithms to Convert PDF Documents


Next week at the KDD 2018 conference in London, IBM will preview a forthcoming cloud service that uses machine learning algorithms to ingest PDF documents so that the data in those documents can be used to train artificial intelligence (AI) models.

Costas Bekas, a distinguished research staff member and manager of Foundations of Cognitive Solutions within IBM Research, says that while optical character recognition (OCR) technologies have been available to digitize documents for decades, the IBM Corpus Conversion Service employs machine learning algorithms to make it possible to digitize 100,000 PDF documents a day using a single blade server.
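To put the 100,000-documents-a-day figure in perspective, the short calculation below converts it to a sustained per-second rate. It is only a back-of-the-envelope sketch: it assumes the server runs around the clock at a constant pace, which the article does not state, and the function name `throughput` is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def throughput(docs_per_day):
    """Return (documents per second, seconds per document).

    Assumes continuous, evenly paced 24-hour operation (an assumption,
    not something stated by IBM).
    """
    docs_per_second = docs_per_day / SECONDS_PER_DAY
    seconds_per_doc = SECONDS_PER_DAY / docs_per_day
    return docs_per_second, seconds_per_doc


rate, latency = throughput(100_000)
print(f"{rate:.2f} documents per second, {latency:.3f} seconds per document")
# prints "1.16 documents per second, 0.864 seconds per document"
```

Under that assumption, the claim works out to a little more than one PDF document per second on a single blade server.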

More importantly, data within those documents can be parsed in a way that makes it possible to query that data directly or via an application programming interface (API) that IBM has crafted for the service, says Bekas.

“The data ingested is consumable by other services,” says Bekas.

The IBM Corpus Conversion Service is scheduled to be generally available on the IBM Cloud later this year. Bekas says it is designed to help data scientists overcome the single biggest challenge in creating AI models: today, it can take months for data scientists to acquire the corpus of data required to train an AI model.

Rather than relying on inflexible rules to identify data, the IBM Corpus Conversion Service is designed to make it possible to ingest data in a way that identifies segments of the documents such as the abstract regardless of where it appears in the document or which font size is employed, says Bekas.

That capability means that rather than having to rely on a team of data scientists to annotate data, organizations will be able to employ office workers who may only have a high school diploma to ingest data at the push of a button, says Bekas.

The rate at which AI models can be trained is being hampered by a lack of access to content that is structured in a way that those AI models can comprehend. By digitizing PDF documents, IBM expects that the number of data sources that can be used to train AI models will increase exponentially. That in turn promises to significantly increase not just the rate at which AI models can be developed and trained, but also the number of organizations that can practically afford to create those AI models.