IBM Service Employs Machine Learning Algorithms to Convert PDF Documents

    At the KDD 2018 conference in London next week, IBM will preview a forthcoming cloud service that uses machine learning algorithms to ingest PDF documents so that the data in those documents can be used to train artificial intelligence (AI) models.

    Costas Bekas, a distinguished research staff member and manager of Foundations of Cognitive Solutions within IBM Research, says that while optical character recognition (OCR) technologies have been available to digitize documents for decades, the IBM Corpus Conversion Service employs machine learning algorithms to make it possible to digitize 100,000 PDF documents a day using a single blade server.

    More importantly, data within those documents can be parsed in a way that makes it possible to query that data directly or via an application programming interface (API) that IBM has crafted for the service, says Bekas.

    “The data ingested is consumable by other services,” says Bekas.

    Bekas says the IBM Corpus Conversion Service, which is scheduled to be generally available on the IBM Cloud later this year, is designed to enable data scientists to overcome the single biggest challenge with creating AI models. Today, it can take months for data scientists to acquire the corpus of data required to train an AI model.

    Rather than relying on inflexible rules to identify data, the IBM Corpus Conversion Service is designed to make it possible to ingest data in a way that identifies segments of the documents such as the abstract regardless of where it appears in the document or which font size is employed, says Bekas.

    That capability means that rather than having to rely on a team of data scientists to annotate data, organizations will be able to employ office workers who may only have a high school diploma to ingest data at the push of a button, says Bekas.

    The rate at which AI models can be trained is being hampered by a lack of access to content that is structured in a way that those AI models can comprehend. By digitizing PDF documents, IBM expects that the number of data sources that can be used to train AI models will increase exponentially. That in turn promises to significantly increase not just the rate at which AI models can be developed and trained, but also the number of organizations that can practically afford to create those AI models.

    Mike Vizard
    Michael Vizard is a seasoned IT journalist, with nearly 30 years of experience writing and editing about enterprise IT issues. He is a contributor to publications including Programmableweb, IT Business Edge, CIOinsight and UBM Tech. He formerly was editorial director for Ziff-Davis Enterprise, where he launched the company’s custom content division, and has also served as editor in chief for CRN and InfoWorld. He also has held editorial positions at PC Week, Computerworld and Digital Review.
