How Machine Learning Techniques Impact File Analysis
Applying machine learning (ML) and artificial intelligence (AI) techniques to analyze files within a content repository can raise the bar on operating efficiency and produce smarter solutions that bring “structure” to unstructured data. However, not all unstructured data is created equal. The primary challenge is accounting for the differences in file content and context. Technologies based on text-oriented content analysis don’t work well when analyzing non-text files such as images; they are unable to look inside the files and identify their contents. Over the years, systems that rely on upfront metadata tagging have been developed as a workaround for this issue. For instance, a digital asset management system is a good way to organize large collections of images, provided that somebody develops and maintains the relevant metadata, including annotations.

Various types of content need to be handled differently: some have well-defined metadata (e.g., MS Office documents), while others need intensive analysis to extract rich metadata (e.g., audio, video). Metadata can be extended beyond file system attributes to include an in-depth analysis of the content itself. Here are a couple of ways this can be done. Natural Language Processing (NLP) techniques such as Sentiment Analysis determine the tone of the content (positive or negative), and Named Entity Recognition can be used to extract “business entities” such as personal names, addresses, and company names, and to group documents in different ways to allow for faster comprehension of large datasets.

These techniques have their limitations when applied to large business content sets. For example, sentiment analysis is more useful for short documents such as chat logs, and less useful for business documents, which tend to be fairly neutral in tone. Named entity recognition requires a large pre-classified training set, and the usual public training sets are based on location- or country-specific news articles that are not representative of business datasets.

Search engine techniques use machine learning to detect patterns of significant text and phrases within the content. These techniques were mostly designed to work off large public datasets, and they have limitations, and varying degrees of success, with enterprise datasets. For example, quick access to search results is often complicated by a common requirement to layer a set of complex access control structures (roles, groups, hierarchical permissions, etc.) on content within an enterprise.

Using search engine techniques for image searches is only slightly different from text search. A metadata image search engine rarely examines the actual image itself; instead, it relies on text extracted from within the files and/or text used in the description of the image.

Deep Learning (DL) is part of a broader family of machine learning that can learn from unstructured or unlabeled data by building learning algorithms that mimic the brain. Deep learning techniques such as computer vision identify high-level concepts within a target image and build a dictionary of terms that expands search to cover similar concepts. For example, once a model learns to identify images of digging equipment on a construction site, it can expand over time to include similar images. This requires a very large amount of pre-classified content, which is typically based on public image datasets.
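Two short sketches help make these techniques concrete. First, the entity extraction described above: a minimal sketch, assuming the spaCy library and its small English pipeline (neither is prescribed here; the sample text is illustrative).

```python
# Minimal sketch of Named Entity Recognition over document text.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Stand-in for text extracted from a repository document.
text = ("Acme Corp signed a services agreement with Jane Doe's team "
        "at their Chicago office on March 3.")

doc = nlp(text)

# Group the recognized "business entities" by type (ORG, PERSON, GPE, DATE, ...);
# these groupings can then drive faceting or clustering of large document sets.
entities = {}
for ent in doc.ents:
    entities.setdefault(ent.label_, []).append(ent.text)

print(entities)
# e.g. {'ORG': ['Acme Corp'], 'PERSON': ["Jane Doe"], 'GPE': ['Chicago'], 'DATE': ['March 3']}
```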
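Second, the image-tagging idea behind deep learning search: a minimal sketch, assuming a torchvision classifier pretrained on the public ImageNet dataset; the file name is illustrative.

```python
# Minimal sketch of concept tagging for an image using a pretrained classifier.
# Assumes: pip install torch torchvision pillow; "site_photo.jpg" is illustrative.
import torch
from PIL import Image
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT          # trained on the public ImageNet dataset
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()           # resizing/normalization the model expects

img = Image.open("site_photo.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)        # add a batch dimension

with torch.no_grad():
    probs = model(batch).softmax(dim=1)[0]

# Keep the top-scoring concepts as searchable metadata tags.
top = probs.topk(5)
tags = [weights.meta["categories"][int(i)] for i in top.indices]
print(tags)  # e.g. ['crane', 'bulldozer', ...] for construction imagery
```

Because the pretrained categories come from a public dataset, the resulting tags are generic, which is exactly the limitation addressed next.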
To extract meaningful information from images, however, training needs to occur on a comprehensive sample of enterprise content, rather than on generic concepts that are less useful.

Another technique is word vector analysis, which identifies significant keywords or phrases within the content and their interrelationships in order to find synonyms and antonyms and to build business-domain vocabularies that can be used to increase the accuracy of search results.
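As a minimal sketch of this idea, assuming gensim's Word2Vec and a toy tokenized corpus standing in for repository text:

```python
# Minimal sketch of word vector analysis for building a domain vocabulary.
# Assumes: pip install gensim; the toy corpus stands in for tokenized repository text.
from gensim.models import Word2Vec

corpus = [
    ["purchase", "order", "invoice", "vendor", "payment"],
    ["invoice", "payment", "terms", "vendor", "remittance"],
    ["statement", "of", "work", "contract", "vendor", "procurement"],
]

# Small vectors and many epochs only because the toy corpus is tiny.
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, epochs=200, workers=1, seed=1)

# Terms that land near each other can seed a business-domain synonym list
# used for query expansion in search.
print(model.wv.most_similar("invoice", topn=3))
```

On a real repository, the corpus would be the tokenized text of the documents themselves, and the nearest neighbors of a term would seed the domain vocabulary.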
Recommendation systems

To derive meaningful results from enterprise content repositories, it’s clear we need to retrain typical ML/AI algorithms on appropriate datasets of reasonable size. Therefore, more time and effort must be spent on gathering and annotating sample datasets than on tuning existing algorithms. With a rich set of metadata attributes in hand, we can leverage them to construct a variety of non-intrusive, intelligent features that work side by side with the user to pull up appropriate content at the right time. For example, we can build recommendation systems to suggest potential collaborators and interesting content. These recommendation systems can be built using a couple of different approaches:

- Collaborative filtering, which leverages groups of similar users to recommend content created or accessed by one user to other users, and
- Content-based recommendation systems, which drive off content metadata to identify content similar to what was previously accessed or is currently being accessed (see the sketch after this list).
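Here is a minimal sketch of the content-based approach; the TF-IDF representation, scikit-learn, and the sample metadata strings are assumptions for illustration, not a prescribed design.

```python
# Minimal sketch of content-based recommendation over document metadata.
# Assumes: pip install scikit-learn; the metadata strings are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each string stands in for the metadata/tags extracted for one document.
docs = {
    "q3_forecast.xlsx": "finance forecast revenue quarterly spreadsheet",
    "q2_forecast.xlsx": "finance forecast revenue quarterly spreadsheet",
    "site_photo.jpg":   "image construction excavator site safety",
    "vendor_list.docx": "procurement vendor contract finance",
}

names = list(docs)
tfidf = TfidfVectorizer().fit_transform(docs.values())
sim = cosine_similarity(tfidf)              # pairwise document similarity

def recommend(name, k=2):
    """Return the k items most similar to the one being accessed."""
    i = names.index(name)
    ranked = sim[i].argsort()[::-1]         # most similar first
    return [names[j] for j in ranked if j != i][:k]

print(recommend("q3_forecast.xlsx"))        # e.g. ['q2_forecast.xlsx', ...]
```

A collaborative-filtering variant would swap the metadata matrix for a user-by-item interaction matrix and compute similarity between users rather than between items.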
Collaborative filtering requires a large population of users to be meaningful, and content-based recommendation requires a large set of historical access patterns to be useful. Typically these problems are mitigated by combining both approaches and switching from one to the other as appropriate.

From consumer to enterprise

A variety of ML/AI techniques are available to deliver intelligent features within content repositories that can learn and become better over time. But blindly applying approaches from consumer applications to enterprise datasets can be a frustrating experience; be discriminating to deliver meaningful results. In the next article, we will dive into specific applications of these techniques to problems in Data Governance and Collaboration products.