The Security Risks of Enterprise Search

We expose five security risks introduced by enterprise search and explain how you can protect your enterprise from these vulnerabilities.

by Sid Probstein

Balancing the demand for security with access to the enormous volume of information in an enterprise presents great challenges for IT. Many enterprises want to empower their employees with fast, simple access to information and analysis. At the same time, security and compliance are a growing concern. A major issue is the dynamic nature of business. Every day, content is added and people change jobs or move between projects. In major reorganizations or mergers and acquisitions, massive changes may occur in a single day. Many enterprise applications, especially those powered by relational database management systems (RDBMSs), provide administrative tools for managing access to information based on users’ permissions, roles, and associations (groups). The standardized, guarded environment of the relational database allows for this level of control.

Not all enterprise information can be generated and stored in a database. Some applications handle unstructured data (such as text and documents) that must use a different technology -- the search index. Designed and optimized for querying information of many types and formats, enterprise search software provides rapid answers to a huge range of queries.

This performance, however, comes at a price: most search indexes update slowly and inflexibly due to their document-centric nature. For example, updating a single field in a document may require reprocessing of the entire document. In many popular applications (such as Web search and public information portals), this is not an issue. Inside the enterprise, however, and especially in environments where multiple silos are brought together into a single index, this update challenge becomes problematic.

Here are five potential security risks -- some well known, some not -- that can be associated with enterprise search.

Security Risk #1: Bypassing Security Structures via Connectors and Plug-ins

Every organization uses connectors and plug-ins in some capacity. They make it easy to extract data from multiple and often siloed sources by plugging into disparate systems. However, connectors can create a significant security risk: even though users have levels of clearance, once the data is extracted by the connector, it is no longer subject to the source system’s security provisions and thus can be compromised. For example, suppose an enterprise uses a connector to access information in a content management system (CMS). When a privileged user pulls content, that content is then indexed in the search engine and its connector. The user has essentially gone around the security within the CMS and made that content available, regardless of the security or permissioning models used by the CMS.

In such a case, the IT department did not consider provisioning the connector and search engine with the same security levels as the CMS, exposing the enterprise to data leaks. It is vital to treat all connectors and plug-ins as an extension to the server, systems, and hardware; the connectors must have security checks and balances in place before they are used.
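One way to picture "the same security levels as the CMS" is a connector that copies each item's access list into the index along with its text. The following Python sketch is illustrative only: `CmsItem`, `IndexedDoc`, `extract`, and `search` are hypothetical names, not the API of any particular CMS or search product, and real permission models are usually richer than a set of group names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CmsItem:
    """Hypothetical record as stored in the source CMS."""
    doc_id: str
    text: str
    allowed_groups: frozenset


@dataclass(frozen=True)
class IndexedDoc:
    """Hypothetical index entry that keeps the source ACL next to the text."""
    doc_id: str
    text: str
    allowed_groups: frozenset


def extract(items):
    """Connector step: carry the ACL across instead of dropping it."""
    return [IndexedDoc(i.doc_id, i.text, i.allowed_groups) for i in items]


def search(index, term, user_groups):
    """Return only matches that at least one of the user's groups may see."""
    groups = frozenset(user_groups)
    return [
        d.doc_id
        for d in index
        if term.lower() in d.text.lower() and d.allowed_groups & groups
    ]


cms = [
    CmsItem("hr-1", "Salary review schedule", frozenset({"hr"})),
    CmsItem("pub-1", "Holiday schedule", frozenset({"hr", "staff"})),
]
index = extract(cms)
staff_hits = search(index, "schedule", {"staff"})
hr_hits = search(index, "schedule", {"hr"})
```

In this sketch a member of the assumed "staff" group sees only the public document, while a member of "hr" sees both, which mirrors what the CMS itself would allow.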

Security Risk #2: Data Leakage through Linguistic Analysis

Search engines often provide useful “meta information” about a search. For example, a search engine may identify entities related to the search or suggest alternative search terms or spelling. With search engines that use a “late binding” model -- wherein searches are unsecured and then each document is checked against a security authority to allow relay to the user -- meta information may not be filtered correctly.
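The late binding flow and the way meta information can slip past it can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the in-memory `DOCS` list, the `is_authorized` stand-in for the security authority, and the idea that "related entities" are simply counted from the hits.

```python
from collections import Counter

# Hypothetical index: each document has text, extracted entities, and an ACL.
DOCS = [
    {"id": "d1", "text": "merger plan", "entities": ["Acme Corp"], "acl": {"exec"}},
    {"id": "d2", "text": "merger FAQ", "entities": ["Legal"], "acl": {"exec", "staff"}},
]


def is_authorized(doc, user_groups):
    """Stand-in for the live security authority check."""
    return bool(doc["acl"] & set(user_groups))


def late_bound_search(term, user_groups):
    # Step 1: the query itself runs unsecured.
    raw_hits = [d for d in DOCS if term in d["text"]]
    # Step 2: each hit is checked against the security authority.
    visible = [d for d in raw_hits if is_authorized(d, user_groups)]
    # Leaky: related entities computed from the unfiltered hits.
    leaky = Counter(e for d in raw_hits for e in d["entities"])
    # Safer: related entities computed only from what the user may see.
    safe = Counter(e for d in visible for e in d["entities"])
    return [d["id"] for d in visible], sorted(leaky), sorted(safe)


ids, leaky_entities, safe_entities = late_bound_search("merger", {"staff"})
```

The document list is filtered correctly in both cases; the difference is that the leaky entity list still names "Acme Corp", which comes from a document the user was never allowed to see.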

Another potential leak can result from the use of latent semantic indexing (LSI) or other purely statistical approaches to document analysis that involve determining what words are heavily correlated and using those correlations to index and conceptually understand the relationships between entities. These approaches are suitable on a public site where commerce is the goal, but when they are used in an internal system with a great deal of sensitive data, they pose security concerns.

For example, if the term ‘mergers and acquisitions’ is searched, the system may bring back content recommendations that include confidential information (such as a competitor’s name) because others are conducting similar research or have created documentation relevant to the search term. Although the actual document that identifies a merger and acquisition strategy with the competitor would not be fully exposed, the user could certainly make “assumptions” based on the search patterns culled from the linguistic analysis. This is a subtle type of breach, but given the sensitivity of such matters, avoiding any suspicions and “talk” is the best approach.

Security Risk #3: Security Leaks Due to Latency -- Synching Search with Security

It might be assumed that the concern here is with the security system’s ability to keep up with the search engine, but in reality the search engine lags behind the security system. This can cause problems. Most security systems are implemented in a relational database, which is updated quickly.

That said, there are often many directories within an organization with hundreds (if not thousands) of files; although security updates are made to the indexes as a whole, getting down to the item level can take time -- sometimes hours. During that time, it is easy to conduct a search and index the results, thereby releasing that information to an unauthorized party. Even if that person has no intent of misusing the data and is unaware of the security clearance changes, the data has still been leaked. Additionally, it is not uncommon for search engines to pull from previously cached data or to post snippets from the cache that are no longer freely accessible.

This could become a major security issue. For example, suppose a company has terminated a division and removed that group’s permissions to access files. Access to the files themselves may be removed instantaneously, but the search engine might not know about the action for hours. During that time, these individuals still have access to the search engine queues and can open past copies of documents or find information to which they should no longer have access. Many search engines implement the “late bound” security model, wherein they check against a live security authority to verify that the user who ran the query is authorized to see a particular document in the results list; see Risk #2 above for details of why this can create leaks, especially around meta information.
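The gap between the security system and the index can be illustrated with a small Python sketch. The two dictionaries and both function names are hypothetical; they stand in for an index that still carries an old permission snapshot and a live security authority that has already been updated.

```python
# Hypothetical state: the index still holds an older ACL snapshot,
# while the live directory has already removed the "division-x" group.
INDEX_ACL_SNAPSHOT = {"plan.doc": {"division-x", "exec"}}
LIVE_ACL = {"plan.doc": {"exec"}}


def visible_from_index_only(doc_id, user_groups):
    """Trusts the stale snapshot stored with the index."""
    return bool(INDEX_ACL_SNAPSHOT.get(doc_id, set()) & set(user_groups))


def visible_with_live_check(doc_id, user_groups):
    """Requires both the snapshot and the live authority to agree."""
    live_ok = bool(LIVE_ACL.get(doc_id, set()) & set(user_groups))
    return visible_from_index_only(doc_id, user_groups) and live_ok


stale_result = visible_from_index_only("plan.doc", {"division-x"})
checked_result = visible_with_live_check("plan.doc", {"division-x"})
exec_result = visible_with_live_check("plan.doc", {"exec"})
```

Until the index catches up, the snapshot alone would still show the document to the terminated group. The live check closes that gap for the result list, though cached copies and snippets would need the same treatment.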

Security Risk #4: Leaks Due to Hybrid Systems and Item-level Security Clearance

In enterprises where IT is forced to work with limited resources, it is common to “forget” about the search engine when focusing on larger security matters. Many organizations assume it will suffice to have security controls on the individual documents and multiple databases being searched. Under that model, if a search finds 200 relevant items, a user who lacks access to some of them will be blocked only when he or she clicks the secured links.

Although that model seems appealing, cost and time constraints introduce potential problems. First, if you can get to the search engine without having to talk to the database, then (in theory) you can search across any domain. Additionally, if users can “see” that there is more information locked in the system, they may be compelled to create workarounds to get to that information -- such as using a combination of search terms to create a denial-of-service attack. Beyond the security issues behind this type of search and retrieval, end users who are forced to click on irrelevant documents or are blocked from information that may be pertinent to their needs can grow frustrated, causing them to abandon the query altogether.
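The contrast between blocking at click time and applying item-level security in the result list itself can be shown with a short Python sketch. The `HITS` data and both function names are made up for illustration and do not describe any specific product.

```python
# Hypothetical result set: (doc_id, groups allowed to open the document).
HITS = [
    ("budget.xls", {"finance"}),
    ("handbook.pdf", {"finance", "staff"}),
    ("layoffs.doc", {"exec"}),
]


def click_time_listing(hits):
    """Hybrid model: list everything and rely on the source to block clicks."""
    return [doc_id for doc_id, _ in hits]


def filtered_listing(hits, user_groups):
    """Item-level model: drop anything the user could not open anyway."""
    groups = set(user_groups)
    return [doc_id for doc_id, acl in hits if acl & groups]


shown_hybrid = click_time_listing(HITS)
shown_filtered = filtered_listing(HITS, {"staff"})
```

In the first listing a "staff" user learns that documents named budget.xls and layoffs.doc exist even though the links will fail; in the second, only the document that user can actually open is shown.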

Security Risk #5: Hacking the Search Engine

HTTP is an insecure protocol: traffic sent over it can easily be intercepted using simple downloads or snooping software found on the Internet. IT must treat the search engine as it would any other piece of technology that can exchange sensitive, confidential information. The search engine needs to be tested and verified regularly. If an enterprise enables searches of sensitive information, even with provisions and levels of clearance, it must understand how to protect its intellectual property.
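As one small, concrete precaution, client code can refuse to send queries to a search endpoint over plain HTTP. This Python sketch assumes the search engine can be served over HTTPS; the `require_https` helper and the hostname are hypothetical.

```python
from urllib.parse import urlsplit


def require_https(endpoint):
    """Return the endpoint unchanged if it uses HTTPS, else raise ValueError."""
    parts = urlsplit(endpoint)
    if parts.scheme.lower() != "https" or not parts.netloc:
        raise ValueError(f"refusing non-HTTPS search endpoint: {endpoint!r}")
    return endpoint


# Hypothetical internal hostname, used only for illustration.
ok = require_https("https://search.example.internal/query")

try:
    require_https("http://search.example.internal/query")
    rejected = False
except ValueError:
    rejected = True
```

A check like this does not secure the search engine by itself, but it keeps queries and results from crossing the network in clear text by accident.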

Protecting Your Enterprise

A large part of the solution is treating the search engine like any type of IT hardware that contains potentially confidential information. Actions include:

  • Follow the industry: Adopt standard best practices around security, including security recommendations from the server manufacturer
  • Be deliberate: Include all connectors and plug-ins in your security evaluations
  • Conduct audits: Regularly log requests, changes, etc.; search for patterns or inconsistencies
  • Perform risk analysis: Understand the risks; determine if you want to make confidential information accessible
  • Select the right vendors: Choose vendors that will work with you, that want to mitigate the risks, and that will work to understand potential pitfalls
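The audit step lends itself to a simple illustration. The Python sketch below assumes a hypothetical log of (user, query, outcome) entries and flags users with repeated denied requests; the log format, the `flag_probing` name, and the threshold of three are arbitrary choices for the example.

```python
from collections import Counter

# Hypothetical audit log entries: (user, query, outcome).
AUDIT_LOG = [
    ("alice", "merger", "denied"),
    ("alice", "merger acme", "denied"),
    ("alice", "acquisition acme", "denied"),
    ("bob", "holiday schedule", "allowed"),
    ("bob", "merger", "denied"),
]


def flag_probing(log, threshold):
    """Return users whose denied requests meet or exceed the threshold."""
    denied = Counter(user for user, _, outcome in log if outcome == "denied")
    return sorted(user for user, count in denied.items() if count >= threshold)


flagged = flag_probing(AUDIT_LOG, threshold=3)
```

A pattern such as several denied variations on the same topic from one account is the kind of inconsistency worth a closer look, although the right threshold will depend on the organization.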

When evaluating your search engine strategy, take the time to review all your options. Although many companies are aware of the issues outlined here, some of the more subtle threats are those that can cause long-term problems. There are new technology solutions to tackle your security concerns. For example, unified information access provides the security of a relational database with the flexibility and performance of full-text search. It also takes much of the guesswork out of search because it brings together structured and unstructured data as well as the ability to analyze that data for better overall results.

Sid Probstein is the chief technology officer at Attivio where he is responsible for technology strategy and innovation. Sid has more than 15 years of experience leading successful engineering organizations and building complex, high-performance systems. Previously, he was CTO at GCi, where he headed development of the company’s next-generation commerce platform. He also served as vice president of technology at Fast Search & Transfer, where he developed and applied next-generation search, text mining, and multimedia capabilities. You can contact the author at sid@attivio.com.
