As explained in the article, “Find Everything With ClickOnSite Global Search”, ITD knows that our customers need to be able to quickly find specific information among the large amounts of data they store for their businesses to run efficiently.
Our CTO, François Pouilloux, has a 10-year record of integrating and developing search technologies together with some of the best university labs in Europe, and before joining ITD worked for companies offering specialized search for enterprise clients. With such a knowledgeable resource at hand, I took the opportunity to do a deeper dive with him about search.
How different is ClickOnSite global search from what people are used to on the web?
François: There are two sides to this coin:
- User input: we tried to keep it very close to typical web search: one search box, no parameters. The user enters keywords or phrases, and the best results are shown first. In this way it is familiar.
- Results: the difference lies in the nature of the information the search has scanned and retrieved. Unlike most web searches, which primarily index web pages (including documents embedded in them), ClickOnSite search indexes:
  - Information on intranet pages (like web searches)
  - Documents (like web searches)
  - Information in entities (database)
  - Information in tasks (BPMN 2.0)
This means the user gets an aggregated result mixing entities, documents and tasks.
The structure of the results list is also quite minimalist and different from what you find in web search: we include neither a summary snippet nor highlighted keywords. Our philosophy is that in the context of ClickOnSite they would not be very useful and would just clutter the results. Our customers’ content is more focused and business specific, so the title, identifier and document type are probably sufficient for a user to decide whether a result is worth opening.
Why did ITD choose Elasticsearch as its search engine?
François: Elasticsearch is one of the most performant and powerful search engines out there. It is widely used, open-source, flexible, and we can modify it. It is established, very stable and used in many contexts. Because of this, it is a very rich environment.
It also has a large community behind it, which provides contributions such as code, support, ideas and innovation. We want to be part of this community and contribute where we can, too.
In short, Elasticsearch encompasses all the things that make open source successful.
When people think of open source, they usually think the primary driver is that it is “free”, that you don’t need to pay for a license. I want to emphasize that the fact we don’t have to pay for a license is not why we chose Elasticsearch. We use it for the reasons I just stated: flexibility, stability and a large community. In the case of Elasticsearch, being community-driven means it is more powerful than if it were guided by just one company.
Which advantages does Elasticsearch have for ClickOnSite users? Why is Elasticsearch good for ITD customers?
François: Nowadays everyone is used to going straight to a search box to find information. People don’t care *where* information is stored, they just want to be able to find and access it.
The kind of search most people think of is search across web pages. Elasticsearch is versatile and can also be optimized to aggregate from databases. There is a subtle difference between the two, especially in how the data is put into context and related to other data.
Again, most people think of search as something they use when they want to go find some information. But with Elasticsearch monitoring the whole data ecosystem, you gain significant, if subtle, advantages:
- It makes project team members’ lives easier by letting them find relevant information from across many sources
- You won’t miss information because it is all in one place, all indexed and accurate in near real-time
- It improves usability of information that is stored in ClickOnSite; you can gather things in different ways and export them to analyze or share
Another advantage for our customers is at the IT level: if a company is already using Elasticsearch, or a comparable search technology, for its intranet, we can imagine federating, i.e. integrating, the results. This means the enterprise data could be brought into ClickOnSite, or the ClickOnSite data could be brought into the enterprise search. This gets rid of silos of data within the company.
So you see, search is more than just finding information; it is a method to organize and use information across the whole company.
Elasticsearch is also built with scalability in mind and has clustering capability out of the box, making it possible to index huge amounts of information just by adding hardware and appropriate expertise in a cluster configuration. Elasticsearch also provides more advanced “enterprise” features under paid licenses, so if such features are needed, it is technically easy to add them.
Which information in ClickOnSite gets indexed?
François: Content from the thumbnail views in ClickOnSite is the natural candidate for indexing because it already contains the most significant information about the associated entities. Also indexed are:
- Tasks (e.g. candidate search, TSSR, work orders)
  - Text inside the task: name, reference, attachments, dates
- A whitelist of entities you want to be indexed (e.g. sites, assets, leases)
  - We only index entities configured in the whitelist, to avoid indexing irrelevant content
- All standard text documents, such as Word documents and PDFs, as well as WVGs, including the text of their filenames
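The whitelist idea can be illustrated with a few lines of Python. This is only a sketch under assumptions: the record layout, the `INDEXED_ENTITY_TYPES` set and the `select_for_indexing` helper are hypothetical names chosen for illustration, not ClickOnSite's actual configuration or code.

```python
# Hypothetical whitelist of entity types that should reach the search index.
INDEXED_ENTITY_TYPES = {"site", "asset", "lease"}


def select_for_indexing(records, whitelist=INDEXED_ENTITY_TYPES):
    """Keep only the records whose entity type appears in the whitelist."""
    return [record for record in records if record["type"] in whitelist]


# Illustrative records; the field names are assumptions.
records = [
    {"id": "S-001", "type": "site", "title": "Tower North"},
    {"id": "X-042", "type": "audit_log", "title": "Login event"},
    {"id": "L-007", "type": "lease", "title": "Ground lease"},
]

to_index = select_for_indexing(records)
```

Anything not listed in the whitelist, such as the `audit_log` record here, never reaches the index, which keeps irrelevant content out of the results.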
Can you explain how relevance of results is calculated/determined?
François: Tokens are extracted from query text (at query time) and indexed documents (at index time): the combination of tokens creates a kind of signature for the query and each “document”. When executing the query, the engine evaluates how well the signatures of documents match the signature of the query: very roughly, the relevance is an indication of how well the document’s signature matches the query signature.
There are various ways to compute relevance, such as TF/IDF (term frequency/inverse document frequency). The challenge is finding a computation that is both very fast and resilient to language and content diversity.
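To make the idea of TF/IDF concrete, here is a minimal sketch in Python. It is a textbook-style simplification, not the scoring formula Elasticsearch actually uses, and the whitespace tokenizer, the `tf_idf_scores` function and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Return one relevance score per document for the given query."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(doc_tokens)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # term absent from this document
            # Rare terms weigh more than terms found in many documents.
            idf = math.log(1 + n_docs / df[term])
            score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


documents = [
    "site lease renewal for tower site",
    "work order for antenna replacement",
    "lease agreement document",
]
scores = tf_idf_scores("site lease", documents)
# Document indexes ordered from most to least relevant.
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

With these sample documents, the first one ranks highest because it contains both query terms, and "site" is the rarer of the two; the second document matches nothing and scores zero.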
Since the “signature” is based on tokens, it’s easy to understand that token generation has a big impact on the final result. Usually tokens are generated from raw text by a pipeline of linguistic analyzers, some of them language-agnostic and others specific to each language.
We implemented different kinds of analyzers, including a language detection analyzer, so that we are able to support language specific analyzers for common languages.
Most of these analyzers are also open source, but need proper tuning to produce good tokens from our content.
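The analyzer pipeline described above could be sketched roughly as follows. This is a toy illustration in plain Python, not Elasticsearch's analyzer API or ITD's implementation: the stop-word lists, the stop-word-overlap language detection and the function names are all assumptions.

```python
import re

# Tiny, hypothetical stop-word lists; real analyzers use far larger ones.
STOP_WORDS = {
    "en": {"the", "of", "and", "for", "is"},
    "fr": {"le", "la", "de", "et", "pour", "est"},
}


def base_tokens(text):
    """Language-agnostic step: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def detect_language(tokens):
    """Naive detection: pick the language whose stop words occur most often."""
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    return max(counts, key=counts.get)


def analyze(text):
    """Run the pipeline: tokenize, detect language, drop its stop words."""
    tokens = base_tokens(text)
    lang = detect_language(tokens)
    return lang, [t for t in tokens if t not in STOP_WORDS[lang]]


english = analyze("The lease of the site")
french = analyze("Le bail de la tour")
```

The same raw text produces different tokens depending on which language-specific step runs, which is why tuning the analyzers matters so much for relevance.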
How do you see the future of search in ClickOnSite?
François: I believe the value of a search engine goes beyond just thinking of a keyword, entering it into a box and finding related information: indexing and search can be a way for you to summarize your data.
With that in mind, the future developments I am thinking about include:
- Many different views on the data, which become possible once you have the technology
- Watchers and monitors which notify people when information has changed or been updated
- Daily queries to see changes and do data analysis
- Visualizations of your data
- Search results displayed on a map