What’s wrong with search engines?
Search engines suck.
Unless they are looking for a restaurant nearby or the opening hours of some shop, people are searching for ideas, concepts and methods to accomplish tasks. Search engines, however, deal with keywords. The problem is you don’t know what you don’t know. Which means that:
- you don’t know if the keyword you input in your search request is the proper name for the idea, concept or task you are actually searching for,
- you have no way of evaluating whether the results of the search are reliable or not.
By asking people to translate their search into relevant keywords, search engines assume they already have some expertise in the matter they are researching. Which means newcomers and beginners have no entry point into the matrix, except other humans who will act as a front desk, whether they are paid for it or not.
Case 1
You are trying to understand what saturation means in the context of digital imagery. If you go to your favourite search engine and query “saturation”, you will find many unrelated pages. If you narrow it down to “color saturation”, then you may find more related pages. But there is still a problem…
In color science, “saturation” has a precise meaning, and what people usually call “saturation” is closer to what the Commission Internationale de l’Éclairage (CIE) calls “chroma”, or even “colorfulness”, depending on context. Your favourite search engine does not know that these three words are closely related (and only in the context of imaging techniques), and will search for “saturation” only.
Let’s pretend you have a clever search engine aware of synonyms, knowing saturation is roughly the same as chroma. Now, how do you make the distinction between authoritative sources and clueless Joes writing blogs about their hobbies? Because the clueless Joes will tell you “saturation” is the S in HSL, since they know color only from the GUI of their picture editing software. The CIE has a different definition, which has nothing to do with HSL. And the CIE is an authority on the matter, which you can’t know if you are not already familiar with the topic.
In search engines, sources are not curated, and ranking algorithms guess the relevance and authority of sources using indirect metrics. Try it for yourself and see how the “saturation” definition from the most authoritative source on the matter gets buried deep in the last pages of search results.
Bottom line: you have to be a color expert to find proper resources explaining color theory to dummies. You need to know the exact keyword to look for, and where to look for it. Otherwise, your favourite search engine will most likely point you toward the best SEO-optimized websites (which the CIE website is not). I can vouch for the fact that the authoritative sources on the matter are a nightmare for search engines to index (I had to write custom code for each of them).
Case 2
You are looking for a comparison between the color spaces standardized by the CIE over the years. You may know one of them, CIE 1976, and perhaps the most recent, CIE 2006. What about the others? Are there any others?
So you may use your favourite search engine with “CIE 1976 AND CIE 2006”. The chances that you will end up with a systematic review of color spaces this way are thin.
If you wanted to look for the 3 letters “CIE” followed by any set of 4 digits, you could do it using regular expressions. In particular, the regex would be CIE [0-9]{4}. This would target any CIE color space from any year in one shot (CIE 1964, CIE 1931, etc.).
Unfortunately, online search engines don’t support regular expressions.
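You can still try the regex locally, even if search engines won’t run it. A minimal demonstration with Python’s standard re module (the sample text here is made up for illustration):

```python
import re

# Match the 3 letters "CIE", a space, then exactly four digits.
pattern = re.compile(r"CIE [0-9]{4}")

text = ("The CIE 1931 color space predates CIE 1964; "
        "CIE 1976 introduced perceptual uniformity.")

print(pattern.findall(text))
# → ['CIE 1931', 'CIE 1964', 'CIE 1976']
```

One query, every CIE color space name, no matter the year.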
Case 3
You are looking for how to brighten an image (or perform any other specific task) using a particular niche software package. Of course, you can query “brighten image [name of your software]”.
The first problem you will face is that your favourite search engine will most likely redirect the search toward the most popular software instead of your niche one, because it reckons the probability that you want the popular option is higher than the probability that you meant your niche one, which it may treat as a typo. Google seems to have been aggressively second-guessing specific queries since 2021 or so.
Then again, looking for “how-to” content is tricky… You can drive a nail into a wall using a screwdriver’s handle; that’s possible. It’s probably not safe, at least for the wall and for the screwdriver, and it’s definitely not the best way. But nasty barely-do-the-job tricks like this are more difficult to spot in software matters, where things are conveniently hidden or buried under a GUI.
So, again, the first answer grabbed from a random forum may or may not give you the expected results. You probably want some “official” place where information is manually curated and vetted. And again, that official place may not be the top result.
What next?
From there, you have 2 options:
- Option 1
- You go with what your favourite search engine fetches you. It might be inaccurate, which might be fine as long as you don’t repeat it around. But people like to talk and teach what they are passionate about, so inaccurate and misleading info spreads fast and wide.
- Option 2
- You have a feeling that the info presented to you is doubtful. Luckily, you know that guy who knows, so you send him an email. But you are not the only one, so his mailbox is bursting and he doesn’t answer anymore.
We need better search engines
We need a search engine:
- specialized on imaging and color topics,
- indexing manually-selected sources, no matter how poorly optimized for indexation they are,
- aware of synonyms,
- bilingual.
Chantal solves this by implementing the method from the paper A Dual Embedding Space Model for Document Ranking, by Mitra, Nalisnick, Craswell and Caruana (2016). It relies on an artificial intelligence designed to extract language structures and patterns (Word2vec), which makes it aware of context, synonyms, common typos, and words frequently used together.
The language model is trained on 6,409,272 sentences from 239,750 photography-related pages in 2 languages, including HTML webpages, PDF documents and 3 authoritative books on color (Hunt, Fairchild and Kirk, see sources), for a total of 95,659,646 words.
Chantal speaks Dumbrish, a synthetic language constructed by stemming French and English in a way that leads both languages to produce the same words, which helps merge them. This makes it possible for French and English posts about the same topic to be represented by similar keywords once translated into Dumbrish.
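Chantal’s actual stemmer is not reproduced here; the following toy sketch (with a made-up suffix list) only illustrates the idea of stemming French and English toward a shared stem, so that cognates from both languages collapse to the same Dumbrish word:

```python
# Toy illustration, NOT Chantal's real stemmer: strip common French and
# English suffixes so cognates from both languages share one stem.
SUFFIXES = ["ation", "ement", "ness", "ing", "ity", "ité", "e", "s"]

def toy_stem(word: str) -> str:
    word = word.lower()
    # Try longest suffixes first; keep at least a 3-letter stem.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# English "luminosity" and French "luminosité" merge once stemmed.
print(toy_stem("luminosity"), toy_stem("luminosité"))
# → luminos luminos
```

The real stemmer has to be far more careful than this (irregular plurals, accents, false friends), but the merging principle is the same.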
When a word is unknown to Chantal, it tries to find the closest known word in Dumbrish by checking letter permutations, deletions and additions, using Peter Norvig’s spelling-correction algorithm.
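The candidate-generation step of Norvig’s algorithm can be sketched as follows; the tiny Dumbrish vocabulary below is a made-up stand-in for the one built from the corpus:

```python
import string

def edits1(word: str) -> set:
    """All strings one edit away from `word`: deletes, transposes,
    replaces and inserts (Norvig's candidate generator)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Hypothetical Dumbrish vocabulary (the real one comes from the corpus).
known = {"satur", "chroma", "lumin"}

def correct(word: str) -> str:
    # Keep only candidates that are known words; fall back to the input.
    candidates = known & edits1(word) or {word}
    return sorted(candidates)[0]

print(correct("saturr"))
# → satur
```

Norvig’s full version also generates distance-2 edits and picks the most frequent candidate rather than an arbitrary one; this sketch keeps only the core idea.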
Internally, this AI represents words as coordinates (vectors) in a 300-dimension space where commonly-associated words lie close to each other. Using this property, we compute vector distances between documents and a search query and return the closest matches.
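A minimal sketch of that ranking idea, using invented 3-dimensional vectors in place of the real 300-dimensional embeddings: each document is summarized by the centroid of its word vectors, and documents are ranked by cosine similarity to the query.

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, ~0 for unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    # Average the word vectors of a document into one representative vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy 3-d embeddings; the numbers are invented for illustration.
embeddings = {
    "saturation": [0.9, 0.1, 0.0],
    "chroma":     [0.8, 0.2, 0.1],
    "shutter":    [0.0, 0.9, 0.4],
}

query = embeddings["saturation"]
doc_about_color = centroid([embeddings["saturation"], embeddings["chroma"]])
doc_about_exposure = centroid([embeddings["shutter"]])

print(cosine(query, doc_about_color) > cosine(query, doc_about_exposure))
# → True
```

Because “chroma” sits near “saturation” in the space, a document that never contains the literal word “saturation” can still rank high for that query.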
Because the language model is aware of context and synonyms, the search engine works halfway between a conventional keyword search engine and a similar-content recommendation algorithm, which means it will be able to find relevant documents even if you don’t know the exact keywords (out of the proper technical vocabulary) to search for.
For example, querying “saturation” will also return results where “brightness” and “chroma” are found, and where “saturation” itself does not necessarily appear, just because those words are often found together in the training corpus and people often misuse one for the other, so the AI mapped them as neighbours.
And because the similar-content recommendation behaviour can sometimes get in the way, Chantal also offers a direct grep-like search supporting regular expressions.
Backstory
Chantal is part of an ongoing project (Virtual Secretary) meant to reduce the amount of emails and communications I need to process, by removing redundant communications in which I need to repeat what I already wrote somewhere on the internet, sometimes several times and in several languages. In addition to being a mind-numbing and frustrating task, the amount of it becomes more overwhelming every year.
Over the past 5 years, I have answered questions on various websites and forums that are now pretty much lost in the internet limbo and, as a result, get asked again. Truth is, forums and websites alike do a very poor job of making content indexable and queryable in the future, let alone making sourced and reliable info stand out from uneducated opinions (trying to help is still not a skill). All that work would have been a dead loss without Chantal.
You will find a handful of guys like me on Reddit and forums, repeating the same things in time loops to people who didn’t find the previous answers, and getting more frustrated and cranky each time. That makes these internet places unwelcoming to newbies, who then flee to social platforms where people are more patient but the information is third-hand and even less reliable. Unreliable information, in the context of image processing, may not kill people, but it doesn’t produce the results you expect once put into action. That’s a loss of everybody’s time and CPU cycles.
On the other hand, asking newbies and beginners to do their research before asking anything relies on the assumption that they know the exact technical terms to query, which they don’t, because they are beginners. But keyword-based search engines can’t do better, so these people have no entry point into the matrix. Thus they ask for help wherever they find humans. But since the internet is built on a “handful providing vs. masses needing” paradigm, it ends up with the handful performing repetitive tasks for the masses, usually for free, and getting backlash when they snap at people. Repetitive tasks are the realm of computers and automata; they wear humans down.
The final issue is that all-purpose search engines use indirect metrics of relevance (popularity, compliance with web standards and technologies, time spent on content), which don’t account for the actual reliability of the information. The Covid pandemic has shown well enough the limits of machine detection when it comes to choosing reliable sources of information. In technical matters, where technical vocabulary may also exist in common language with different meanings, all-purpose search engines have a hard time guessing what is meant beyond the literal keyword.
The present work aims to solve all of that at once, with a search engine that works beyond direct keywords and uses hand-picked sources only.