Dumbrish stands for Dumb French-English, because dumb is all AI understands. The suggested keywords are written in Dumbrish, so don’t be surprised if they read weird.
It is a totally made-up language built for machine learning, and it is designed to be programmatically translatable from natural English and natural French (using regex). It aims to merge French and English by reducing words to roots (or stems) that are similar in both languages. The intuition behind it is that many English and French words share the same root (labeur/labor, profil/profile, décider/decide) or even exist as-is in both languages (action, classification, film, catalogue, important).
graph TD;
subgraph "Natural language"
A_3[acte];
A_1[actionné];
A_2[activity];
A_4[acteur];
A_5[acting];
A_6[activated];
end
A_3----->E_1;
A_1-->B_1(actionne);
B_1-->C_1(actione);
C_1-->D_1(action);
D_1-->E_1;
A_2---->B_2(activ);
B_2-->E_1;
A_4----->E_1;
A_5----->E_1;
A_6-->B_6(activat);
B_6-->C_6(activ);
C_6-->E_1
subgraph "Dumbrish"
E_1[act];
end
The Dumbrish root can be used as a cross-language token representing the idea or concept, acting as a basic automatic translation. There are caveats, though. “Marketing” and “markets” don’t refer to the same concept, for example, even though they share the same stem: cutting suffixes loses some meaning. Cutting every possible suffix would lose too much accuracy or create conflicts: removing -al would turn “analytical” into “analytic”, but “appeal” into “ape”. For this reason, only some suffixes are removed, and “actionnable” will not be stemmed as “act” in Dumbrish.
But since Dumbrish is only designed as a first processing step to help an unsupervised machine-learning word embedding, these discrepancies don’t matter as much as you would think in the end. Word embedding extracts word patterns from documents, which means it looks at words within their context. If we fed it enough common-language documents, it could actually infer by itself that “action”, “actor” and “actrice” are closely related. The problem is that we process technical documents written in jargon, and we don’t have millions of documents. This “manual” prefiltering gives the neural network some generalization hints that create anchor points between languages.
The rules of Dumbrish are simple:
- Consonants can’t be doubled. For example, “professional” in English translates to “professionnel” in French, so consonants that are doubled in one language but not in the other are a problem. Therefore, we force single consonants everywhere in Dumbrish.
- Accented characters are stripped of their accents, except “à” (which has a grammatical function in French). People often misspell them (and some don’t use them at all), and they only change the pronunciation (except for “à”), so there is no point in keeping them.
- Dumbrish uses the American spelling for words that end in -our in British English: “behaviour” and “colour” are transformed to “behavior” and “color”, to match Latin words (“color”) and be consistent with French (“coloriste”, “coloriser”).
- Plural and feminine forms in -s, -e and -ses are removed. “Lens” and “Lense” become “len”, “les” becomes “le”.
- The following word suffixes are removed in order:
  - -ity, -ité, -ite (quantity, quantité, activity, activité)
  - -tor, -teur, -trice, replaced by -t (aviator, aviateur, aviatrice)
  - -ment, -ement (management, immédiatement, endictment)
  - -ing (being, acting)
  - -ed (managed, relieved)
  - -sion, -tion, replaced by -s and -t respectively (action, commission)
  - -isme, -ism, -iste, -ist (socialist, socialism)
  - -at (reliquat, predicat, postulat)
  - -tif, -tiv, -tive, replaced by -t (active, actif)
  - -y, replaced by -i (apply, really)
  - -er (installer, chanter, her)
  - -iz, -ize, -yze, replaced by -is (analyze, optimize, size)
  - -ique, -ic, replaced by -i (politique, arithmétique)
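As an illustration, the rules above can be sketched with a handful of regular expressions in Python. This is only a minimal sketch under assumptions, not the actual parser: the function names (`strip_accents`, `dumbrish_word`, `dumbrish`), the exact order of the steps, the choice to apply each rule once, and the guard that leaves short words such as “pour” or “four” untouched by the -our rule are all hypothetical. It reproduces the examples shown in the two graphs of this page, but it may not match the real parser on every word of the list above.

```python
import re
import unicodedata

# Suffix rules in the order of the list above: (pattern, replacement).
SUFFIX_RULES = [
    (r"(ity|ite)$", ""),
    (r"(tor|teur|trice)$", "t"),
    (r"(ement|ment)$", ""),
    (r"ing$", ""),
    (r"ed$", ""),
    (r"sion$", "s"),
    (r"tion$", "t"),
    (r"(isme|ism|iste|ist)$", ""),
    (r"at$", ""),
    (r"(tif|tive|tiv)$", "t"),
    (r"y$", "i"),
    (r"er$", ""),
    (r"(ize|yze|iz)$", "is"),
    (r"(ique|ic)$", "i"),
]

# A consonant followed by one or more copies of itself.
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1+")


def strip_accents(word):
    """Drop diacritics from every character except 'à'."""
    kept = []
    for char in word:
        if char == "à":
            kept.append(char)
            continue
        decomposed = unicodedata.normalize("NFD", char)
        kept.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(kept)


def dumbrish_word(word):
    """Apply each rule once, in an assumed order, to a single word."""
    word = strip_accents(word.lower())
    word = DOUBLE_CONSONANT.sub(r"\1", word)
    # British -our to American -or; the two-letter guard is an assumption.
    word = re.sub(r"(?<=[a-z]{2})our(s?)$", r"or\1", word)
    # Plural and feminine endings, removed once.
    word = re.sub(r"(ses|s|e)$", "", word)
    for pattern, replacement in SUFFIX_RULES:
        word = re.sub(pattern, replacement, word)
    return word


def dumbrish(text):
    """Translate a whole sentence, keeping only non-empty letter tokens."""
    text = unicodedata.normalize("NFC", text)
    words = (dumbrish_word(w) for w in re.findall(r"[^\W\d_]+", text))
    return " ".join(w for w in words if w)


print(dumbrish("removing colour cast"))
print(dumbrish("acte actionné activity acteur acting activated"))
```

Applying every rule once, in sequence, is what lets “activated” go through “activat” and “activ” before reaching “act”, as in the first graph.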
The conversion from natural English and French to Dumbrish is done automatically, so you don’t have anything specific to do. However, the suggested keywords are written in Dumbrish, so it helps to understand where they come from in order to use them.
The following graph shows how Dumbrish fits into the whole search engine:
graph TD;
subgraph "Documents to index"
A_1[Doc 1: camera sensors];
A_2[Doc 2: removing colour cast];
end
subgraph "User query"
Q[camera ISO noise];
end
subgraph "Dumbrish translation"
A_1--regex parser-->B_1(camera sensor);
A_2--regex parser-->B_2(remov color cast);
Q--regex parser-->Q_1(camera iso nois);
end
subgraph "Vector representation"
B_1--AI language model-->C_1("(-0.03, 0.32)");
B_2--AI language model-->C_2("(-0.08 , -0.26)");
Q_1--AI language model-->Q_2("(0.97, 0.24)")
end
subgraph "Vector distance"
C_1-->D_1((0.85));
C_2-->D_2((1.50));
Q_2-->D_1;
Q_2-->D_2;
end
D_1-->Yes[Closest answer];
D_2-->No[Farthest answer];
Yes--->R;
subgraph "Result"
Q--->R[Doc 1];
end
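The “Vector distance” step of the graph above can be illustrated in a few lines of Python. The metric is not named here, so using the cosine distance is an assumption: it happens to give values close to the ones in the graph (about 0.85 for Doc 1 and about 1.5 for Doc 2) when applied to the vectors shown. The function name `cosine_distance` is only illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Vectors copied from the graph above.
docs = {"Doc 1": (-0.03, 0.32), "Doc 2": (-0.08, -0.26)}
query = (0.97, 0.24)

distances = {name: cosine_distance(query, vec) for name, vec in docs.items()}
best = min(distances, key=distances.get)

for name, dist in distances.items():
    print(name, round(dist, 2))
print("Closest answer:", best)
```

The document with the smallest distance to the query is returned first, which is Doc 1 in this example.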
Below, we can see the 2D projection of the vectors inferred by the AI language model for the keywords from the above example (camera, sensor, color, cast, iso, remove), and the 6 closest keywords. I also added more “noisy” keywords to see how they behave: gamut, vibrant, noise, highlight, shadow, social, politics and france.
In red, we show the full query (treated as a document), represented as a vector by the centroid of the vectors of its keywords. In black, we have the 5 words found closest to the full query (topics). In green, we have the individual keywords composing the query, and for each of them, in blue, their 5 closest keywords.
This shows that the AI correctly identified:
- photo, photograph and photographie/photography as close synonyms,
- midtones as an alternative spelling for mid-tones,
- various color spaces with gamut,
- vibrant as a synonym of colorful,
- BSI as an alias of backside-illuminated,
- APS-C, full-frame, mirrorless, DSLR, point-and-shoot as synonyms of camera,
- remove as an alias for removal.
However:
- it failed to identify the relationship between noise and ISO through high-ISO (which was correctly associated with noise, but not with ISO),
- it identified remove as being close to add although they are semantic opposites, most likely because they appear in the same contexts.
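The way the full query is represented, as the centroid of the vectors of its keywords, and the lookup of its closest keywords can be sketched as follows. The 2D vectors below are invented for illustration and are not the ones inferred by the language model. The cosine distance and the names `EMBEDDINGS`, `centroid` and `closest_keywords` are assumptions as well.

```python
import math

# Hypothetical 2D embeddings of Dumbrish keywords, invented for illustration.
EMBEDDINGS = {
    "camera": (0.9, 0.1),
    "iso": (0.8, 0.3),
    "nois": (0.7, 0.5),
    "sensor": (0.85, 0.2),
    "color": (-0.2, 0.9),
    "social": (-0.9, -0.3),
}


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    count = len(vectors)
    return tuple(sum(axis) / count for axis in zip(*vectors))


def closest_keywords(vector, k):
    """Return the k keywords whose vectors are nearest to the given vector."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda word: cosine_distance(vector, EMBEDDINGS[word]),
    )
    return ranked[:k]


query_keywords = ["camera", "iso", "nois"]
query_vector = centroid([EMBEDDINGS[word] for word in query_keywords])
topics = closest_keywords(query_vector, 3)

print(query_vector)
print(topics)
```

Because the centroid averages the keywords, a query is pulled toward the region of the vector space where most of its keywords live, and an isolated off-topic keyword has a limited effect.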