The Dumbrish language

31 May, 2023

Dumbrish stands for Dumb French-English, because dumb is all AI understands. The suggested keywords are written in Dumbrish, so don’t be surprised if they read weird.

It is a totally made-up language for the purpose of machine learning, and it is designed to be programmatically translatable from natural English and natural French (using regex). It aims at merging French and English languages by reducing words to similar roots (or stems) for both languages. The intuition behind it stems from the fact that many English and French words share the same root (labeur/labor, profil/profile, décider/decide) or even exist as-is in both languages (action, classification, film, catalogue, important).

graph TD;
    subgraph "Natural language"
      A_3[acte];
      A_1[actionné];
      A_2[activity];
      A_4[acteur];
      A_5[acting];
      A_6[activated];
    end

    A_3----->E_1;
    A_1-->B_1(actionne);
    B_1-->C_1(actione);
    C_1-->D_1(action);
    D_1-->E_1;
    A_2---->B_2(activ);
    B_2-->E_1;
    A_4----->E_1;
    A_5----->E_1;
    A_6-->B_6(activat);
    B_6-->C_6(activ);
    C_6-->E_1

    subgraph "Dumbrish"
      E_1[act];
    end

How the Dumbrish language is constructed

The Dumbrish root can be used as a cross-language token representing the idea or concept, acting as a basic automatic translation. There are caveats, though. “Marketing” and “markets” don’t reference the same concept, for example, even though they share the same stem. Cutting suffixes loses some meaning, and cutting all of them would lose too much accuracy or create conflicts: removing -al would turn “analytical” into “analytic” but “appeal” into “ape”. For this reason, only a limited set of suffixes is removed, so “actionnable” will not be stemmed as “act” in Dumbrish.
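To make the -al conflict concrete, here is a minimal sketch of what such a rule would do. The function `strip_al` is hypothetical and is not part of Dumbrish. It drops a trailing -al and then collapses doubled consonants, as Dumbrish does everywhere.

```python
import re


def strip_al(word: str) -> str:
    """Hypothetical rule (NOT part of Dumbrish): drop a trailing -al."""
    stem = re.sub(r"al$", "", word)
    # Dumbrish forbids doubled consonants, so "appe" collapses to "ape".
    return re.sub(r"([bcdfghjklmnpqrstvwxz])\1+", r"\1", stem)


for word in ("analytical", "appeal"):
    print(word, "->", strip_al(word))
# analytical -> analytic
# appeal -> ape
```

The first result is harmless, but the second collides with an unrelated word, which is the kind of conflict that keeps a suffix out of the list.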

But since Dumbrish is only designed as a first processing step to help an unsupervised machine-learning word embedding, these discrepancies don’t matter as much as you would think in the end. Word embedding extracts word patterns from documents, which means it looks at words within their surrounding context. If we fed it enough common-language documents, it could actually infer by itself that “action”, “actor” and “actrice” are closely related. The problem is that we process technical documents written with jargon, and we don’t have millions of documents. This “manual” prefiltering gives the neural network some generalization hints that create anchor points between languages.

The rules of Dumbrish are simple:

Consonants can’t be doubled. For example, “professional” in English translates to “professionnel” in French. Such irregular doubling is a problem, so we force single consonants everywhere in Dumbrish.
Accents are stripped from characters, except for “à” (which has a grammatical function in French). People often misspell accents (and some don’t use them at all), and they only change the phonetic sound (except for “à”), so there is no point in keeping them.
Dumbrish uses the American spelling for -our words: British spelling (“behaviour”, “colour”) is transformed to American spelling (“behavior”, “color”) to match Latin words (“color”) and be consistent with French (“coloriste”, “coloriser”).
Plural and feminine forms in -s, -e and -ses are removed. “Lens” and “Lense” become “len”, “les” becomes “le”.
The following word suffixes are removed in order:
- -ity, -ité, -ite (quantity, quantité, activity, activité)
- -tor, -teur, -trice, replaced by -t (aviator, aviateur, aviatrice)
- -ment, -ement (management, immédiatement, indictment)
- -ing (being, acting)
- -ed (managed, relieved)
- -sion, -tion, replaced by -s and -t respectively (action, commission)
- -isme, -ism, -iste, -ist (socialist, socialism)
- -at (reliquat, predicat, postulat)
- -tif, -tiv, -tive, replaced by -t (active, actif)
- -y, replaced by -i (apply, really)
- -er (installer, chanter, her)
- -iz, -ize, -yze, replaced by -is (analyze, optimize, size)
- -ique, -ic, replaced by -i (politique, arithmétique)
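The rules above can be sketched as a small regex pipeline. This is only an illustration, not the actual implementation: the function names, the exact patterns and the way the rules are interleaved are assumptions. The order used here is accents, doubled consonants, -our, plural/feminine endings, then each suffix rule once, in the listed order.

```python
import re
import unicodedata

# Suffix rules from the list above: (pattern, replacement), applied in order.
SUFFIX_RULES = [
    (r"(ity|ite)$", ""),            # -ité is already "ite" once accents are gone
    (r"(teur|trice|tor)$", "t"),
    (r"(ement|ment)$", ""),
    (r"ing$", ""),
    (r"ed$", ""),
    (r"sion$", "s"),
    (r"tion$", "t"),
    (r"(isme|iste|ism|ist)$", ""),
    (r"at$", ""),
    (r"(tive|tif|tiv)$", "t"),
    (r"y$", "i"),
    (r"er$", ""),
    (r"(ize|yze|iz)$", "is"),
    (r"(ique|ic)$", "i"),
]


def strip_accents(word):
    """Drop diacritics from every character except "à"."""
    out = []
    for ch in unicodedata.normalize("NFC", word):
        if ch == "à":
            out.append(ch)
        else:
            base = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in base if not unicodedata.combining(c)))
    return "".join(out)


def to_dumbrish(word):
    """Assumed rule order; the real parser may interleave the rules differently."""
    word = strip_accents(word.lower())
    word = re.sub(r"([bcdfghjklmnpqrstvwxz])\1+", r"\1", word)  # single consonants
    word = re.sub(r"our$", "or", word)                          # colour -> color
    word = re.sub(r"(ses|s|e)$", "", word)                      # plural / feminine
    for pattern, replacement in SUFFIX_RULES:
        word = re.sub(pattern, replacement, word)
    return word


print([to_dumbrish(w) for w in ("actionné", "activity", "activated", "acteur")])
# ['act', 'act', 'act', 'act']
print(" ".join(to_dumbrish(w) for w in "removing colour cast".split()))
# remov color cast
```

With this particular order, a final -e is removed before the suffix rules run, so some French forms (those in -trice or -ique, for instance) will not reduce the way the list suggests. A real implementation would need a more careful ordering.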

The conversion from natural English and French to Dumbrish is done automatically, so you don’t have anything specific to do. However, the suggested keywords are written in Dumbrish, so it might be useful to understand where they come from in order to use them.

The following graph shows where Dumbrish fits in the whole search engine:

graph TD;
    subgraph "Documents to index"
      A_1[Doc 1: camera sensors];
      A_2[Doc 2: removing colour cast];
    end

    subgraph "User query"
      Q[camera ISO noise];
    end

    subgraph "Dumbrish translation"
      A_1--regex parser-->B_1(camera sensor);
      A_2--regex parser-->B_2(remov color cast);
      Q--regex parser-->Q_1(camera iso nois);
    end

    subgraph "Vector representation"
      B_1--AI language model-->C_1("(-0.03,  0.32)");
      B_2--AI language model-->C_2("(-0.08 , -0.26)");
      Q_1--AI language model-->Q_2("(0.97, 0.24)")
    end

    subgraph "Vector distance"
      C_1-->D_1((0.85));
      C_2-->D_2((1.50));
      Q_2-->D_1;
      Q_2-->D_2;
    end

    D_1-->Yes[Closest answer];
    D_2-->No[Farthest answer];
    Yes--->R;

    subgraph "Result"
      Q--->R[Doc 1];
    end

How similarity between documents is computed
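The graph does not say which distance is used. As a sketch, assuming it is the cosine distance (1 minus the cosine similarity), the example vectors from the graph give values close to the ones shown there. The function name `cosine_distance` is illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# 2D vectors copied from the graph above.
query = (0.97, 0.24)
docs = {"Doc 1": (-0.03, 0.32), "Doc 2": (-0.08, -0.26)}

distances = {name: cosine_distance(query, vec) for name, vec in docs.items()}
best = min(distances, key=distances.get)

for name, dist in distances.items():
    print(name, round(dist, 2))
print("Closest answer:", best)
```

Doc 1 comes out at roughly 0.85 and Doc 2 at roughly 1.5, so Doc 1 is returned as the closest answer.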

Below, we can see the 2D projection of the vectors inferred by the AI language model for the keywords from the above example (camera, sensor, color, cast, iso, remove) and the 6 closest keywords. I also added more “noisy” keywords to see where they land: gamut, vibrant, noise, highlight, shadow, social, politics and france.

2D projection (t-distributed Stochastic Neighbor Embedding) of the 400D embedding vectors from the Chantal v1.0 Word2Vec CBOW model

In red, we show the full query (treated as a document), represented as a vector by the centroid of the vectors of its keywords. In black, we have the 5 words found closest to the full query (topics). In green, we have the individual keywords composing the query, and for each of them, in blue, their 5 closest keywords.
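The centroid representation and the search for closest words can be sketched as follows. The 2D vectors are made up for illustration (the real model uses 400 dimensions), the helper names are hypothetical, and ranking by cosine similarity is an assumption.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def closest(target, embeddings, k=5, exclude=()):
    """Return the k words whose vectors are most similar to target."""
    scored = [
        (cosine_similarity(target, vec), word)
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Made-up toy embeddings, NOT taken from the real model.
toy = {
    "camera": [1.0, 0.0],
    "sensor": [0.9, 0.1],
    "iso": [0.8, 0.6],
    "nois": [0.6, 0.8],
    "politic": [-1.0, 0.2],
}

query_words = ["camera", "iso", "nois"]
query_vector = centroid([toy[w] for w in query_words])  # the "red" point
topics = closest(query_vector, toy, k=1, exclude=query_words)
print(topics)
# ['sensor']
```

The same `closest` call, applied to each query keyword instead of the centroid, would give the per-keyword neighbors shown in blue.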

This shows that the AI correctly identified:

photo, photograph and photographie/photography as close synonyms,
midtones as an alternative spelling for mid-tones,
various color spaces with gamut,
vibrant as a synonym of colorful,
BSI as an alias of backside-illuminated,
APS-C, full-frame, mirrorless, DSLR, point-and-shoot as synonyms of camera,
remove as an alias for removal.

However:

it failed to identify the relationship between noise and ISO through high-ISO (that was correctly associated with noise, but not with ISO),
it identified remove as being close to add while they are semantic opposites, most likely because we find them in the same context.