The two fuzzy search methods use meta-tokens (see below) to make searches more general, have built-in spell-checking and use the Dumbrish machine language. They focus on natural language and cannot retrieve specific numbers, hashes, usernames, etc. Each keyword in the query is treated as a separate entity, and results are ranked by descending relevance (ascending distance) between the pages and the query.
The grep-like search uses no meta-tokens and treats the whole query as a single entity (string or regular expression). It ranks results by descending number of matches of the query in the page content. For example, the query CIE [0-9]{4} will target all occurrences of CIE color space names (CIE 1931, CIE 1964, CIE 1976, CIE 2006, CIE 2012, etc.). Learn how to use regex to search for text.
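
The ranking rule of the grep-like search can be illustrated with a minimal sketch. It is not the search engine's actual code: the function name `rank_by_match_count` and the sample pages are made up for the example, and it assumes that pages with no match are simply left out of the results.

```python
import re

def rank_by_match_count(pattern, pages):
    """Rank pages by descending number of matches of the whole query.

    `pages` maps a (hypothetical) page identifier to its text content.
    Pages without any match are dropped, which is an assumption.
    """
    regex = re.compile(pattern)
    counts = {name: len(regex.findall(text)) for name, text in pages.items()}
    hits = [(name, n) for name, n in counts.items() if n > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Made-up sample pages, for illustration only
pages = {
    "page-a": "CIE 1931 XYZ predates CIE 1976 Lab and CIE 2006 LMS.",
    "page-b": "The CIE 1964 observer uses a 10 degree field.",
    "page-c": "No color space is mentioned here.",
}

ranking = rank_by_match_count(r"CIE [0-9]{4}", pages)
print(ranking)  # [('page-a', 3), ('page-b', 1)]
```

The whole query is compiled as one regular expression, so the space between CIE and the four digits is part of the pattern rather than a keyword separator.
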
Numbers, prices, dates and common filenames have been explicitly removed from the AI training corpus to keep only natural language from the end-user perspective. They are replaced by meta-tokens for generalization. You can use these meta-tokens to abstract your search and target a type of information, regardless of its particular value.
Meta-tokens are case-insensitive; they only need to be enclosed between underscores. Only full words are replaced by meta-tokens, meaning words preceded by a whitespace, a parenthesis or a bracket and followed by a whitespace, a parenthesis, a bracket or any sentence punctuation.
| Meta-token | Selects | Selected formats |
|---|---|---|
| _DATE_ | Most common date formats, including ISO 8601, or textual day/month in French and in English | 2023-08-01 (year-month-day), 2023/08/10, March 25, 25 mars, 25 mars 2021, 2021 March 25 |
| _TIME_ | Most common time formats, including ISO 8601 or textual hours. The previous date meta-token takes precedence over this one if a word contains an ISO 8601 datetime. | 12h15, 12:15, 12:15:00, 6:00, 12am, 12 am, 12 h, 6 h, 12:15:00Z, 12:15:00+01, 12:15:00 UTC+1 |
| _URL_ | Any word resembling an internet address | http://domain.ext, https://domain.ext, https://subdomain.domain.ext/page, //domain.ext, http://domain.ext/?search=query&sort=asc |
| _IP_ | Any word resembling an IPv4 or IPv6 address | 2001:0db8:0000:85a3:0000:0000:ac1f:8001, 239.255.255.255 |
| _HASH_ | Any word resembling a hexadecimal cryptographic hash of more than 8 characters | a6fde8c2 |
| _BASE64_ | Any word resembling base64-encoded strings | |
| _CODEFILE_ | Any word resembling a programming file format | *.c, *.h, *.sh, *.py, *.php, *.css, *.js, etc. |
| _IMAGEFILE_ | Any word resembling an image file format (including SVG, raster images, HDR and RAW photographs) | *.jpg, *.png, *.tif, *.raw, *.cr2, *.cr3, etc. |
| _DOCUMENTFILE_ | Any word resembling an office or image document file format | *.odt, *.xls, *.doc, *.ppt (Libre Office, MS Office, Star Office), *.kra, *.xcf, *.psd (Krita, Gimp, Photoshop, InDesign, etc.), *.pdf, *.ps, *.eps (PostScript, PDF) |
| _TEXTFILE_ | Any word resembling a text file format | *.txt, *.html, *.tex, *.md, *.xmp |
| _ARCHIVEFILE_ | Any word resembling an archive file format | *.zip, *.tar, *.rar, *.gz, *.tar.gz |
| _BINARYFILE_ | Any word resembling an executable file format | *.exe, *.bin, *.dmg, *.appimage, *.deb, *.rpm, *.so, *.so.21 |
| _DATABASEFILE_ | Any word resembling a database file format | *.db, *.sql, *.sqlite |
| _PATH_ | Any word resembling a filesystem path that is not a URL. The URL and file meta-tokens above take precedence over this one, so the root path will be split from the filename if its extension is known. | ~/folder, ~/.folder/, ./folder, C:\\folder\file, /test/file |
| _PRICE_ | Any word resembling a price, with or without number. Will also detect the special character $ in computer code. | $15, 15€, 15.5 €, 5 USD, EUR 5, 12k€, £12K |
| _RESOLUTION_ | Any word resembling a pixel resolution | 800×600, 800x600, 800X600 |
| _FILESIZE_ | Any word resembling a file/memory size in bits, bytes or octets | 16 Gb, 256MB, 1024.5 To, 1 Tio, 1Tib |
| _DISTANCE_ | Any word resembling a numeric (decimal or integer) distance in British or SI units, optionally preceded by an arithmetic sign | 1.5in, 12 inches, 12’, 12', 12 ft, 12.5 feet, 12,000’’, 5 km, 25 µm, 25µm, 25 micrometers, 2,5 cm, 12 m |
| _TEMPERATURE_ | Any word resembling a numeric (decimal or integer) temperature in °C, °F and kelvin, optionally preceded by an arithmetic sign | +2 °C, -5.2°C, 200 K, 250°F, 2.5 degC, 272 kelvin, 25 degree C |
| _ANGLE_ | Any word resembling a numeric (decimal or integer) angle value in degree, radian and steradian, optionally preceded by an arithmetic sign | 25 rad, 2°, 3 deg, 4.5 rads, 5 degrés, ±5° |
| _PIXELS_ | Any word resembling an integer pixel count, optionally preceded by an arithmetic sign | 1200 px, 2500 pixels, 24 Mpix, 12 megapixels |
| _EXPOSURE_ | Any word resembling a numeric (decimal or integer) photographic exposure value, in EV or IL, optionally preceded by an arithmetic sign | 2 EV, -1 IL, 2 EVs, -2.45 EV |
| _SENSIBILITY_ | Any word resembling a numeric (decimal or integer) photographic sensitivity value, in ISO or ASA | ISO200, 800 ISO, 1600 ASA, 250 ISOs |
| _APERTURE_ | Any word resembling a numeric (decimal or integer) photographic diaphragm aperture value, in f/ unit | f/1.4, f2.8, F4 |
| _LUMINANCE_ | Any word resembling a numeric (decimal or integer) luminance value, in nit or Cd/m², optionally preceded by an arithmetic sign | 120 Cd/m², 0.1 Cd/m^2, 300 nit, 1000 nits, +3 Cd/m2 |
| _FREQUENCY_ | Any word resembling a numeric (decimal or integer) frequency value, in multiples of Hz, optionally preceded by an arithmetic sign | 3 MHz, 5 nanohertz, 2.5 ghz |
| _GAIN_ | Any word resembling a numeric (decimal or integer) gain value, in dB, optionally preceded by an arithmetic sign | +3 dB, -2.5 dB, 25 decibel, 5 décibels |
| _WEIGHT_ | Any word resembling a numeric (decimal or integer) mass (weight) in multiples of British or SI units, optionally preceded by an arithmetic sign | 12 lbs, 15.5lb, 200 g, 4,5 kg |
| _ORDINAL_ | Any word containing the letter or digit variant of ordinal numbers | 1st, first, 2nd, n°2, premier, deuxième, 2eme, 2ème |
| _PERCENT_ | Any word containing only a number (decimal or integer) followed by %, optionally preceded by an arithmetic sign | +2 %, 1%, = 3 % |
| _NUMBER_ | Any word containing only a unitless number, possibly with decimal marker | 123456, 12.456, 12,456, 12_45, 12/45, 0-2, .1, 2. |
| _USER_ | Any word resembling a user handle or an email | @me, me@here, me@domain.ext, user1234, user6 |
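
To make the full-word rule more concrete, here is a rough sketch of how a few of the meta-tokens above could be substituted with regular expressions. It is only an illustration under assumptions: the patterns are simplified guesses covering three tokens, the names `LEFT`, `RIGHT`, `RULES` and `abstract` are hypothetical, and the real indexer may proceed differently.

```python
import re

# Word boundaries as described above (simplified assumption):
# a word starts at the beginning of the text or after a whitespace,
# a parenthesis or a bracket...
LEFT = r"(?<![^\s(\[])"
# ...and ends before a whitespace, a closing parenthesis or bracket,
# sentence punctuation, or the end of the text.
RIGHT = r"(?=[\s)\]]|[.,;:!?](?:\s|$)|$)"

# Most specific patterns first, so that _NUMBER_ only catches what is left.
# These patterns are illustrative and much narrower than the table above.
RULES = [
    ("_RESOLUTION_", r"\d+[x×X]\d+"),
    ("_PERCENT_", r"[+-]?\d+(?:[.,]\d+)?\s?%"),
    ("_NUMBER_", r"\d+(?:[.,]\d+)?"),
]

def abstract(text):
    """Replace full words matching a known pattern by their meta-token."""
    for token, pattern in RULES:
        text = re.sub(LEFT + "(?:" + pattern + ")" + RIGHT, token, text)
    return text

print(abstract("Export at 800x600 (about 25 % smaller), in 3 steps."))
# Export at _RESOLUTION_ (about _PERCENT_ smaller), in _NUMBER_ steps.
print(abstract("v2 has 10 items"))
# v2 has _NUMBER_ items
```

In the second call, the digit in v2 is left alone because it is not a full word: it is preceded by a letter rather than by a whitespace, a parenthesis or a bracket.
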
Machine language (code, markup) has been removed from the training corpus too, by deleting the <pre> and <code> tags in HTML documents before indexing them. But since many forum users post computer code displayed as natural language, not using the proper code markup, and since programming languages use English common words as instructions, the natural language model is still polluted by some computer code.
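
As an illustration of that clean-up step, the sketch below drops the content of <pre> and <code> elements with Python's standard `html.parser` module. It is not the indexer's actual code; the class name `NaturalLanguageExtractor` and the function `strip_code` are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class NaturalLanguageExtractor(HTMLParser):
    """Collect text nodes, skipping anything nested in <pre> or <code>."""

    SKIPPED = {"pre", "code"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped elements are currently open
        self.chunks = []  # text kept for indexing

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def strip_code(html):
    """Return the text of `html` without <pre>/<code> content."""
    parser = NaturalLanguageExtractor()
    parser.feed(html)
    parser.close()
    # Join the kept chunks and collapse runs of whitespace
    return " ".join(" ".join(parser.chunks).split())

sample = ("<p>Run <code>make install</code> first.</p>"
          "<pre>import os</pre>"
          "<p>if x: print(x)</p>")
print(strip_code(sample))  # Run first. if x: print(x)
```

The last paragraph of the sample shows the limitation described above: code posted without the proper markup survives the clean-up and ends up in the natural language corpus.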