Advanced search

The two fuzzy search methods use meta-tokens (see below) to make searches more general, have built-in spell-checking, and use the Dumbrish machine language. They focus on natural language and cannot retrieve specific numbers, hashes, usernames, etc. Each keyword in the query is treated as a separate entity, and results are ranked by descending relevance (ascending distance) between the pages and the query.

The grep-like search uses no meta-tokens and treats the whole query as a single entity (a string or a regular expression). It ranks results by descending number of matches of the query in the page content. For example, the query CIE [0-9]{4} will target all occurrences of CIE color space names (CIE 1931, CIE 1964, CIE 1976, CIE 2006, CIE 2012, etc.). Learn how to use regex to search for text.
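
The ranking rule of the grep-like search can be illustrated with a minimal Python sketch. The `pages` corpus and the `grep_rank` function are hypothetical names invented for this example; they only show the idea of counting regex matches per page and sorting by that count, not the actual implementation of the search engine.

```python
import re

# Hypothetical mini-corpus: page title -> page text (invented for illustration).
pages = {
    "colorimetry": "CIE 1931 XYZ predates CIE 1976 Lab; CIE 2006 refines the cone fundamentals.",
    "history": "The CIE 1931 standard observer is still widely used.",
    "unrelated": "The CIE meets regularly to publish recommendations.",
}

def grep_rank(query: str, corpus: dict[str, str]) -> list[tuple[str, int]]:
    """Rank pages by descending number of regex matches, dropping pages with none."""
    pattern = re.compile(query)
    counts = {title: len(pattern.findall(text)) for title, text in corpus.items()}
    hits = [(title, n) for title, n in counts.items() if n > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)

ranking = grep_rank(r"CIE [0-9]{4}", pages)
print(ranking)  # [('colorimetry', 3), ('history', 1)]
```

The page that mentions the CIE without a four-digit year is not returned at all, because the whole query is matched as one pattern.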

Numbers, prices, dates and common filenames have been explicitly removed from the AI training corpus to keep only natural language from the end-user perspective. They are replaced by meta-tokens for generalization. You can use these meta-tokens to abstract your search and target a type of information, regardless of its particular value.

Meta-tokens are case-insensitive; they only need to be enclosed between underscores. Only full words are replaced by meta-tokens, meaning words preceded by whitespace, a parenthesis or a bracket, and followed by whitespace, a parenthesis, a bracket or any sentence punctuation.
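
The full-word rule can be sketched with regular-expression lookarounds in Python. Everything below is an assumption made for illustration: the exact delimiter sets, the acceptance of the start and end of the text as boundaries, the plain-integer pattern, and the upper-casing of meta-tokens typed in a query are not taken from the actual implementation.

```python
import re

# Characters allowed right before and right after a full word (assumed sets).
BEFORE = r"(?<![^\s(\[])"       # start of text, whitespace, "(" or "["
AFTER = r"(?![^\s)\].,;:!?])"   # end of text, whitespace, ")", "]" or punctuation

# Hypothetical pattern for a plain integer, used only to show the boundary rule.
INTEGER = re.compile(BEFORE + r"\d+" + AFTER)
META_TOKEN = re.compile(BEFORE + r"_[A-Za-z0-9]+_" + AFTER)

def tokenize_numbers(text: str) -> str:
    """Replace standalone integers with the _NUMBER_ meta-token."""
    return INTEGER.sub("_NUMBER_", text)

def normalize_meta_tokens(query: str) -> str:
    """Upper-case meta-tokens typed in any case, e.g. _date_ becomes _DATE_."""
    return META_TOKEN.sub(lambda m: m.group(0).upper(), query)

replaced = tokenize_numbers("See page 42 (or 43), not v2 or abc123.")
normalized = normalize_meta_tokens("photos from _date_ at _Time_")
print(replaced)    # See page _NUMBER_ (or _NUMBER_), not v2 or abc123.
print(normalized)  # photos from _DATE_ at _TIME_
```

Digits glued to letters, as in v2 or abc123, are left untouched because they are not full words.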

| Meta-token | Selects | Selected formats |
| --- | --- | --- |
| `_DATE_` | Most common date formats, including ISO 8601, or textual day/month in French and in English | 2023-08-01 (year-month-day), 2023/08/10, March 25, 25 mars, 25 mars 2021, 2021 March 25 |
| `_TIME_` | Most common time formats, including ISO 8601 or textual hours. The previous date meta-token takes precedence over this one if a word contains an ISO 8601 datetime. | 12h15, 12:15, 12:15:00, 6:00, 12am, 12 am, 12 h, 6 h, 12:15:00Z, 12:15:00+01, 12:15:00 UTC+1 |
| `_URL_` | Any word resembling an internet address | http://domain.ext, https://domain.ext, https://subdomain.domain.ext/page, //domain.ext, http://domain.ext/?search=query&sort=asc |
| `_IP_` | Any word resembling an IPv4 or IPv6 address | 2001:0db8:0000:85a3:0000:0000:ac1f:8001, 239.255.255.255 |
| `_HASH_` | Any word resembling a hexadecimal cryptographic hash of more than 8 characters | a6fde8c2 |
| `_BASE64_` | Any word resembling a base64-encoded string | |
| `_CODEFILE_` | Any word resembling a programming file format | *.c, *.h, *.sh, *.py, *.php, *.css, *.js, etc. |
| `_IMAGEFILE_` | Any word resembling an image file format (including SVG, raster images, HDR and RAW photographs) | *.jpg, *.png, *.tif, *.raw, *.cr2, *.cr3, etc. |
| `_DOCUMENTFILE_` | Any word resembling an office or image document file format | *.odt, *.xls, *.doc, *.ppt (Libre Office, MS Office, Star Office), *.kra, *.xcf, *.psd (Krita, Gimp, Photoshop, InDesign, etc.), *.pdf, *.ps, *.eps (PostScript, PDF) |
| `_TEXTFILE_` | Any word resembling a text file format | *.txt, *.html, *.tex, *.md, *.xmp |
| `_ARCHIVEFILE_` | Any word resembling an archive file format | *.zip, *.tar, *.rar, *.gz, *.tar.gz |
| `_BINARYFILE_` | Any word resembling an executable file format | *.exe, *.bin, *.dmg, *.appimage, *.deb, *.rpm, *.so, *.so.21 |
| `_DATABASEFILE_` | Any word resembling a database file format | *.db, *.sql, *.sqlite |
| `_PATH_` | Any word resembling a filesystem path that is not a URL. The URL and file meta-tokens above take precedence over this one, so the root path will be split from the filename if its extension is known. | ~/folder, ~/.folder/, ./folder, C:\\folder\file, /test/file |
| `_PRICE_` | Any word resembling a price, with or without number. Will also detect the special character $ in computer code. | $15, 15€, 15.5 €, 5 USD, EUR 5, 12k€, £12K |
| `_RESOLUTION_` | Any word resembling a pixel resolution | 800×600, 800x600, 800X600 |
| `_FILESIZE_` | Any word resembling a file/memory size in bits, bytes or octets | 16 Gb, 256MB, 1024.5 To, 1 Tio, 1Tib |
| `_DISTANCE_` | Any word resembling a numeric (decimal or integer) distance in British or SI units, optionally preceded by an arithmetic sign | 1.5in, 12 inches, 12’, 12', 12 ft, 12.5 feet, 12,000’’, 5 km, 25 µm, 25µm, 25 micrometers, 2,5 cm, 12 m |
| `_TEMPERATURE_` | Any word resembling a numeric (decimal or integer) temperature in °C, °F or kelvin, optionally preceded by an arithmetic sign | +2 °C, -5.2°C, 200 K, 250°F, 2.5 degC, 272 kelvin, 25 degree C |
| `_ANGLE_` | Any word resembling a numeric (decimal or integer) angle value in degrees, radians or steradians, optionally preceded by an arithmetic sign | 25 rad, 2°, 3 deg, 4.5 rads, 5 degrés, ±5° |
| `_PIXELS_` | Any word resembling an integer pixel count, optionally preceded by an arithmetic sign | 1200 px, 2500 pixels, 24 Mpix, 12 megapixels |
| `_EXPOSURE_` | Any word resembling a numeric (decimal or integer) photographic exposure value, in EV or IL, optionally preceded by an arithmetic sign | 2 EV, -1 IL, 2 EVs, -2.45 EV |
| `_SENSIBILITY_` | Any word resembling a numeric (decimal or integer) photographic sensitivity value, in ISO or ASA | ISO200, 800 ISO, 1600 ASA, 250 ISOs |
| `_APERTURE_` | Any word resembling a numeric (decimal or integer) photographic diaphragm aperture value, in f/ unit | f/1.4, f2.8, F4 |
| `_LUMINANCE_` | Any word resembling a numeric (decimal or integer) luminance value, in nit or Cd/m², optionally preceded by an arithmetic sign | 120 Cd/m², 0.1 Cd/m^2, 300 nit, 1000 nits, +3 Cd/m2 |
| `_FREQUENCY_` | Any word resembling a numeric (decimal or integer) frequency value, in multiples of Hz, optionally preceded by an arithmetic sign | 3 MHz, 5 nanohertz, 2.5 ghz |
| `_GAIN_` | Any word resembling a numeric (decimal or integer) gain value, in dB, optionally preceded by an arithmetic sign | +3 dB, -2.5 dB, 25 decibel, 5 décibels |
| `_WEIGHT_` | Any word resembling a numeric (decimal or integer) mass (weight) in multiples of British or SI units, optionally preceded by an arithmetic sign | 12 lbs, 15.5lb, 200 g, 4,5 kg |
| `_ORDINAL_` | Any word containing the letter or digit variant of ordinal numbers | 1st, first, 2nd, n°2, premier, deuxième, 2eme, 2ème |
| `_PERCENT_` | Any word containing only a number (decimal or integer) followed by %, optionally preceded by an arithmetic sign | +2 %, 1%, = 3 % |
| `_NUMBER_` | Any word containing only a unitless number, possibly with decimal marker | 123456, 12.456, 12,456, 12_45, 12/45, 0-2, .1, 2. |
| `_USER_` | Any word resembling a user handle or an email | @me, me@here, me@domain.ext, user1234, user6 |
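
The precedence between meta-tokens (date over time, URL over path) can be pictured as an ordered list of rules where the first match wins. The Python sketch below is only an illustration under strong assumptions: the `RULES` list, its heavily simplified patterns and the `to_meta_tokens` function are hypothetical, and words are split on whitespace only, ignoring the parenthesis and punctuation delimiters.

```python
import re

# Simplified, hypothetical patterns: the real ones cover many more formats.
# Order matters: the first rule that fully matches a word wins.
RULES = [
    ("_DATE_", re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2})?)?")),
    ("_TIME_", re.compile(r"\d{1,2}:\d{2}(:\d{2})?(Z|[+-]\d{2})?")),
    ("_URL_", re.compile(r"(https?:)?//\S+")),
    ("_PATH_", re.compile(r"(~|\.)?/\S*")),
    ("_NUMBER_", re.compile(r"\d+([.,]\d+)?")),
]

def to_meta_tokens(text: str) -> str:
    """Replace each whitespace-separated word by the first meta-token matching it entirely."""
    out = []
    for word in text.split():
        for token, pattern in RULES:
            if pattern.fullmatch(word):
                out.append(token)
                break
        else:
            # No rule matched: keep the natural-language word as-is.
            out.append(word)
    return " ".join(out)

result = to_meta_tokens(
    "backup 2023-08-01T12:15:00Z at 12:15 from //domain.ext/page to ~/folder took 42 s"
)
print(result)  # backup _DATE_ at _TIME_ from _URL_ to _PATH_ took _NUMBER_ s
```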

Machine language (code, markup) has also been removed from the training corpus, by deleting the <pre> and <code> tags in HTML documents before indexing them. However, many forum users post computer code as plain text, without the proper code markup, and programming languages use common English words as instructions, so the natural language model is still polluted by some computer code.
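
As a rough illustration of this clean-up step, the Python sketch below drops everything inside <pre> and <code> elements using the standard-library HTML parser. The `ProseExtractor` class and the `strip_code` function are hypothetical names, and the actual indexing pipeline may work differently. The sample also shows why some code survives: code posted in a plain paragraph is kept.

```python
from html.parser import HTMLParser

class ProseExtractor(HTMLParser):
    """Collect text content while skipping anything inside <pre> or <code>."""

    SKIPPED = {"pre", "code"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0                 # nesting level inside skipped tags
        self.chunks: list[str] = []    # text kept for indexing

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def strip_code(html: str) -> str:
    """Return the text of an HTML fragment without its <pre> and <code> content."""
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace left behind by the removed elements.
    return " ".join(" ".join(parser.chunks).split())

sample = "<p>Try this:</p><pre><code>ls *.raw</code></pre><p>for f in files: print(f)</p>"
cleaned = strip_code(sample)
print(cleaned)  # Try this: for f in files: print(f)
```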