Module:data consistency check/documentation

This is the documentation page for Module:data consistency check

This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.

Output

Discrepancies detected:

Module:etymology languages/code to canonical name

goh-lng, the code for the canonical name Lombardic, is wrong; it should be lng.

Module:etymology languages/data

The data key alias_codes for ??? (lng) is invalid.
ภาษามคธ (pra-mag) has a canonical name that is not unique; it is also used by the code mag.
The data key preprocess_links for ??? (th-new) is invalid.

Module:families/canonical names

The code ira-mid and the canonical name อิเรเนียนกลาง should be removed; they are not found in Module:families/data.
The code ira-old and the canonical name อิเรเนียนเก่า should be removed; they are not found in Module:families/data.

Module:families/code to canonical name

The code ira-mid and the canonical name อิเรเนียนกลาง should be removed; they are not found in Module:families/data.
The code ira-old and the canonical name อิเรเนียนเก่า should be removed; they are not found in Module:families/data.

Module:languages/data/3/s

ภาษาสันถาลี (sat) has override_translit set, but no transliteration module

Module:scripts/by name

Chisoi (Chis) is missing
ดิเวส อกุรุ (Diak) is missing
Dhives Akuru, the canonical name for the code Diak, is wrong; it should be ดิเวส อกุรุ.
Garay (Gara) is missing
Khema (Gukh) is missing
Pahawh Hmong (Hmng) is missing
ม้ง, the canonical name for the code Hmng, is wrong; it should be Pahawh Hmong.
IPAchar, the code for the canonical name สัทอักษรสากล, is wrong; it should be Ipach.
กวิ (Kawi) is missing
Kawi, the canonical name for the code Kawi, is wrong; it should be กวิ.
Kirat Rai (Krai) is missing
มโร (Mroo) is missing
Mro, the canonical name for the code Mroo, is wrong; it should be มโร.
Ol Onal (Onao) is missing
อุสมาน (Osma) is missing
Osmanya, the canonical name for the code Osma, is wrong; it should be อุสมาน.
Sidetic (Sidt) is missing
Khudawadi, the canonical name for the code Sind, is wrong; it should be คุดาบาด.
คุดาบาด (Sind) is missing
Sunuwar (Sunu) is missing
Lai Tay (Tayo) is missing
Todhri (Todr) is missing
Tolong Siki (Tols) is missing
Tigalari (Tutg) is missing
วรังจิติ (Wara) is missing
Varang Kshiti, the canonical name for the code Wara, is wrong; it should be วรังจิติ.
Tamyig, the canonical name for the code sit-tam-Tibt, is wrong; it should be ตัมยิก.
The code xzh-Tibt and the canonical name Zhang-Zhung should be removed; they are not found in a submodule of Module:scripts.

Module:scripts/code to canonical name

Chis (Chisoi) is missing
Dhives Akuru, the canonical name for the code Diak, is wrong; it should be ดิเวส อกุรุ.
Gara (Garay) is missing
Gukh (Khema) is missing
ม้ง, the canonical name for the code Hmng, is wrong; it should be Pahawh Hmong.
IPAchar, the code for the canonical name สัทอักษรสากล, is wrong; it should be Ipach.
Kawi, the canonical name for the code Kawi, is wrong; it should be กวิ.
Krai (Kirat Rai) is missing
Latnx, the code for the canonical name ละติน, is wrong; it should be Latn.
Mro, the canonical name for the code Mroo, is wrong; it should be มโร.
Onao (Ol Onal) is missing
Osmanya, the canonical name for the code Osma, is wrong; it should be อุสมาน.
Sidt (Sidetic) is missing
Khudawadi, the canonical name for the code Sind, is wrong; it should be คุดาบาด.
Sunu (Sunuwar) is missing
Tayo (Lai Tay) is missing
Todr (Todhri) is missing
Tols (Tolong Siki) is missing
Tutg (Tigalari) is missing
Varang Kshiti, the canonical name for the code Wara, is wrong; it should be วรังจิติ.
Tamyig, the canonical name for the code sit-tam-Tibt, is wrong; it should be ตัมยิก.
xka-Arab, the code for the canonical name อาหรับ, is wrong; it should be fa-Arab.
The code xzh-Tibt and the canonical name Zhang-Zhung should be removed; they are not found in a submodule of Module:scripts.

Module:scripts/data

อักษรBlissymbols (Blis) is not used by any language and has no characters listed for auto-detection.
อักษรCypro-Minoan (Cpmn) is not used by any language.
อักษรฮิรางานะ (Hira) is not used by any language.
อักษรคานะ (Hrkt) is not used by any language.
อักษรImage-rendered (Imag) is not used by any language and has no characters listed for auto-detection.
อักษรสัทอักษรสากล (Ipach) is not used by any language and has no characters listed for auto-detection.
อักษรMoon (Moon) is not used by any language and has no characters listed for auto-detection.
รหัสมอร์ส (Morse) is not used by any language and has no characters listed for auto-detection.
อักษรสัญกรณ์ดนตรี (Music) is not used by any language.
อักษรไม่ระบุ (None) is not used by any language and has no characters listed for auto-detection.
อักษรOl Onal (Onao) is not used by any language and has no characters listed for auto-detection.
อักษรRongorongo (Roro) is not used by any language and has no characters listed for auto-detection.
อักษรRumi numerals (Rumin) is not used by any language.
สัญญาณธง (Semap) is not used by any language and has no characters listed for auto-detection.
อักษรVisible Speech (Visp) is not used by any language and has no characters listed for auto-detection.
อักษรmathematical notation (Zmth) is not used by any language.
อักษรสัญลักษณ์ (Zsym) is not used by any language.
อักษรยังไม่กำหนด (Zyyy) is not used by any language and has no characters listed for auto-detection.
อักษรยังไม่มีรหัส (Zzzz) is not used by any language and has no characters listed for auto-detection.
The codes fa-Arab, ug-Arab, ks-Arab, ps-Arab, ur-Arab, tt-Arab, ota-Arab, mzn-Arab, sd-Arab and ku-Arab are currently alias codes. Only one code should be used in the data.
The codes ms-Arab and kk-Arab are currently alias codes. Only one code should be used in the data.
The data key sort_by_scraping for อักษรญี่ปุ่น (Jpan) is invalid.

Checks performed

For multiple data modules:

Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
Each name in the list of other names must appear only once.
otherNames, if present, must be an array.
Each Wikidata item ID must be a positive integer or a string starting with Q and ending with decimal digits.
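The Wikidata rule above can be sketched as follows. This is a minimal illustration in Python, not the module's own Lua code; the function name `is_valid_wikidata_id` is hypothetical, and the sketch assumes that "starting with Q and ending with decimal digits" means a Q followed by nothing but digits.

```python
import re

def is_valid_wikidata_id(value):
    """Return True for a positive integer or a string such as 'Q1860'."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        # Assumption: only decimal digits may follow the leading Q.
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False
```

Under these assumptions, `1860` and `"Q1860"` are accepted, while `0`, `"1860"` and `"Q12a"` are rejected.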

The following must be true of the data used by Module:languages:

Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
If field 2 is not nil, it must be a valid Wikidata item ID.
If field 3 or family is given and not nil, it must be a valid family code.
If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
If family is given, it must be a valid family code.
If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
If sort_key is given, it may either be a string, or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
If entry_name or sort_key is given, the from array must be at least as long as the to array.
If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
If link_tr is present, it must be true.
There must be no data keys besides these: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
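A few of the per-language checks listed above can be sketched like this. The sketch is in Python rather than the module's Lua; the function name `check_language_entry` and the representation of a data table as a dict are assumptions made for illustration. Two of the messages imitate lines shown under Output; the wording of the other two is invented.

```python
# Keys permitted in a language data table, copied from the list above.
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}

def check_language_entry(code, data):
    """Return a list of problem messages for one language data table (hypothetical helper)."""
    name = data.get(1, "???")
    problems = []
    # Unknown data keys.
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"The data key {key} for {name} ({code}) is invalid.")
    # The from array must be at least as long as the to array.
    for field in ("entry_name", "sort_key"):
        value = data.get(field)
        if isinstance(value, dict) and "from" in value and "to" in value:
            if len(value["from"]) < len(value["to"]):
                problems.append(
                    f"{name} ({code}): the from array in {field} is shorter than the to array."
                )
    # override_translit requires a transliteration module.
    if data.get("override_translit") and not data.get("translit"):
        problems.append(
            f"{name} ({code}) has override_translit set, but no transliteration module"
        )
    # link_tr, if present, must be exactly true.
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append(f"{name} ({code}) has link_tr set to a value other than true.")
    return problems
```

For a made-up entry such as `{1: "Santali", "override_translit": True}`, the function returns a single message about the missing transliteration module.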

Checks not performed:

If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).

These are not checked here, because module errors will quickly crop up in entries if these conditions are not met, assuming that Module:utilities attempts to generate a sortkey for a category pertaining to the language in question, or full_link attempts to use the transliteration module.

Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
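The "all the codes and names, and no more" requirement amounts to a two-way comparison between the data and a lookup table. A minimal Python sketch follows, assuming both are plain dicts mapping code to canonical name; the function name `check_code_to_name` and the sample values are hypothetical, and the message wording imitates the lines shown under Output.

```python
def check_code_to_name(data, code_to_name):
    """Compare a code -> canonical name lookup table against the source data."""
    problems = []
    # Every code in the data must be in the table with the same name.
    for code, name in data.items():
        if code not in code_to_name:
            problems.append(f"{code} ({name}) is missing")
        elif code_to_name[code] != name:
            problems.append(
                f"{code_to_name[code]}, the canonical name for the code {code}, "
                f"is wrong; it should be {name}."
            )
    # The table must not contain anything that the data lacks.
    for code, name in code_to_name.items():
        if code not in data:
            problems.append(
                f"The code {code} and the canonical name {name} should be removed."
            )
    return problems


# Made-up sample values, for illustration only.
data = {"en": "English", "fr": "French", "de": "German"}
table = {"en": "English", "de": "Germann", "xx": "Example"}
for line in check_code_to_name(data, table):
    print(line)
```

The same comparison, with keys and values swapped, would cover a table that maps canonical names to codes.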

The following must be true of the data used by Module:etymology languages:

canonicalName must be given.
parent must be given and must be a valid language, family or etymology-only language code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
There must be no data keys besides these: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".

Codes in Module:families data must:

Have canonicalName, which must not be the same as the canonical name of another family.
If family is given, it must be a valid family code.
Have at least one language or subfamily belonging to it.
Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".

Codes in Module:scripts data must:

Have canonicalName.
Have at least one language that lists it as one of its scripts.
Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".
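The usage and autodetection checks for scripts can be sketched as below. This is an illustrative Python sketch, not the module's Lua implementation; the function name `check_scripts` and the dict-based data layout are assumptions. The first two messages follow the wording shown under Output, while the third is invented for the case of a script that is in use but has no characters pattern.

```python
def check_scripts(scripts, languages):
    """Report scripts that are unused or lack a characters pattern (hypothetical helper)."""
    # Collect every script code that some language lists.
    used = {sc for lang in languages.values() for sc in lang.get("scripts", [])}
    problems = []
    for code, data in sorted(scripts.items()):
        name = data.get("canonicalName", "???")
        unused = code not in used
        no_chars = not data.get("characters")
        if unused and no_chars:
            problems.append(
                f"{name} ({code}) is not used by any language and has no "
                "characters listed for auto-detection."
            )
        elif unused:
            problems.append(f"{name} ({code}) is not used by any language.")
        elif no_chars:
            # Invented wording; no such line appears in the output above.
            problems.append(f"{name} ({code}) has no characters listed for auto-detection.")
    return problems
```

Whether the characters value is a valid Lua pattern is not tested here, because Python's `re` module does not implement Lua pattern syntax.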