Orthography Information¶

The Unicode CLDR data has three categories of character support for each orthography: basic, optional, and punctuation.

class jkUnicode.orthography.Orthography(info_obj: Any | None, code: str, script: str, territory: str, info_dict: dict[str, Any], speakers: int)¶

The Orthography object represents an orthography. You usually don’t deal with this object directly; it is used internally by the jkUnicode.orthography.OrthographyInfo object.

Parameters:

info_obj (jkUnicode.orthography.OrthographyInfo) – The parent info object.
code (str) – The ISO-639-1 code for the orthography.
script (str) – The script code of the orthography.
territory (str) – The territory code of the orthography.
info_dict (dict) – The dictionary which contains the rest of the information about the orthography.
speakers (int) – The number of speakers for the orthography.

almost_supported_basic(max_missing: int = 5) → bool¶: Is the orthography supported with at most max_missing base characters missing from the current parent cmap?

almost_supported_full(max_missing: int = 5) → bool¶: Is the orthography supported with at most max_missing characters (base, optional and punctuation characters) missing from the current parent cmap?

almost_supported_punctuation(max_missing: int = 5) → bool¶: Is the orthography supported with at most max_missing punctuation characters missing from the current parent cmap?

cased(codepoint_list: list[int]) → list[int]¶

Return a list of the given codepoints with their case toggled: each codepoint is replaced by its uppercase or lowercase counterpart. If a codepoint has no lowercase or uppercase mapping, it is dropped from the list.

Parameters:: codepoint_list (list) – The list of codepoints.
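The toggling described above can be illustrated with a minimal sketch. It is not the library's implementation: the hypothetical helper toggle_case relies on Python's built-in str case methods, whereas jkUnicode presumably consults its own Unicode data, and dropping one-to-many mappings (such as ß to SS) is an assumption of this sketch.

```python
def toggle_case(codepoint_list):
    """Return the opposite-case codepoint for each input, dropping uncased ones."""
    toggled = []
    for cp in codepoint_list:
        ch = chr(cp)
        # Pick the opposite-case form using Python's built-in Unicode tables.
        if ch.islower():
            other = ch.upper()
        elif ch.isupper():
            other = ch.lower()
        else:
            continue  # no case distinction: drop the codepoint
        # Assumption: keep only one-to-one mappings that change the character.
        if len(other) == 1 and other != ch:
            toggled.append(ord(other))
    return toggled


# "a", "A" and the digit "1"; the digit has no case mapping and is dropped.
print([hex(cp) for cp in toggle_case([0x61, 0x41, 0x31])])
```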

fill_from_default_orthography() → None¶

Sometimes the base codepoints are empty for a variant of an orthography. Try to fill them in from the default variant.

Call this only after the whole list of orthographies has been built; otherwise it will fail, because the default orthography may not be present yet.

forget_cmap() → None¶: Forget the results of the last cmap scan.

from_dict(info_dict: dict[str, Any]) → None¶

Read information for the current orthography from a dictionary. This method is called during initialization of the object and fills in a number of instance attributes:

name: The orthography name.

unicodes_base: The set of base characters for the orthography.

unicodes_optional: The set of optional characters for the orthography.

unicodes_punctuation: The set of punctuation characters for the orthography.

unicodes_any: The previous three sets combined.

get_missing(minimum: bool = False, punctuation: bool = False) → set[int]¶

Return a set of missing characters for support of the orthography. If minimum is true, only required characters are listed. If punctuation is true, only punctuation characters are listed. If both are true, both required and punctuation characters are listed. If both are false, all required, optional, and punctuation characters are listed.

Parameters:

minimum (bool) – Only report missing required characters
punctuation (bool) – Only report missing punctuation
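How the two flags combine can be sketched with plain set operations. This is an illustration only: select_missing is a hypothetical helper, the three input sets stand in for the missing_base, missing_optional and missing_punctuation attributes, and treating "required" as the base set is an assumption.

```python
def select_missing(missing_base, missing_optional, missing_punctuation,
                   minimum=False, punctuation=False):
    """Combine the missing sets according to the two flags."""
    if minimum and punctuation:
        # Both flags: required (base) plus punctuation characters.
        return missing_base | missing_punctuation
    if minimum:
        return set(missing_base)
    if punctuation:
        return set(missing_punctuation)
    # Neither flag: everything that is missing.
    return missing_base | missing_optional | missing_punctuation


# Made-up sample data: one missing codepoint per category.
mb, mo, mp = {0xE4}, {0xE9}, {0x201E}
print(sorted(select_missing(mb, mo, mp, minimum=True, punctuation=True)))
```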

property identifier: str¶: Return a BCP47 identifier for language code/script/territory (read-only).
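As a rough illustration of how such an identifier could be assembled, the sketch below joins the subtags with hyphens, as BCP47 tags do. The helper bcp47_identifier is hypothetical, and skipping the placeholder values 'DFLT' and 'dflt' (the defaults shown in the orthography() signature) is an assumption about the library's behaviour.

```python
def bcp47_identifier(code, script="DFLT", territory="dflt"):
    """Join language, script and territory subtags, skipping placeholders."""
    subtags = [code]
    if script != "DFLT":  # assumed placeholder for "no specific script"
        subtags.append(script)
    if territory != "dflt":  # assumed placeholder for "no specific territory"
        subtags.append(territory)
    return "-".join(subtags)


print(bcp47_identifier("sr", "Cyrl", "RS"))
print(bcp47_identifier("de"))
```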

property ignored_unicodes: set[int]¶: The set of ignored codepoints. If a parent jkUnicode.orthography.OrthographyInfo object exists, it is taken from there.

property info: OrthographyInfo | None¶: The parent jkUnicode.orthography.OrthographyInfo object (read-only).

property name: str¶: The name of the orthography.

scan_cmap() → None¶

Scan the orthography against the current parent cmap. This fills in a number of instance attributes:

missing_base: A set of unicode values that are missing from the basic characters of the orthography.

missing_optional: A set of unicode values that are missing from the optional characters of the orthography.

missing_punctuation: A set of unicode values that are missing from the punctuation characters of the orthography.

missing_all: A set of all the previous combined.

num_missing_base, num_missing_optional, num_missing_punctuation, num_missing_all: The number of missing characters for the previous attributes

base_pc, optional_pc, punctuation_pc: The percentage values of support for the categories basic, optional, and punctuation characters.

The names of these attributes can be used in jkUnicode.orthography.OrthographyInfo.print_report.
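The relationship between these attributes can be sketched with set differences. This is not the library's code: the function scan and its dictionary result are hypothetical, the percentage is assumed to be the share of each category that is present in the cmap, and returning 0.0 for an empty category is an arbitrary choice made for this sketch.

```python
def scan(cmap, unicodes_base, unicodes_optional, unicodes_punctuation):
    """Compute missing sets, counts and support percentages per category."""
    present = set(cmap)  # the codepoints are the keys of the cmap
    categories = (
        ("base", unicodes_base),
        ("optional", unicodes_optional),
        ("punctuation", unicodes_punctuation),
    )
    result = {}
    missing_all = set()
    for name, wanted in categories:
        missing = wanted - present
        missing_all |= missing
        result[f"missing_{name}"] = missing
        result[f"num_missing_{name}"] = len(missing)
        # Assumed definition: percentage of the category that is present.
        if wanted:
            result[f"{name}_pc"] = 100.0 * (len(wanted) - len(missing)) / len(wanted)
        else:
            result[f"{name}_pc"] = 0.0
    result["missing_all"] = missing_all
    result["num_missing_all"] = len(missing_all)
    return result


# Made-up sample data: a tiny cmap and three small character sets.
cmap = {0x61: "a", 0x62: "b", 0x2E: "period"}
report = scan(cmap, {0x61, 0x62, 0x63, 0x64}, {0xE9}, {0x2E, 0x2C})
print(report["num_missing_base"], report["base_pc"])
```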

speakers_supported_by_unicode(u: int) → int¶: If the character were removed from the font, how many fewer speakers would the font support?

property support_basic: bool¶: Is the orthography supported (base and punctuation characters) for the current parent cmap?

property support_full: bool¶: Is the orthography supported (base, optional and punctuation characters) for the current parent cmap?

property support_minimal: bool¶: Is the orthography supported at the minimal level only (base characters) for the current parent cmap?

property support_minimal_inclusive: bool¶: Is the orthography supported at the minimal level or better (base characters present; optional and punctuation characters not required) for the current parent cmap?
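The way the three character categories feed into the support levels can be sketched as follows. The helper support_levels is hypothetical and only mirrors the descriptions above; it leaves out support_minimal, whose exact exclusion rule is not spelled out here.

```python
def support_levels(num_missing_base, num_missing_optional, num_missing_punctuation):
    """Derive support flags from the number of missing characters per category."""
    base_ok = num_missing_base == 0
    optional_ok = num_missing_optional == 0
    punctuation_ok = num_missing_punctuation == 0
    return {
        # Base characters present, nothing else required.
        "minimal_inclusive": base_ok,
        # Base and punctuation characters present.
        "basic": base_ok and punctuation_ok,
        # Base, optional and punctuation characters present.
        "full": base_ok and optional_ok and punctuation_ok,
    }


# All base and punctuation characters present, two optional ones missing.
print(support_levels(0, 2, 0))
```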

property ui: UniInfo¶: The jkUnicode.UniInfo object that is queried for Unicode information.

uses_unicode_any(u: int) → bool¶

Is the codepoint used by this orthography in any set? This is relatively slow. Use jkUnicode.orthography.OrthographyInfo.build_reverse_cmap() if you need to access this information more often.

Parameters:: u (int) – The codepoint.

uses_unicode_base(u: int) → bool¶

Is the codepoint used by this orthography in the base set? This is relatively slow. Use jkUnicode.orthography.OrthographyInfo.build_reverse_cmap() if you need to access this information more often.

Parameters:: u (int) – The codepoint.

class jkUnicode.orthography.OrthographyInfo(ui: UniInfo | None = None, source='CLDR', sort_by_speakers=True)¶

The main Orthography Info object. It reads the information for each orthography from the files in the json subfolder. The JSON data is generated from the specified data source via included Python scripts.

build_reverse_cmap() → None¶: Build a map from each codepoint to a list of indices into the orthographies list for all orthographies that use it as a base or punctuation character.
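The idea of such a reverse map can be sketched in a few lines. This is an illustration rather than the library's code: build_reverse_map is a hypothetical function, and each orthography is modelled as a plain dict with "base" and "punctuation" sets instead of an Orthography object.

```python
from collections import defaultdict


def build_reverse_map(orthographies):
    """Map each codepoint to the indices of the orthographies that use it."""
    reverse = defaultdict(list)
    for index, ortho in enumerate(orthographies):
        # Only base and punctuation characters are indexed, as described above.
        for cp in sorted(ortho["base"] | ortho["punctuation"]):
            reverse[cp].append(index)
    return dict(reverse)


# Made-up sample data: two tiny orthographies.
orthographies = [
    {"base": {0x61, 0xE4}, "punctuation": {0x2E}},
    {"base": {0x61}, "punctuation": {0x2E, 0xBF}},
]
reverse = build_reverse_map(orthographies)
print(reverse[0x61])
```

A lookup in the finished map is a single dictionary access, which is why it pays off when the same question is asked for many codepoints.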

property cmap: dict[int, str]¶

The codepoint to glyph name mapping. When you set the cmap, it is scanned against all orthographies belonging to the OrthographyInfo object.

You set the cmap by passing a dictionary, usually from a font. E.g.:

    from fontTools.ttLib import TTFont
    from jkUnicode.orthography import OrthographyInfo

    o = OrthographyInfo()
    o.cmap = TTFont("myfont.ttf").getBestCmap()

get_almost_supported(max_missing: int = 5) → list[Orthography]¶

Return a list of almost supported orthographies for the current cmap.

Parameters:: max_missing (int) – The maximum allowed number of missing characters.

get_kern_list(include_optional=False) → set[frozenset[int]]¶

Return a set of character pairs that may appear in any supported orthography for the current cmap.

Parameters:: include_optional (bool) – Include optional characters.
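The shape of the returned value, a set of frozensets of codepoints, can be illustrated with a small sketch. The function kern_pairs is hypothetical; forming pairs only within each orthography and pairing every character with every other character (and with itself) are assumptions, and the sketch ignores punctuation entirely.

```python
from itertools import combinations_with_replacement


def kern_pairs(supported, include_optional=False):
    """Collect unordered character pairs from each supported orthography."""
    pairs = set()
    for ortho in supported:
        chars = set(ortho["base"])
        if include_optional:
            chars |= ortho["optional"]
        # A pair of a character with itself collapses to a one-element frozenset.
        for left, right in combinations_with_replacement(sorted(chars), 2):
            pairs.add(frozenset((left, right)))
    return pairs


# Made-up sample data: one orthography with two base and one optional character.
supported = [{"base": {0x61, 0x62}, "optional": {0xE9}}]
print(len(kern_pairs(supported)), len(kern_pairs(supported, include_optional=True)))
```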

get_language_name(code: str) → str¶

Return the nice name for a language by its code.

Parameters:: code (str) – The language code.

get_orthographies_for_char(char: str) → list[Orthography]¶

Get a list of orthographies which use a supplied character at base level.

Parameters:: char (str) – The character.

get_orthographies_for_unicode(u: int) → list[Orthography]¶

Get a list of orthographies which use a supplied codepoint at base level.

Parameters:: u (int) – The codepoint.

get_orthographies_for_unicode_any(u: int) → list[Orthography]¶

Get a list of orthographies which use a supplied codepoint at any level.

Parameters:: u (int) – The codepoint.

get_script_name(code: str = 'DFLT') → str¶

Return the nice name for a script by its code.

Parameters:: code (str) – The script code.

get_supported_orthographies(full_only: bool = False) → list[Orthography]¶

Get a list of supported orthographies for the current cmap.

Parameters:: full_only (bool) – Return only orthographies which have both basic and optional characters present for the current cmap.

get_supported_orthographies_minimum() → list[Orthography]¶: Get a list of orthographies with minimal support for the current cmap only.

get_supported_orthographies_minimum_inclusive() → list[Orthography]¶: Get a list of orthographies with minimal or better support for the current cmap.

get_territory_name(code: str = 'dflt') → str¶

Return the nice name for a territory by its code.

Parameters:: code (str) – The territory code.

orthography(code: str, script: str = 'DFLT', territory: str = 'dflt') → Orthography | None¶

Access a particular orthography by its language, script and territory code.

Parameters:

code (str) – The language code.
script (str) – The script code.
territory (str) – The territory code.

print_report(otlist: list[Orthography], attr: str, bcp47: bool = False) → None¶

Print a formatted report for a given list of orthographies.

Parameters:

otlist (List[Orthography]) – The list of orthographies.
attr (str) – The name of the attribute of the orthography object that will be shown in the report (missing_base, missing_optional, missing_punctuation, missing_all, num_missing_base, num_missing_optional, num_missing_punctuation, base_pc, optional_pc, punctuation_pc, unicodes_base, unicodes_optional, unicodes_punctuation).
bcp47 (bool) – Output BCP47 subtags instead of names

report_kern_list(bcp47=False, include_optional=False) → None¶

Print a list of character pairs that may appear in any supported orthography for the current cmap.

Parameters:: bcp47 (bool) – Output BCP47 subtags instead of names

report_missing(codes: list[str], minimum=False, punctuation=False, bcp47=False) → None¶

Print a report of missing characters for the given BCP47 language subtags. If minimum is true, only required characters are listed. If punctuation is true, only punctuation characters are listed. If both are true, both required and punctuation characters are listed. If both are false, all required, optional, and punctuation characters are listed.

Parameters:

codes (List[str]) – BCP47 language subtags
minimum (bool) – Only report missing required characters
punctuation (bool) – Only report missing punctuation
bcp47 (bool) – Output BCP47 subtags instead of names

report_missing_punctuation(bcp47=False) → None¶

Print a report of orthographies which have all basic letters present, but are missing punctuation characters.

Parameters:: bcp47 (bool) – Output BCP47 subtags instead of names

report_near_misses(n: int = 5, bcp47=False) → None¶

Print a report of orthographies which have at most n characters missing.

Parameters:

n (int) – The maximum number of missing characters
bcp47 (bool) – Output BCP47 subtags instead of names

report_supported(full_only: bool = False, bcp47=False) → None¶

Print a report of supported orthographies for the current cmap.

Parameters:

full_only (bool) – Only report orthographies which have both basic and optional characters present
bcp47 (bool) – Output BCP47 subtags instead of names

report_supported_minimum(bcp47=False) → None¶

Print a report of minimally supported orthographies for the current cmap (no punctuation, no optional characters present).

Parameters:: bcp47 (bool) – Output BCP47 subtags instead of names

report_supported_minimum_inclusive(bcp47=False) → None¶

Print a report of minimally supported orthographies for the current cmap (no punctuation, no optional characters required).

Parameters:: bcp47 (bool) – Output BCP47 subtags instead of names
