Orthography Information

The Unicode CLDR data has three categories of character support for each orthography: basic, optional, and punctuation.

class jkUnicode.orthography.Orthography(info_obj: Any | None, code: str, script: str, territory: str, info_dict: dict[str, Any], speakers: int)

The Orthography object represents an orthography. You usually don’t deal with this object directly; it is used internally by the jkUnicode.orthography.OrthographyInfo object.

Parameters:
  • info_obj (jkUnicode.orthography.OrthographyInfo) – The parent info object.

  • code (str) – The ISO 639-1 code for the orthography.

  • script (str) – The script code of the orthography.

  • territory (str) – The territory code of the orthography.

  • info_dict (dict) – The dictionary which contains the rest of the information about the orthography.

  • speakers (int) – The number of speakers for the orthography.

almost_supported_basic(max_missing: int = 5) bool

Is the orthography supported with a maximum of max_missing base characters for the current parent cmap?

almost_supported_full(max_missing: int = 5) bool

Is the orthography supported with a maximum of max_missing characters (base, optional and punctuation characters) for the current parent cmap?

almost_supported_punctuation(max_missing: int = 5) bool

Is the orthography supported with a maximum of max_missing punctuation characters for the current parent cmap?
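
The three almost_supported_* checks follow the same pattern: count the missing characters of the relevant category and compare against max_missing. The following is a minimal sketch of that pattern, not the library's code. The function name almost_supported is hypothetical, and the sketch assumes that an orthography with nothing missing does not count as "almost" supported; the text above does not state this.

```python
def almost_supported(missing: set[int], max_missing: int = 5) -> bool:
    """Sketch: True if some, but at most max_missing, codepoints are missing.

    Assumption: a fully supported orthography (nothing missing) is
    reported as False here, because it is supported rather than "almost".
    """
    return 0 < len(missing) <= max_missing


# Two missing codepoints are within the default tolerance of five.
print(almost_supported({0x1E9E, 0x00DF}))
```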

cased(codepoint_list: list[int]) list[int]

Return a list of codepoints in which each input codepoint is replaced by its opposite-case Unicode mapping. If a codepoint has no lowercase or uppercase mapping, it is dropped from the list.

Parameters:

codepoint_list (list) – The list of codepoints.
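
To illustrate the behaviour described for cased(), here is a rough sketch that uses Python's built-in string case methods. This is an assumption made for illustration: the library queries its own Unicode data, and the name toggle_case is hypothetical.

```python
def toggle_case(codepoints: list[int]) -> list[int]:
    """Sketch: map each codepoint to its opposite case, dropping uncased ones."""
    result = []
    for cp in codepoints:
        char = chr(cp)
        if char.islower():
            mapped = char.upper()
        elif char.isupper():
            mapped = char.lower()
        else:
            continue  # no case mapping, so the codepoint is dropped
        # Keep only simple one-to-one mappings (this skips e.g. "ß" -> "SS").
        if len(mapped) == 1 and mapped != char:
            result.append(ord(mapped))
    return result


# "a" and "B" are toggled, the digit "1" has no case and is dropped.
print(toggle_case([0x61, 0x42, 0x31]))
```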

fill_from_default_orthography() None

Sometimes the base codepoints are empty for a variant of an orthography. Try to fill them in from the default variant.

Call this only after the whole list of orthographies has been built. Otherwise it will fail, because the default orthography may not exist yet.

forget_cmap() None

Forget the results of the last cmap scan.

from_dict(info_dict: dict[str, Any]) None

Read information for the current orthography from a dictionary. This method is called during initialization of the object and fills in a number of instance attributes:

name: The orthography name.

unicodes_base: The set of base characters for the orthography.

unicodes_optional: The set of optional characters for the orthography.

unicodes_punctuation: The set of punctuation characters for the orthography.

unicodes_any: The previous three sets combined.

get_missing(minimum: bool = False, punctuation: bool = False) set[int]

Return a set of missing characters for support of the orthography. If minimum is true, only required characters are listed. If punctuation is true, only punctuation characters are listed. If both are true, both required and punctuation characters are listed. If both are false, all required, optional, and punctuation characters are listed.

Parameters:
  • minimum (bool) – Only report missing required characters

  • punctuation (bool) – Only report missing punctuation
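
The four flag combinations described for get_missing can be sketched as follows. The function select_missing and its explicit set arguments are hypothetical; the real method reads the sets from the object after a cmap scan.

```python
def select_missing(
    missing_base: set[int],
    missing_optional: set[int],
    missing_punctuation: set[int],
    minimum: bool = False,
    punctuation: bool = False,
) -> set[int]:
    """Sketch of the flag logic: pick which missing sets to report."""
    if minimum and punctuation:
        return missing_base | missing_punctuation
    if minimum:
        return set(missing_base)
    if punctuation:
        return set(missing_punctuation)
    # Neither flag set: report everything that is missing.
    return missing_base | missing_optional | missing_punctuation


base, optional, punct = {0xDF}, {0xE9}, {0x201E}
print(sorted(select_missing(base, optional, punct, minimum=True)))
```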

property identifier: str

Return a BCP47 identifier for language code/script/territory (read-only).
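
A BCP47 tag joins language, script and territory subtags with hyphens. The sketch below shows one plausible way to compose it. It assumes that the placeholder values 'DFLT' (script) and 'dflt' (territory) are left out of the tag; the helper name bcp47_identifier is hypothetical and the library may format the identifier differently.

```python
def bcp47_identifier(code: str, script: str = "DFLT", territory: str = "dflt") -> str:
    """Sketch: join the subtags, skipping the assumed default placeholders."""
    parts = [code]
    if script != "DFLT":
        parts.append(script)
    if territory != "dflt":
        parts.append(territory)
    return "-".join(parts)


print(bcp47_identifier("sr", "Cyrl", "RS"))
```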

property ignored_unicodes: set[int]

The set of ignored codepoints. If a parent jkUnicode.orthography.OrthographyInfo object exists, it is taken from there.

property info: OrthographyInfo | None

The parent jkUnicode.orthography.OrthographyInfo object (read-only).

property name: str

The name of the orthography.

scan_cmap() None

Scan the orthography against the current parent cmap. This fills in a number of instance attributes:

missing_base: A set of unicode values that are missing from the basic characters of the orthography.

missing_optional: A set of unicode values that are missing from the optional characters of the orthography.

missing_punctuation: A set of unicode values that are missing from the punctuation characters of the orthography.

missing_all: A set of all the previous combined.

num_missing_base, num_missing_optional, num_missing_punctuation, num_missing_all: The number of missing characters in each of the previous sets.

base_pc, optional_pc, punctuation_pc: The percentage values of support for the categories basic, optional, and punctuation characters.

The names of these attributes can be used in jkUnicode.orthography.OrthographyInfo.print_report.
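
Conceptually, the scan is a set difference between the characters an orthography needs and the codepoints in the cmap. The following sketch computes values with the attribute names listed above and returns them in a dictionary. It is not the library's implementation: the function scan is hypothetical, and both the percentage formula and the 0.0 result for an empty category are assumptions.

```python
def scan(
    cmap: dict[int, str],
    unicodes_base: set[int],
    unicodes_optional: set[int],
    unicodes_punctuation: set[int],
) -> dict[str, object]:
    """Sketch: compare the required sets of one orthography against a cmap."""
    present = set(cmap)
    result: dict[str, object] = {}
    missing_all: set[int] = set()
    categories = (
        ("base", unicodes_base),
        ("optional", unicodes_optional),
        ("punctuation", unicodes_punctuation),
    )
    for name, required in categories:
        missing = required - present
        missing_all |= missing
        result[f"missing_{name}"] = missing
        result[f"num_missing_{name}"] = len(missing)
        # Assumed formula: share of required characters that are present.
        supported = len(required) - len(missing)
        result[f"{name}_pc"] = 100.0 * supported / len(required) if required else 0.0
    result["missing_all"] = missing_all
    result["num_missing_all"] = len(missing_all)
    return result


cmap = {0x41: "A", 0x42: "B", 0x2E: "period"}
report = scan(cmap, {0x41, 0x42, 0x43, 0x44}, {0xC4}, {0x2E, 0x2C})
print(report["num_missing_all"], report["base_pc"])
```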

speakers_supported_by_unicode(u: int) int

If the character were removed from the font, how many fewer speakers would the font support?

property support_basic: bool

Is the orthography supported (base and punctuation characters) for the current parent cmap?

property support_full: bool

Is the orthography supported (base, optional and punctuation characters) for the current parent cmap?

property support_minimal: bool

Is the orthography supported (base characters) for the current parent cmap?

property support_minimal_inclusive: bool

Is the orthography supported (base characters only) for the current parent cmap?

property ui: UniInfo

The jkUnicode.UniInfo object that is queried for Unicode information.

uses_unicode_any(u: int) bool

Is the codepoint used by this orthography in any set? This is relatively slow. Use jkUnicode.orthography.OrthographyInfo.build_reverse_cmap() if you need to access this information more often.

Parameters:

u (int) – The codepoint.

uses_unicode_base(u: int) bool

Is the codepoint used by this orthography in the base set? This is relatively slow. Use jkUnicode.orthography.OrthographyInfo.build_reverse_cmap() if you need to access this information more often.

Parameters:

u (int) – The codepoint.

class jkUnicode.orthography.OrthographyInfo(ui: UniInfo | None = None, source='CLDR', sort_by_speakers=True)

The main Orthography Info object. It reads the information for each orthography from the files in the json subfolder. The JSON data is generated from the specified data source via included Python scripts.

build_reverse_cmap() None

Build a map from each codepoint to a list of indices into the orthographies list, covering all orthographies that use the codepoint as a base or punctuation character.
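
The idea of such a reverse map can be sketched with plain sets. Here each orthography is represented as a hypothetical (base, punctuation) tuple of codepoint sets, and build_reverse_map is an illustrative name; the library works on its own Orthography objects instead.

```python
from collections import defaultdict


def build_reverse_map(
    orthographies: list[tuple[set[int], set[int]]],
) -> dict[int, list[int]]:
    """Sketch: map each codepoint to the indices of orthographies using it."""
    reverse: defaultdict[int, list[int]] = defaultdict(list)
    for index, (base, punctuation) in enumerate(orthographies):
        # A codepoint in both sets is recorded once per orthography.
        for cp in base | punctuation:
            reverse[cp].append(index)
    return dict(reverse)


orthographies = [
    ({0x61, 0x62}, {0x2E}),          # uses a, b and the full stop
    ({0x61, 0xE4}, {0x2E, 0x201E}),  # uses a, ä, the full stop and „
]
reverse = build_reverse_map(orthographies)
print(reverse[0x61], reverse[0xE4])
```

With the map built once, looking up which orthographies use a codepoint is a single dictionary access instead of a scan over every orthography.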

property cmap: dict[int, str]

The codepoint to glyph name mapping. When you set the cmap, it is scanned against all orthographies belonging to the OrthographyInfo object.

You set the cmap by passing a dictionary, usually from a font. E.g.:

    from fontTools.ttLib import TTFont
    from jkUnicode.orthography import OrthographyInfo

    o = OrthographyInfo()
    o.cmap = TTFont("myfont.ttf").getBestCmap()

get_almost_supported(max_missing: int = 5) list[Orthography]

Return a list of almost supported orthographies for the current cmap.

Parameters:

max_missing (int) – The maximum allowed number of missing characters.

get_kern_list(include_optional=False) set[frozenset[int]]

Return a set of character pairs that may appear in any supported orthography for the current cmap. Each pair is a frozenset of codepoints.

Parameters:

include_optional (bool) – Include optional characters.
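
The return type set[frozenset[int]] implies that pairs are unordered, so a pair and its reverse are stored once. A rough sketch of how such a set could be assembled is shown below. The function kern_pairs is hypothetical, and including a character paired with itself (a one-element frozenset) is an assumption; the library may build its pairs differently.

```python
from itertools import combinations_with_replacement


def kern_pairs(supported: list[set[int]]) -> set[frozenset[int]]:
    """Sketch: collect unordered codepoint pairs from each supported orthography."""
    pairs: set[frozenset[int]] = set()
    for chars in supported:
        for a, b in combinations_with_replacement(sorted(chars), 2):
            # frozenset((a, a)) collapses to a single-element set.
            pairs.add(frozenset((a, b)))
    return pairs


# Two orthographies that share "A": pairs are only formed within each one.
pairs = kern_pairs([{0x41, 0x56}, {0x41, 0x42}])
print(len(pairs))
```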

get_language_name(code: str) str

Return the nice name for a language by its code.

Parameters:

code (str) – The language code.

get_orthographies_for_char(char: str) list[Orthography]

Get a list of orthographies which use a supplied character at base level.

Parameters:

char (str) – The character.

get_orthographies_for_unicode(u: int) list[Orthography]

Get a list of orthographies which use a supplied codepoint at base level.

Parameters:

u (int) – The codepoint.

get_orthographies_for_unicode_any(u: int) list[Orthography]

Get a list of orthographies which use a supplied codepoint at any level.

Parameters:

u (int) – The codepoint.

get_script_name(code: str = 'DFLT') str

Return the nice name for a script by its code.

Parameters:

code (str) – The script code.

get_supported_orthographies(full_only: bool = False) list[Orthography]

Get a list of supported orthographies for the current cmap.

Parameters:

full_only (bool) – Return only orthographies which have both basic and optional characters present for the current cmap.

get_supported_orthographies_minimum() list[Orthography]

Get a list of orthographies that have only minimal support for the current cmap.

get_supported_orthographies_minimum_inclusive() list[Orthography]

Get a list of orthographies with minimal or better support for the current cmap.

get_territory_name(code: str = 'dflt') str

Return the nice name for a territory by its code.

Parameters:

code (str) – The territory code.

orthography(code: str, script: str = 'DFLT', territory: str = 'dflt') Orthography | None

Access a particular orthography by its language, script and territory code.

Parameters:
  • code (str) – The language code.

  • script (str) – The script code.

  • territory (str) – The territory code.

print_report(otlist: list[Orthography], attr: str, bcp47: bool = False) None

Print a formatted report for a given list of orthographies.

Parameters:
  • otlist (List[Orthography]) – The list of orthographies.

  • attr (str) – The name of the attribute of the orthography object that will be shown in the report (missing_base, missing_optional, missing_punctuation, missing_all, num_missing_base, num_missing_optional, num_missing_punctuation, base_pc, optional_pc, punctuation_pc, unicodes_base, unicodes_optional, unicodes_punctuation).

  • bcp47 (bool) – Output BCP47 subtags instead of names

report_kern_list(bcp47=False, include_optional=False) None

Print a list of character pairs that may appear in any supported orthography for the current cmap.

Parameters:

bcp47 (bool) – Output BCP47 subtags instead of names

report_missing(codes: list[str], minimum=False, punctuation=False, bcp47=False) None

Print a report of missing characters for the given BCP47 language subtags. If minimum is true, only required characters are listed. If punctuation is true, only punctuation characters are listed. If both are true, both required and punctuation characters are listed. If both are false, all required, optional, and punctuation characters are listed.

Parameters:
  • codes (List[str]) – BCP47 language subtags

  • minimum (bool) – Only report missing required characters

  • punctuation (bool) – Only report missing punctuation

  • bcp47 (bool) – Output BCP47 subtags instead of names

report_missing_punctuation(bcp47=False) None

Print a report of orthographies which have all basic letters present, but are missing punctuation characters.

Parameters:

bcp47 (bool) – Output BCP47 subtags instead of names

report_near_misses(n: int = 5, bcp47=False) None

Print a report of orthographies which have a maximum of n characters missing.

Parameters:
  • n (int) – The maximum number of missing characters

  • bcp47 (bool) – Output BCP47 subtags instead of names

report_supported(full_only: bool = False, bcp47=False) None

Print a report of supported orthographies for the current cmap.

Parameters:
  • full_only (bool) – Only report orthographies which have both basic and optional characters present

  • bcp47 (bool) – Output BCP47 subtags instead of names

report_supported_minimum(bcp47=False) None

Print a report of minimally supported orthographies for the current cmap (no punctuation, no optional characters present).

Parameters:

bcp47 (bool) – Output BCP47 subtags instead of names

report_supported_minimum_inclusive(bcp47=False) None

Print a report of minimally supported orthographies for the current cmap (no punctuation, no optional characters required).

Parameters:

bcp47 (bool) – Output BCP47 subtags instead of names