class EL_ZSTRING
Client examples: ADD_ALBUM_ART_TASK ; AMAZON_INSTANT_ACCESS_TEST_SET ; BENCHMARK_HTML ; BENCHMARK_TABLE ; CAD_MODEL_SLICER ; CAMERA_TRANSFER_COMMAND ; CASELESS_STRING_TEST ; CHECK_LOCALE_STRINGS_COMMAND ; CLASS_DESCENDANTS_COMMAND ; CLASS_FEATURE ; CLASS_FILE_NAME_NORMALIZER ; CLASS_FILE_NAME_NORMALIZER_TEST_SET ; CLASS_LINK ; CLASS_LINK_LIST ; CLASS_LINK_OCCURRENCE_INTERVALS ; CLASS_NOTES ; CLASS_NOTE_EDITOR ; CLASS_RENAMING_COMMAND ; CODEBASE_METRICS ; CODEC_INFO
Usually referenced with the alias ZSTRING, this string is a memory efficient alternative to using STRING_32. When an application mainly uses characters from the ISO-8859-15 character set, the memory saving can be as much as 70%, while the execution efficiency is roughly the same as for STRING_8. For short strings the saving is much less: about 50%. ISO-8859-15 covers most Western european languages.
Class ZSTRING_TEST_SET
DEFAULT CODEC
By default area characters are encoded using the codec {EL_SHARED_ZCODEC_FACTORY}.default_codec. By default this is ISO-8859-15 but can be set for the application using the command line option:
-zstring_codec [name]
Example:
-zstring_codec WINDOWS-1258
See class constant {EL_ZCODEC_FACTORY}.Codec_option_name.
FEATURES
ZSTRING has many useful routines not found in STRING_32. Probably the most useful is Python style templates using the #$ as an alias for substituted_tuple, and place holders indicated by %S, which by coincidence is both an Eiffel escape sequence and a Python one (Python is actually %s).
ARTICLES
There is a detailed article about this class here: https://www.eiffel.org/blog/finnianr/introducing_class_zstring
BENCHMARKS
Comparison of ZSTRING against STRING_32 for basic string operations
CAVEAT
There is a caveat attached to using ZSTRING which is that if your application uses very many characters outside of the ISO-8859-15 character-set, the execution efficiency does down substantially.
This is something to consider if your application is going to be used in for example: Russia or Japan. If the user locale is for a language that is supported by a ISO-8859-x what you can do is over-ride {EL_SHARED_ZSTRING_CODEC}.Default_codec and initialize it immediately after application launch. This will force ZSTRING to switch to a more optimal character-set for the user-locale.
The execution performance will be slow for Asian characters, but a planned solution is to make a swappable alternative implementation that works equally well with Asian character sets and non-Western European sets. The version used can be set by changing an ECF variable or perhaps the build target.
There two ways to go about achieving an efficient implementation for Asian character sets:
THE CODE FUNCTION
There is an important difference between how ZSTRING implements the function {READABLE_STRING_GENERAL}.code and how STRING_32 implements it. For the most part ZSTRING will return unicode, but a small subset of characters will be different. This is why code has been rename as z_code. To get true unicode use the function {ZSTRING}.unicode
The exception to this rule is if {ZSTRING}.default_codec has been changed to return an instance of EL_ISO_8859_1_ZCODEC instead of the default {EL_ISO_8859_15_ZCODEC}. This can be done using the command line option -zstring_codec.
note
description: "[
Usually referenced with the alias **ZSTRING**, this string is a memory efficient alternative to using
${STRING_32}. When an application mainly uses characters from the ISO-8859-15 character set,
the memory saving can be as much as 70%, while the execution efficiency is roughly the same as for
${STRING_8}. For short strings the saving is much less: about 50%.
ISO-8859-15 covers most Western european languages.
]"
tests: "Class ${ZSTRING_TEST_SET}"
notes: "See end of class"
author: "Finnian Reilly"
copyright: "Copyright (c) 2001-2022 Finnian Reilly"
contact: "finnian at eiffel hyphen loop dot com"
license: "MIT license (See: en.wikipedia.org/wiki/MIT_License)"
date: "2024-10-06 10:36:29 GMT (Sunday 6th October 2024)"
revision: "123"
class
EL_ZSTRING
inherit
EL_READABLE_ZSTRING
rename
append_string_8 as append_raw_string_8
export
{ANY}
-- Element change
append_boolean, append_character, append_character_8,
append_double, append_real, append_integer, append_natural,
append_real_32, append_real_64, append_rounded_real, append_rounded_double,
append_integer_8, append_integer_16, append_integer_32, append_integer_64,
append_natural_8, append_natural_16, append_natural_32, append_natural_64,
append_replaced, append_raw_string_8, append_from_right, append_from_right_general,
append_string, append, append_string_general, append_substring, append_substring_general,
append_utf_8, append_utf_16_le, append_encoded, append_encodeable, append_encoded_any,
append_unicode,
extend, enclose, fill_blank, fill_character, multiply,
prepend_boolean, prepend_character, prepend_integer, prepend_integer_32,
prepend_real_32, prepend_real, prepend_real_64, prepend_double, prepend_substring,
prepend, prepend_string, prepend_string_general, prepend_compatible,
precede, put_unicode, quote, translate,
-- Transformation
crop, expand_tabs, hide, mirror, replace_character, replace_delimited_substring,
replace_delimited_substring_general, replace_substring, replace_substring_all, replace_substring_general,
replace_set_members_8, replace_set_members, reveal,
to_canonically_spaced, to_lower, to_proper, to_upper, translate_or_delete,
unescape,
-- Removal
keep_head, keep_tail, left_adjust, remove_head, remove_quotes, remove_tail, right_adjust,
prune_all_leading, prune_all_trailing,
-- Contract support
Encoding
{EL_SHARED_ZSTRING_CODEC} order_comparison
{EL_ZSTRING_IMPLEMENTATION} append_string_general_for_type
{EL_ZCODEC} Once_interval_list
redefine
new_list
select
String_searcher
end
EL_WRITABLE
rename
write_encoded_character_8 as append_raw_character_8,
write_character_8 as append_character_8,
write_character_32 as append_character,
write_integer_8 as append_integer_8,
write_integer_16 as append_integer_16,
write_integer_32 as append_integer,
write_integer_64 as append_integer_64,
write_natural_8 as append_natural_8,
write_natural_16 as append_natural_16,
write_natural_32 as append_natural_32,
write_natural_64 as append_natural_64,
write_encoded_string_8 as append_raw_string_8,
write_real_32 as append_real,
write_real_64 as append_double,
write_string as append_string,
write_string_8 as append_string_8,
write_string_32 as append_string_32,
write_string_general as append_string_general,
write_boolean as append_boolean,
write_pointer as append_pointer
undefine
append_string_general
redefine
write_path, write_path_steps
end
STRING_GENERAL
rename
append as append_string_general,
append_code as append_z_code,
append_substring as append_substring_general,
code as z_code,
ends_with as ends_with_general,
has_code as has_unicode,
is_case_insensitive_equal as is_case_insensitive_equal_general,
prepend as prepend_string_general,
prepend_substring as prepend_substring_general,
put_code as put_z_code,
same_caseless_characters as same_caseless_characters_general,
same_characters as same_characters_general,
same_string as same_string_general,
split as split_list,
starts_with as starts_with_general
undefine
copy, hash_code, out, index_of, last_index_of, occurrences,
-- Status query
ends_with_general, has, has_unicode,
is_real, is_real_32, is_double, is_equal, is_real_64,
is_integer_8, is_integer_16, is_integer, is_integer_32, is_integer_64,
is_natural_8, is_natural_16, is_natural, is_natural_32, is_natural_64,
-- Comparison
same_characters_general, starts_with_general, same_caseless_characters_general,
-- Conversion
to_boolean, to_real, to_real_32, to_double, to_real_64,
to_integer_8, to_integer_16, to_integer, to_integer_32, to_integer_64,
to_natural_8, to_natural_16, to_natural, to_natural_32, to_natural_64,
as_string_32, to_string_32, as_string_8, to_string_8, split_list,
-- Element change
append_string_general, append_substring_general, prepend_string_general,
-- Implementation
is_valid_integer_or_natural
end
INDEXABLE [CHARACTER_32, INTEGER]
rename
upper as count
undefine
copy, is_equal, out, new_cursor
redefine
changeable_comparison_criterion, prune_all
end
RESIZABLE [CHARACTER_32]
undefine
copy, is_equal, out
redefine
changeable_comparison_criterion
end
DEBUG_OUTPUT -- not working
rename
debug_output as to_string_32
undefine
copy, is_equal, out
end
create
make, make_empty, make_from_zcode_area, make_shared, make_filled, make_from_file,
-- other strings
make_from_other, make_from_string, make_from_string_8, make_from_general, make_from_substring,
-- Encodings
make_from_utf_8, make_from_utf_16_le, make_from_latin_1_c
convert
-- from
make_from_general ({STRING_8, STRING_32, IMMUTABLE_STRING_8, IMMUTABLE_STRING_32}),
-- to
to_string_32: {STRING_32}, to_latin_1: {STRING}
feature -- Duplication
plus alias "+" (s: READABLE_STRING_GENERAL): like Current
do
Result := new_string (count + s.count)
Result.append (Current)
Result.append_string_general (s)
end
feature -- Basic operations
do_with_splits (a_separator: READABLE_STRING_GENERAL; action: PROCEDURE [like Current])
-- apply `action' for all delimited substrings
do
across split_on_string (a_separator) as str loop
action (str.item_copy)
end
end
feature -- Status query
Changeable_comparison_criterion: BOOLEAN = False
for_all_split (a_separator: READABLE_STRING_GENERAL; is_true: PREDICATE [like Current]): BOOLEAN
-- `True' if all split substrings match `is_true'
do
Result := across split_on_string (a_separator) as str all is_true (str.item) end
end
there_exists_split (a_separator: READABLE_STRING_GENERAL; is_true: PREDICATE [like Current]): BOOLEAN
-- `True' if one split substring matches `is_true'
do
Result := across split_on_string (a_separator) as str some is_true (str.item) end
end
feature -- Element change
append_all (a_list: ITERABLE [like Current])
local
cursor: ITERATION_CURSOR [like Current]
do
cursor := a_list.new_cursor
grow (count + sum_count (cursor))
if attached {INDEXABLE_ITERATION_CURSOR [like Current]} cursor as l_cursor then
l_cursor.start
else
cursor := a_list.new_cursor
end
from until cursor.after loop
append (cursor.item)
cursor.forth
end
end
append_all_general (a_list: ITERABLE [READABLE_STRING_GENERAL])
local
cursor: ITERATION_CURSOR [READABLE_STRING_GENERAL]
do
cursor := a_list.new_cursor
grow (count + sum_count (cursor))
if attached {INDEXABLE_ITERATION_CURSOR [READABLE_STRING_GENERAL]} cursor as l_cursor then
l_cursor.start
else
cursor := a_list.new_cursor
end
from until cursor.after loop
append_string_general (cursor.item)
cursor.forth
end
end
edit (left_delimiter, right_delimiter: READABLE_STRING_GENERAL; a_edit: PROCEDURE [INTEGER, INTEGER, ZSTRING])
local
editor: EL_ZSTRING_EDITOR
do
create editor.make (Current)
editor.for_each (left_delimiter, right_delimiter, a_edit)
end
escape (escaper: EL_STRING_ESCAPER [ZSTRING])
do
make_from_other (escaper.escaped (Current, False))
end
insert_character (uc: CHARACTER_32; i: INTEGER)
local
c: CHARACTER
do
c := encoded_character (uc)
String_8.insert_character (Current, c, i)
shift_unencoded_from (i, 1)
inspect c
when Substitute then
put_unencoded (uc, i)
else
end
ensure
one_more_character: count = old count + 1
inserted: item (i) = uc
stable_before_i: elks_checking implies substring (1, i - 1) ~ (old substring (1, i - 1))
stable_after_i: elks_checking implies substring (i + 1, count) ~ (old substring (i, count))
end
insert_string (s: EL_READABLE_ZSTRING; i: INTEGER)
local
l_count, old_count: INTEGER
do
old_count := count
internal_insert_string (s, i)
inspect respective_encoding (s)
when Both_have_mixed_encoding then
if attached empty_unencoded_buffer as buffer then
l_count := i - 1
if l_count.to_boolean then
buffer.append_substring (Current, 1, i - 1, 0)
end
if s.count.to_boolean then
buffer.append (s, l_count)
l_count := l_count + s.count
end
if i <= old_count then
buffer.append_substring (Current, i, old_count, l_count)
end
set_unencoded_from_buffer (buffer)
end
when Only_current then
shift_unencoded_from (i, s.count)
when Only_other then
append_unencoded (s, i - 1)
else
end
end
insert_string_general (s: READABLE_STRING_GENERAL; i: INTEGER)
do
insert_string (adapted_argument (s, 1), i)
end
left_pad (uc: CHARACTER_32; a_count: INTEGER)
do
prepend_string (new_padding (uc, a_count))
end
right_pad (uc: CHARACTER_32; a_count: INTEGER)
do
append_string (new_padding (uc, a_count))
end
set_from_latin_1_c (latin_1_ptr: POINTER)
local
latin: EL_STRING_8
do
latin := String_8.c_string (latin_1_ptr)
if latin.is_ascii then
set_from_ascii (latin)
else
wipe_out
append_string_general (latin)
end
end
substitute_tuple (inserts: TUPLE)
do
make_from_other (substituted_tuple (inserts))
end
feature -- Removal
prune (uc: CHARACTER_32)
-- Remove one occurrence of `uc' if any. (`INDEXABLE' implementation)
local
i: INTEGER
do
i := index_of (uc, 1)
if i > 0 then
remove (i)
end
end
prune_all (uc: CHARACTER_32)
-- Remove all occurrences of `c'. (`INDEXABLE' redefinition)
local
i, j, i_upper, block_index, last_upper: INTEGER; encoded_c, c_i: CHARACTER_8; uc_i: CHARACTER_32
c_is_substitute: BOOLEAN; iter: EL_COMPACT_SUBSTRINGS_32_ITERATION
do
i_upper := count - 1
encoded_c := encoded_character (uc); c_is_substitute := encoded_c = Substitute
if attached area as l_area and then attached unencoded_area as uc_area and then uc_area.count > 0
and then attached empty_unencoded_buffer as buffer
then
last_upper := buffer.last_upper
from until i > i_upper loop
c_i := l_area [i]
inspect c_i
when Substitute then
uc_i := iter.item ($block_index, uc_area, i + 1)
if c_is_substitute implies uc_i /= uc then
l_area [j] := c_i
last_upper := buffer.extend (uc_i, last_upper, j + 1)
j := j + 1
end
else
if encoded_c /= c_i then
l_area [j] := c_i
j := j + 1
end
end
i := i + 1
end
buffer.set_last_upper (last_upper)
set_unencoded_from_buffer (buffer)
elseif attached area as l_area then
from until i > i_upper loop
c_i := l_area [i]
if c_i /= encoded_c then
l_area [j] := c_i
j := j + 1
end
i := i + 1
end
end
set_count (j)
end
remove (i: INTEGER)
do
internal_remove (i)
remove_unencoded_substring (i, i)
ensure then
valid_unencoded: is_valid
end
remove_substring (start_index, end_index: INTEGER)
do
String_8.remove_substring (Current, start_index, end_index)
remove_unencoded_substring (start_index, end_index)
ensure
valid_unencoded: is_valid
removed: elks_checking implies Current ~ (old substring (1, start_index - 1) + old substring (end_index + 1, count))
end
wipe_out
do
internal_wipe_out
internal_hash_code := 0
make_unencoded
end
feature {NONE} -- Factory
new_list (a_count: INTEGER): EL_ZSTRING_LIST
do
create Result.make (a_count)
end
new_padding (uc: CHARACTER_32; a_count: INTEGER): like Current
local
delta: INTEGER
do
delta := a_count - count
if delta > 0 then
create Result.make_filled (uc, delta)
else
Result := Once_adapted_argument [0]
Result.wipe_out
end
end
new_string (n: INTEGER): like Current
-- New instance of current with space for at least `n' characters.
do
create Result.make (n)
end
feature {NONE} -- Implementation
append_pointer (ptr: POINTER)
do
append_string_general (ptr.out)
end
append_raw_character_8 (c: CHARACTER)
do
append_character (c)
end
append_string_32 (str: READABLE_STRING_32)
do
append_string_general_for_type (str, string_storage_type (str))
end
append_string_8 (str: READABLE_STRING_8)
do
append_string_general_for_type (str, '1')
end
current_zstring: ZSTRING
do
Result := Current
end
current_writable: EL_WRITABLE
do
Result := Current
end
empty_escape_table: like Once_escape_table
do
Result := Once_escape_table
Result.wipe_out
end
sum_count (cursor: ITERATION_CURSOR [READABLE_STRING_GENERAL]): INTEGER
do
from until cursor.after loop
Result := Result + cursor.item.count
cursor.forth
end
end
write_path (path: EL_PATH)
do
path.append_to (Current)
end
write_path_steps (steps: EL_PATH_STEPS)
do
steps.append_to (Current)
end
feature {NONE} -- Constants
Once_escape_table: EL_HASH_TABLE [NATURAL, NATURAL]
once
create Result.make (5)
end
note
notes: "[
**DEFAULT CODEC**
By default `area' characters are encoded using the codec ${EL_SHARED_ZCODEC_FACTORY}.default_codec. By
default this is ISO-8859-15 but can be set for the application using the command line option:
-zstring_codec [name]
Example:
-zstring_codec WINDOWS-1258
See class constant ${EL_ZCODEC_FACTORY}.Codec_option_name.
**FEATURES**
**ZSTRING** has many useful routines not found in ${STRING_32}. Probably the most useful is Python style templates
using the `#$' as an alias for `substituted_tuple', and place holders indicated by `%S', which by coincidence is
both an Eiffel escape sequence and a Python one (Python is actually `%s').
**ARTICLES**
There is a detailed article about this class here: [https://www.eiffel.org/blog/finnianr/introducing_class_zstring]
**BENCHMARKS**
Comparison of **ZSTRING** against ${STRING_32} for basic string operations
* [./benchmark/ZSTRING-benchmarks-latin-1.html Latin-1 base encoding]
* [./benchmark/ZSTRING-benchmarks-latin-15.html Latin-15 base encoding]
**CAVEAT**
There is a caveat attached to using **ZSTRING** which is that if your application uses very many characters outside of
the ISO-8859-15 character-set, the execution efficiency does down substantially.
This is something to consider if your application is going to be used in for example: Russia or Japan.
If the user locale is for a language that is supported by a ISO-8859-x what you can do is over-ride
${EL_SHARED_ZSTRING_CODEC}.Default_codec and initialize it immediately after application launch.
This will force **ZSTRING** to switch to a more optimal character-set for the user-locale.
The execution performance will be slow for Asian characters, but a planned solution is to make a swappable alternative
implementation that works equally well with Asian character sets and non-Western European sets.
The version used can be set by changing an ECF variable or perhaps the build target.
There two ways to go about achieving an efficient implementation for Asian character sets:
1. Change the implementation to inherit from ${STRING_32}. This will be the fastest and easiest to implement.
2. Change the area array to type: ${SPECIAL [NATURAL_16]} and then the same basic algorithm can be applied to Asian characters.
The problem is ${NATURAL_16} is not a character and there is no **CHARACTER_16**, so it will entail a lot of changes.
The upside is that there will still be a substantial memory saving.
**THE CODE FUNCTION**
There is an important difference between how ${ZSTRING} implements the function
${READABLE_STRING_GENERAL}.code and how ${STRING_32} implements it. For the most
part ${ZSTRING} will return unicode, but a small subset of characters will be different.
This is why **code** has been rename as **z_code**. To get true unicode use the function
${ZSTRING}.unicode
The exception to this rule is if ${ZSTRING}.default_codec has been changed to return an instance of
${EL_ISO_8859_1_ZCODEC} instead of the default {EL_ISO_8859_15_ZCODEC}. This can be done using the command
line option `-zstring_codec'.
]"
end