Describe the bug
The official documentation says the following about the encoding parameters of the tail plugin:
encoding, from_encoding
| type | default | version |
| string | nil (string encoding is ASCII-8BIT) | 0.14.0 |
Specifies the encoding of reading lines.
By default, in_tail emits string value as ASCII-8BIT encoding.
These options change it:
- If encoding is specified, in_tail changes string to encoding. This uses Ruby's String#force_encoding.
- If encoding and from_encoding both are specified, in_tail tries to encode string from from_encoding to encoding. This uses Ruby's String#encode.
source: tail#encoding-from_encoding
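To make the distinction in the quoted documentation concrete, here is a minimal Ruby sketch of what the two String methods do. It is not taken from the Fluentd source; the variable names and the ISO-8859-1 example are only illustrative assumptions.

```ruby
# Bytes as a reader would hold them when nothing is configured:
# tagged ASCII-8BIT (binary). "\u00FC" is the character ü.
raw = "Z\u00FCrich".b

# String#force_encoding only relabels the string; the bytes are untouched.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)

# String#encode transcodes: the bytes change to represent the same characters.
latin1     = "Z\u00FCrich".encode(Encoding::ISO_8859_1)
transcoded = latin1.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

puts relabeled.encoding          # UTF-8
puts relabeled.valid_encoding?   # true
puts raw.bytes == relabeled.bytes # true
puts latin1.bytesize             # 6
puts transcoded.bytesize         # 7
```

In other words, force_encoding is the right tool when the bytes are already in the target encoding, and encode is the right tool when they really are in a different one.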
I have been checking Fluentd source code and:
- Regarding the first bullet:
  I think the encoding parameter is not used the way the documentation describes. I cannot find any call to String#force_encoding that uses the encoding parameter. On the other hand, I found String#force_encoding called with the from_encoding parameter in a few places.
  I think line 992 might be wrong:
  https://github.com/fluent/fluentd/blob/74db9477f445ef83384eca6da8d6c2049945d8cd/lib/fluent/plugin/in_tail.rb#L992
  If the documentation is correct, String#force_encoding should use the encoding value, not the from_encoding value.
- Regarding the second bullet:
  The documentation states that String#encode is used when the from_encoding parameter is set, but it seems String#encode is used whenever you set the encoding parameter to something other than ASCII-8BIT, because from_encoding defaults to ASCII-8BIT. For example, String#encode is used if you set the encoding parameter to UTF-8, whereas according to the documentation String#force_encoding, not String#encode, should be used when only the encoding parameter is set.
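The behaviour described in the second bullet can be reproduced in plain Ruby. The sketch below assumes, as described above, that the line is treated as ASCII-8BIT and then passed through String#encode to UTF-8. The replacement fallback (invalid/undef replaced) is an assumption chosen because it matches the two replacement characters seen in the error log; it is not quoted from the Fluentd source.

```ruby
# UTF-8 bytes of "Zürich", tagged ASCII-8BIT as if from_encoding were unset.
raw = "Z\u00FCrich".b

# A strict transcode from binary to UTF-8 fails on the first non-ASCII byte.
strict_error = begin
  raw.encode(Encoding::UTF_8, Encoding::ASCII_8BIT)
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Assumed fallback: replace every byte that cannot be converted.
# Each of the two bytes of ü becomes U+FFFD.
replaced = raw.encode(Encoding::UTF_8, Encoding::ASCII_8BIT,
                      invalid: :replace, undef: :replace)

# What the documentation describes for "encoding only": relabel the bytes.
forced = raw.dup.force_encoding(Encoding::UTF_8)

puts strict_error                      # Encoding::UndefinedConversionError
puts replaced == "Z\uFFFD\uFFFDrich"   # true
puts forced == "Z\u00FCrich"           # true
```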
To Reproduce
Start a Fluentd container with the grok parser plugin (fluent-plugin-grok-parser) installed.
Then run the command:
td-agent --config /home/td-agent/fluentd.conf
Expected behavior
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
Your Environment
- Fluentd version: 1.11.2
- TD Agent version: 1.11.2
- Operating system: Alma Linux 9
- Kernel version: Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64
Your Configuration
# /home/td-agent/patterns.conf
CUSTOM_LOG_WORKS %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}
# HTTPDATE has ä character
# Source: https://github.com/fluent/fluent-plugin-grok-parser/blob/903dfe222984b90c4e1c1151530038d1f242157d/patterns/legacy/grok-patterns#L51
CUSTOM_LOG_FAILS %{HTTPDATE:timestamp} %{NUMBER:response}
# /tmp/encoding-test.log
2023-11-22 18:18:09.823+0100 Testing Zürich
2023-11-22 18:18:09.823+0100 Testing Geneva
# /home/td-agent/fluentd.conf
<source>
@type tail
path /tmp/encoding-test.log
read_from_head true
encoding UTF-8
tag encoding
<parse>
@type grok
grok_failure_key grokfailure
custom_pattern_path /home/td-agent/patterns.conf
<grok>
pattern %{CUSTOM_LOG_FAILS:message}
</grok>
<grok>
pattern %{CUSTOM_LOG_WORKS:message}
</grok>
</parse>
</source>
<match encoding>
@type stdout
</match>
Your Error Log
[td-agent@dc60c1c5967e ~]$ /opt/td-agent/bin/fluentd --config /home/td-agent/fluentd.conf
2023-11-23 15:58:45 +0100 [info]: parsing config file is succeeded path="/home/td-agent/fluentd.conf"
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.1.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-kafka' version '0.14.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus' version '1.8.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-td' version '1.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-webhdfs' version '1.2.5'
2023-11-23 15:58:45 +0100 [info]: gem 'fluentd' version '1.11.2'
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:45 +0100 [warn]: 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:45 +0100 [warn]: this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:45 +0100 [info]: using configuration file: <ROOT>
<source>
@type tail
path "/tmp/encoding-test.log"
tag "encoding"
read_from_head true
encoding "UTF-8"
<parse>
@type "grok"
grok_failure_key "grokfailure"
custom_pattern_path "/home/td-agent/patterns.conf"
unmatched_lines
<grok>
pattern "%{CUSTOM_LOG_FAILS:message}"
</grok>
<grok>
pattern "%{CUSTOM_LOG_WORKS:message}"
</grok>
</parse>
</source>
<match encoding>
@type stdout
</match>
</ROOT>
2023-11-23 15:58:45 +0100 [info]: starting fluentd-1.11.2 pid=715 ruby="2.7.1"
2023-11-23 15:58:45 +0100 [info]: spawn command to main: cmdline=["/opt/td-agent/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/bin/fluentd", "--config", "/home/td-agent/fluentd.conf", "--under-supervisor"]
2023-11-23 15:58:45 +0100 [info]: adding match pattern="encoding" type="stdout"
2023-11-23 15:58:45 +0100 [info]: adding source type="tail"
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:46 +0100 [warn]: #0 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:46 +0100 [warn]: #0 this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:46 +0100 [info]: #0 starting fluentd worker pid=720 ppid=715 worker=0
2023-11-23 15:58:46 +0100 [info]: #0 following tail of /tmp/encoding-test.log
2023-11-23 15:58:46.005131856 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005146527 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005152826 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005157747 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46 +0100 [info]: #0 fluentd worker is now running worker=0
Additional details
If I set both encoding parameters to UTF-8, I get a warning in the Fluentd logs, but the special characters are rendered correctly.
I don't know whether this is the proper way to handle special characters, since it produces a warning. Shouldn't this warning be changed to info?
Configuration
@type tail
path "/tmp/encoding-test.log"
tag "encoding"
read_from_head true
from_encoding "UTF-8"
encoding "UTF-8"
Warning
2023-11-23 14:44:12 +0100 [warn]: #0 fluent/log.rb:348:warn: 'encoding' and 'from_encoding' are same encoding. No effect
Output
2023-11-23 14:44:12.044957269 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 14:44:12.044962081 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
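A possible explanation for why setting both parameters to UTF-8 works is sketched below. model_convert is a hypothetical helper that models the behaviour as I read it (relabel with from_encoding, transcode only when the two encodings differ); it is an assumption, not Fluentd's actual code.

```ruby
# Hypothetical model of the line conversion, NOT Fluentd's implementation.
def model_convert(bytes, from_encoding: nil, encoding: nil)
  from = from_encoding || Encoding::ASCII_8BIT  # assumed default
  to   = encoding || Encoding::ASCII_8BIT       # assumed default
  line = bytes.dup.force_encoding(from)         # relabel with from_encoding
  return line if from == to                     # same encodings: no transcoding
  line.encode(to, from, invalid: :replace, undef: :replace)
end

bytes = "Z\u00FCrich".b

# encoding UTF-8 only: bytes are transcoded from binary and ü is lost.
only_encoding = model_convert(bytes, encoding: Encoding::UTF_8)

# from_encoding UTF-8 and encoding UTF-8: bytes are only relabeled.
both_utf8 = model_convert(bytes, from_encoding: Encoding::UTF_8,
                                 encoding: Encoding::UTF_8)

puts only_encoding == "Z\uFFFD\uFFFDrich"  # true
puts both_utf8 == "Z\u00FCrich"            # true
```

Under this model the "No effect" warning is misleading: the pair of settings has the effect of labelling the line as UTF-8, which is what the documentation promises for encoding alone.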
Documentation not clear or wrong
Alternatively, Fluentd may be working as intended, and the documentation is either unclear or wrong.