Why does Hadoop not see the `python` interpreter?
I run Hadoop in Docker and attach to the container with
docker exec -it namenode /bin/bash
I copied the file mapper.py —
#!/usr/bin/env python
"""mapper.py"""
import sys

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Increase counters
    for word in words:
        # Write the results to standard output (stdout)
        print(f'{word}\t1')
and reducer.py —
#!/usr/bin/env python
"""reducer.py"""
import sys

current_word = None
current_count = 0
word = None

# Input comes from standard input
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # Convert count (currently a string) to an integer
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f'{current_word}\t{current_count}')
        current_count = count
        current_word = word

if current_word == word:
    print(f'{current_word}\t{current_count}')
into the /tmp directory of the namenode container's local file system.
The input file input.txt (already uploaded to HDFS) contains:
hello world
hello friend
hello mom
hello father
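For reference, here is a minimal plain-Python sketch (run outside Hadoop; the helper names are mine, and the reducer is a simplified dict-based stand-in) of what the mapper | sort | reducer pipeline should produce for this input:

```python
SAMPLE = "hello world\nhello friend\nhello mom\nhello father\n"

def run_mapper(text):
    # Emit one "<word>\t1" pair per word, mirroring mapper.py
    return [f"{w}\t1" for line in text.splitlines() for w in line.split()]

def run_reducer(pairs):
    # Sum counts per word; a simplified equivalent of reducer.py's
    # consecutive-key logic (valid because the input is sorted)
    counts = {}
    for pair in pairs:
        word, count = pair.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(count)
    return counts

# sorted() plays the role of the shuffle/sort phase between map and reduce
result = run_reducer(sorted(run_mapper(SAMPLE)))
print(result)  # {'father': 1, 'friend': 1, 'hello': 4, 'mom': 1, 'world': 1}
```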
But any attempt to apply the mapper to input.txt fails.
When I try to launch Hadoop Streaming with the command
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -file /tmp/mapper.py -mapper mapper.py -file /tmp/reducer.py -reducer reducer.py -input /user/hduser/input.txt -output /user/hduser/output
the following error is printed:
root@afa15008d868:/# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -file /tmp/mapper.py -mapper mapper.py -file /tmp/reducer.py -reducer reducer.py -input /user/hduser/input.txt -output /user/hduser/output
2024-10-11 11:29:50,843 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/mapper.py, /tmp/reducer.py, /tmp/hadoop-unjar2358119599702684140/] [] /tmp/streamjob6545005124994330395.jar tmpDir=null
2024-10-11 11:29:51,694 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.18.0.3:8032
2024-10-11 11:29:51,923 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.18.0.6:10200
2024-10-11 11:29:51,950 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.18.0.3:8032
2024-10-11 11:29:51,950 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.18.0.6:10200
2024-10-11 11:29:52,116 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1728639985711_0003
2024-10-11 11:29:52,204 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-10-11 11:29:52,296 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-10-11 11:29:52,324 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-10-11 11:29:52,385 INFO mapred.FileInputFormat: Total input files to process : 1
2024-10-11 11:29:52,418 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-10-11 11:29:52,447 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-10-11 11:29:52,463 INFO mapreduce.JobSubmitter: number of splits:2
2024-10-11 11:29:52,615 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-10-11 11:29:52,630 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1728639985711_0003
2024-10-11 11:29:52,630 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-10-11 11:29:52,785 INFO conf.Configuration: resource-types.xml not found
2024-10-11 11:29:52,786 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-10-11 11:29:53,043 INFO impl.YarnClientImpl: Submitted application application_1728639985711_0003
2024-10-11 11:29:53,077 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1728639985711_0003/
2024-10-11 11:29:53,079 INFO mapreduce.Job: Running job: job_1728639985711_0003
2024-10-11 11:29:58,137 INFO mapreduce.Job: Job job_1728639985711_0003 running in uber mode : false
2024-10-11 11:29:58,138 INFO mapreduce.Job: map 0% reduce 0%
2024-10-11 11:30:02,180 INFO mapreduce.Job: Task Id : attempt_1728639985711_0003_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:326)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:539)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
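As a side note, the log above warns that -file is deprecated; if I read the warning correctly, the equivalent invocation with the generic -files option (same paths, passed as one comma-separated list before the streaming options) would be:

```shell
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
  -files /tmp/mapper.py,/tmp/reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /user/hduser/input.txt \
  -output /user/hduser/output
```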
Host operating system: Windows 10.
The line endings in the scripts are set to LF.