11 thoughts on “Twitter Data Analysis Using Hadoop Flume”

  • jesus

    Obj    avro.schema?
    {"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}???[/D????O???$598253775116050432|Gina Paola gina_paola25(2015-05-12T22:29:49Z*@xanatbravo holaaa <3????    <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>    ???[/D????O??
    Obj    avro.schema?
    {"type":"record","name":"Doc","doc":"adoc","fields":[
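
    The paste above is the raw content of an Avro container file: the "Obj" magic bytes, the embedded avro.schema metadata, and then binary-encoded records (which is why it looks garbled). To read the records as JSON, a minimal sketch with avro-tools (the HDFS path, the FlumeData file name, and the jar version are assumptions):

    # copy one Flume-written Avro file out of HDFS (path and name assumed)
    hdfs dfs -get /user/flume/tweets/FlumeData.1431469789 .
    # dump its records as JSON instead of viewing the raw binary
    java -jar avro-tools-1.7.7.jar tojson FlumeData.1431469789 | head -n 5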

     

  • Rakesh Gupta

    Thanks for writing such a nice post. I have done the same steps, but when querying the Hive table created using the .avsc file, I get the below exception:

    hive> select count(1) as num_rows from TwitterData_09062015;

    Query ID = root_20150609125050_617a7511-b253-4327-a475-7e6846eac78a
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1433857038961_0002, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1433857038961_0002/
    Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job -kill job_1433857038961_0002
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2015-06-09 12:50:12,834 Stage-1 map = 0%, reduce = 0%
    2015-06-09 12:50:45,553 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1433857038961_0002 with errors
    Error during job, obtaining debugging information…
    Examining task ID: task_1433857038961_0002_m_000000 (and more) from job job_1433857038961_0002

    Task with the most failures(4):
    -----
    Task ID:
    task_1433857038961_0002_m_000000

    URL:
    http://sandbox.hortonworks.com:8088/taskdetails.jsp?jobid=job_1433857038961_0002&tipid=task_1433857038961_0002_m_000000
    -----
    Diagnostic Messages for this Task:
    Error: java.io.IOException: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:273)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:183)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:271)
    … 11 more
    Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
    at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
    at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
    … 15 more
    Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
    at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
    … 19 more
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
    Total MapReduce CPU Time Spent: 0 msec

     

    I searched for the error and could not find much. Any help will be highly appreciated.

    Thanks

    Rakesh

  • Karl

    Creation of Flume Agent conf.

    How can I get the right tweets? The keyword filter is not working.

    How can I get only the right language, for example “en” instead of several languages?

     

    It would be great if you could help.

     

    Thanks

    Karl

    • Siva (post author)

      The default Flume Twitter source doesn't support this kind of filtering. You need to write a custom Twitter Flume source for this, or try using the Cloudera Twitter source instead of the Apache Flume Twitter source.
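
      For reference, a minimal sketch of what keyword filtering looks like with the Cloudera Twitter source from the cdh-twitter-example project; the agent, source, and channel names (TwitterAgent, Twitter, MemChannel) and the property names are assumptions to verify against that project:

      # Cloudera example source, which supports a keyword filter
      TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
      TwitterAgent.sources.Twitter.channels = MemChannel
      # only tweets matching these terms are collected
      TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume

      As far as I know, filtering to a single language such as “en” is not exposed by either source, so that still needs a custom source or filtering downstream in Hive.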

  • Arun

    Excellent post. It helped me to resolve the issue “Block size invalid or too large for this implementation”. I changed the properties as mentioned in your post:
    TwitterAgent.sources.Twitter.maxBatchSize = 50000
    TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000

    and now I am able to view the Twitter data in my Hive tables. Thanks a ton. Arun
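
    As a related sketch (an assumption, not something stated in the post): the HDFS sink's roll settings are often tuned together with the source batch size so Flume closes each Avro file cleanly before Hive reads it. The sink name HDFS and the values are placeholders:

    # roll files on a time/size boundary instead of leaving many tiny, partially written ones
    TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000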

  • abhishek singh tomar

    Hello sir,

    I am facing a problem during the extraction of tweets. It gives me an error like this:

    ERROR 401 authentication error (check whether the keys are correct and the time is in sync), but my time is already correct and so are my keys…

    Please help me to resolve this problem.

    I have been facing this problem for three days.
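
    In case it helps, a 401 from the streaming API usually comes down to the four OAuth properties or clock skew. A minimal sketch of the credential block the Apache Flume Twitter source expects (values are placeholders; the agent/source names follow the post):

    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    # all four values must come from the same Twitter app and be copied exactly
    TwitterAgent.sources.Twitter.consumerKey = <consumer key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
    TwitterAgent.sources.Twitter.accessToken = <access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>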

  • HM

    Issues with starting the Flume agent

    I get the following error:

    /usr/local/flume/conf$ flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/twitter-agent.conf --name TwitterAgent
    Info: Including Hadoop libraries found via (/usr/local/lib/hadoop-2.7.0/bin/hadoop) for HDFS access
    Info: Including Hive libraries found via () for Hive access
    + exec /usr/lib/jvm/java-8-oracle/bin/java -Xmx20m -cp '/usr/local/flume/conf:/usr/local/flume/lib/*:/usr/local/lib/hadoop-2.7.0/etc/hadoop:/usr/local/lib/hadoop-2.7.0/share/hadoop/common/lib/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/common/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/hdfs:/usr/local/lib/hadoop-2.7.0/share/hadoop/hdfs/lib/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/hdfs/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/yarn/lib/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/yarn/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/mapreduce/lib/*:/usr/local/lib/hadoop-2.7.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/lib/*' -Djava.library.path=:/usr/local/lib/hadoop-2.7.0/lib/native org.apache.flume.node.Application --conf-file /usr/local/flume/conf/twitter-agent.conf --name TwitterAgent
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/flume/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/lib/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

    Exception in thread "Twitter4J Async Dispatcher[0]" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
    at org.apache.flume.source.twitter.TwitterSource.serializeToAvro(TwitterSource.java:275)
    at org.apache.flume.source.twitter.TwitterSource.onStatus(TwitterSource.java:162)
    at twitter4j.StatusStreamImpl.onStatus(StatusStreamImpl.java:75)
    at twitter4j.StatusStreamBase$1.run(StatusStreamBase.java:114)
    at twitter4j.internal.async.ExecuteThread.run(DispatcherImpl.java:116)
    Exception in thread "Twitter Stream consumer-1[Receiving stream]" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.StringBuffer.toString(StringBuffer.java:669)
    at java.io.BufferedReader.readLine(BufferedReader.java:359)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at twitter4j.StatusStreamBase.handleNextElement(StatusStreamBase.java:85)
    at twitter4j.StatusStreamImpl.next(StatusStreamImpl.java:57)
    at twitter4j.TwitterStreamImpl$TwitterStreamConsumer.run(TwitterStreamImpl.java:478)
    Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.OutOfMemoryError: Java heap space
    at java.io.BufferedReader.<init>(BufferedReader.java:105)
    at java.io.BufferedReader.<init>(BufferedReader.java:116)
    at java.io.LineNumberReader.<init>(LineNumberReader.java:72)
    at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:64)
    at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
    at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
    at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
    at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
    at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
    at org.apache.log4j.Category.callAppenders(Category.java:206)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)
    at java.lang.Thread.run(Thread.java:745)
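
    The exec line above shows the JVM starting with -Xmx20m, the flume-ng default when no JAVA_OPTS is set, so running out of heap on a live Twitter stream is expected. A sketch of raising the heap via conf/flume-env.sh (the 512m value is an assumption; tune it to the machine):

    # /usr/local/flume/conf/flume-env.sh
    # give the agent enough heap for the Twitter4J stream and Avro serialization
    export JAVA_OPTS="-Xmx512m"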

