Hadoop: RecordReader and FileInputFormat

Today’s new challenge…
I want to create a custom MapReduce job that can handle more than a single line at a time. It actually took me some time to understand the implementation of the default LineRecordReader class, not because of the code itself versus my Java skill set, but rather because I was not familiar with its concept. In this article I describe my understanding of this implementation.

Since an InputSplit is nothing more than a chunk of one or several blocks, it should be pretty rare for a block boundary to land exactly on an end of line (EOL). Some of my records located around block boundaries will therefore be split across two different blocks. This raises the following questions:

  1. How can Hadoop guarantee that the lines read are 100% complete?
  2. How can Hadoop consolidate a line that starts in block B and ends in block B+1?
  3. How can Hadoop guarantee that we do not miss any line?
  4. Is there a limit on a line's size? Can a line be larger than a block (i.e. span more than 2 blocks)? If so, are there any consequences in terms of MapReduce performance?

Definitions

InputFormat

Definition taken from MapReduce Design Patterns (O'Reilly):

Hadoop relies on the input format of the job to do three things:
1. Validate the input configuration for the job (i.e., checking that the data is there).
2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing.
3. Create the RecordReader implementation to be used to create key/value pairs from the raw InputSplit. These pairs are sent one by one to their mapper.

RecordReader

Definition taken from MapReduce Design Patterns (O'Reilly):

A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the "start" is the byte position in the file where the RecordReader should start generating key/value pairs. The "end" is where it should stop reading records. These are not hard boundaries as far as the API is concerned; there is nothing stopping a developer from reading the entire file for each map task. While reading the entire file is not advised, reading outside of the boundaries is often necessary to ensure that a complete record is generated.
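
For reference, here is roughly what these two contracts look like in the new API (org.apache.hadoop.mapreduce). This is a simplified sketch of the abstract classes, with imports and some details omitted, not the full Javadoc:

public abstract class InputFormat<K, V> {

	public abstract List<InputSplit> getSplits(JobContext context)
			throws IOException, InterruptedException;

	public abstract RecordReader<K, V> createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException, InterruptedException;
}

public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

	public abstract void initialize(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException;

	public abstract boolean nextKeyValue() throws IOException, InterruptedException;

	public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;

	public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;

	public abstract float getProgress() throws IOException, InterruptedException;

	public abstract void close() throws IOException;
}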

Example

I jumped right into the code of LineRecordReader and found it not that obvious to understand. Let's go through an example first; it will hopefully make the code slightly more readable.
Suppose my data set is composed of a single 300MB file, spread over 3 different blocks (blocks of 128MB), and suppose that I have been able to get 1 InputSplit for each block. Let's now imagine 3 different scenarios.

File composed of 6 lines of 50MB each

[Figure InputSplit1: block and split layout for this scenario]

  • The first Reader will start reading bytes from Block B1 at position 0. The first two EOLs will be met at 50MB and 100MB respectively: 2 lines (L1 and L2) will be read and sent as key/value pairs to the Mapper 1 instance. Then, starting from byte 100MB, we will reach the end of our split (128MB) before having found the third EOL. This incomplete line will be completed by reading the bytes in Block B2 up to position 150MB. The first part of line L3 will be read locally from Block B1, the second part will be read remotely from Block B2 (by means of FSDataInputStream), and the complete record will finally be sent as a key/value pair to Mapper 1.
  • The second Reader starts on Block B2 at position 128MB. Since 128MB is not the start of the file, there is a strong chance that our pointer is located somewhere inside an existing record that has already been processed by the previous Reader. We need to skip this record by jumping to the next available EOL, found at position 150MB. The actual start of RecordReader 2 will therefore be 150MB instead of 128MB.

We can wonder what happens in case a block starts exactly on an EOL. By jumping to the next available record (through the readLine method), we might miss one record. Before jumping to the next EOL, we actually need to decrement the initial "start" value to "start - 1". Being located at least 1 byte before the EOL, we ensure that no record is skipped!

The remaining readers follow the same logic, and everything is summarized in the table below.
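
To make it concrete, here is my own quick trace of Readers 2 and 3 in this first scenario (derived from the logic above, not taken from the original table):

  • Reader 2 (split B2, 128MB to 256MB): since its start is not 0, it seeks to 128MB - 1, discards everything up to the EOL at 150MB, then reads L4 (150MB to 200MB), L5 (200MB to 250MB) and L6 (250MB to 300MB, the tail being read remotely from B3). It then stops, because its position (300MB) is beyond its end (256MB).
  • Reader 3 (split B3, 256MB to 300MB): it skips ahead to the next EOL, which sits at 300MB, i.e. exactly at its end. It therefore emits no key/value pair at all, just like Reader 2 in the third scenario further down.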

[Table InputSplit_meta1: summary of the records processed by each reader in this scenario]

File composed of 2 lines of 150MB each

[Figure InputSplit2: block and split layout for this scenario]

Same process as before:

  • Reader 1 will start reading from Block B1 at position 0. It will read line L1 locally until the end of its split (128MB), and will then continue reading remotely on B2 until the EOL (150MB).
  • Reader 2 will not start reading from 128MB, but from 150MB, and will read until B3:300.

[Table InputSplit_meta2: summary of the records processed by each reader in this scenario]

File composed of 2 lines of 300MB each

OK, this one is a tricky and perhaps unrealistic example, but I was wondering what happens when a record is larger than 2 blocks (i.e. spans at least 3 blocks).

[Figure InputSplit5: block and split layout for this scenario]

  • Reader 1 will start reading locally from B1:0 until B1:128, then remotely read all the bytes available on B2, and finally remotely on B3 until the EOL is reached (300MB). There is some overhead here, as we are reading a lot of data that is not locally available.
  • Reader 2 will start reading from B2:128 and will jump to the next available record, located at B3:300. Its new start position (B3:300) is actually greater than its maximum position (B2:256). This reader will therefore not provide Mapper 2 with any key/value pair. I understand it somehow as a kind of safety feature ensuring that data locality (which makes Hadoop so efficient at data processing) is preserved (i.e. do not process a line that does not start in the chunk I am responsible for).
  • Reader 3 will start reading from B3:300 to B5:600.

This is summarized in the table below.

[Table InputSplit_meta5: summary of the records processed by each reader in this scenario]

Maximum size for a single record

There is a maximum size allowed for a single record to be processed. This value can be set using the parameter below.

conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

A line larger than this maximum value (the default is 2,147,483,647 bytes) will be ignored.
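
Note that on more recent Hadoop versions (2.x and later) the same setting is exposed as mapreduce.input.linerecordreader.line.maxlength, with the old mapred.* name kept as a deprecated alias, if I am not mistaken. A minimal sketch, assuming you want to cap records at 10MB:

conf.setInt("mapreduce.input.linerecordreader.line.maxlength", 10 * 1024 * 1024);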

I hope these 3 examples give you a high-level understanding of RecordReader and InputFormat. If so, let's jump to the code; otherwise, let me know.

I doubt a single record would be hundreds of MB large (300MB in my example) in a real environment... With hundreds of KB for a single record, the overhead due to a line spanning different blocks should not be that significant, and overall performance should not really be affected.

Implementation

RecordReader

I added some (a ton of) comments in the code in order to point out what was said in the example section. Hopefully this makes it slightly clearer. A new Reader must extend the RecordReader class and override several methods.


import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class CustomLineRecordReader 
	extends RecordReader<LongWritable, Text> {

	private long start;
	private long pos;
	private long end;
	private LineReader in;
	private int maxLineLength;
	private LongWritable key = new LongWritable();
	private Text value = new Text();

	private static final Log LOG = LogFactory.getLog(
			CustomLineRecordReader.class);

	/**
	 * From Design Pattern, O'Reilly...
	 * This method takes as arguments the map task’s assigned InputSplit and
	 * TaskAttemptContext, and prepares the record reader. For file-based input
	 * formats, this is a good place to seek to the byte position in the file to
	 * begin reading.
	 */
	@Override
	public void initialize(
			InputSplit genericSplit, 
			TaskAttemptContext context)
			throws IOException {

		// This InputSplit is a FileInputSplit
		FileSplit split = (FileSplit) genericSplit;

		// Retrieve configuration, and Max allowed
		// bytes for a single record
		Configuration job = context.getConfiguration();
		this.maxLineLength = job.getInt(
				"mapred.linerecordreader.maxlength",
				Integer.MAX_VALUE);

		// Split "S" is responsible for all records
		// starting from "start" and "end" positions
		start = split.getStart();
		end = start + split.getLength();

		// Retrieve file containing Split "S"
		final Path file = split.getPath();
		FileSystem fs = file.getFileSystem(job);
		FSDataInputStream fileIn = fs.open(split.getPath());

		// If Split "S" starts at byte 0, first line will be processed
		// If Split "S" does not start at byte 0, first line has been already
		// processed by "S-1" and therefore needs to be silently ignored
		boolean skipFirstLine = false;
		if (start != 0) {
			skipFirstLine = true;
			// Set the file pointer at "start - 1" position.
			// This is to make sure we won't miss any line
			// It could happen if "start" is located on a EOL
			--start;
			fileIn.seek(start);
		}

		in = new LineReader(fileIn, job);

		// If first line needs to be skipped, read first line
		// and stores its content to a dummy Text
		if (skipFirstLine) {
			Text dummy = new Text();
			// Reset "start" to "start + line offset"
			start += in.readLine(dummy, 0,
					(int) Math.min(
							(long) Integer.MAX_VALUE, 
							end - start));
		}

		// Position is the actual start
		this.pos = start;

	}

	/**
	 * From Design Pattern, O'Reilly...
	 * Like the corresponding method of the InputFormat class, this reads a
	 * single key/ value pair and returns true until the data is consumed.
	 */
	@Override
	public boolean nextKeyValue() throws IOException {

		// Current offset is the key
		key.set(pos);

		int newSize = 0;

		// Make sure we get at least one record that starts in this Split
		while (pos < end) {

			// Read first line and store its content to "value"
			newSize = in.readLine(value, maxLineLength,
					Math.max((int) Math.min(
							Integer.MAX_VALUE, end - pos),
							maxLineLength));

			// No byte read, seems that we reached end of Split
			// Break and return false (no key / value)
			if (newSize == 0) {
				break;
			}

			// Line is read, new position is set
			pos += newSize;

			// Line is lower than Maximum record line size
			// break and return true (found key / value)
			if (newSize < maxLineLength) {
				break;
			}

			// Line is too long
			// Try again with position = position + line offset,
			// i.e. ignore line and go to next one
			// TODO: Shouldn't it be LOG.error instead ??
			LOG.info("Skipped line of size " + 
					newSize + " at pos "
					+ (pos - newSize));
		}

		
		if (newSize == 0) {
			// We've reached end of Split
			key = null;
			value = null;
			return false;
		} else {
			// Tell Hadoop a new line has been found
			// key / value will be retrieved by
			// getCurrentKey getCurrentValue methods
			return true;
		}
	}

	/**
	 * From Design Pattern, O'Reilly...
	 * This methods are used by the framework to give generated key/value pairs
	 * to an implementation of Mapper. Be sure to reuse the objects returned by
	 * these methods if at all possible!
	 */
	@Override
	public LongWritable getCurrentKey() throws IOException,
			InterruptedException {
		return key;
	}

	/**
	 * From Design Pattern, O'Reilly...
	 * This methods are used by the framework to give generated key/value pairs
	 * to an implementation of Mapper. Be sure to reuse the objects returned by
	 * these methods if at all possible!
	 */
	@Override
	public Text getCurrentValue() throws IOException, InterruptedException {
		return value;
	}

	/**
	 * From Design Pattern, O'Reilly...
	 * Like the corresponding method of the InputFormat class, this is an
	 * optional method used by the framework for metrics gathering.
	 */
	@Override
	public float getProgress() throws IOException, InterruptedException {
		if (start == end) {
			return 0.0f;
		} else {
			return Math.min(1.0f, (pos - start) / (float) (end - start));
		}
	}

	/**
	 * From Design Pattern, O'Reilly...
	 * This method is used by the framework for cleanup after there are no more
	 * key/value pairs to process.
	 */
	@Override
	public void close() throws IOException {
		if (in != null) {
			in.close();
		}
	}

}
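
If you want to quickly exercise the reader outside of a real MapReduce job, a small local harness along these lines should do. This is a sketch only, assuming Hadoop 2.x (where TaskAttemptContextImpl lives in org.apache.hadoop.mapreduce.task); the sample file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class CustomLineRecordReaderTest {

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();

		// Hypothetical local sample file
		Path path = new Path("file:///tmp/sample.txt");
		long length = FileSystem.getLocal(conf).getFileStatus(path).getLen();

		// A single split covering the whole file
		FileSplit split = new FileSplit(path, 0, length, new String[0]);

		// Drive the reader the same way the framework would
		CustomLineRecordReader reader = new CustomLineRecordReader();
		reader.initialize(split, new TaskAttemptContextImpl(conf, new TaskAttemptID()));
		while (reader.nextKeyValue()) {
			System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
		}
		reader.close();
	}
}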

FileInputFormat

Now that you have created a custom Reader, you need to use it from a class extending FileInputFormat, as shown below …


import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CustomFileInputFormat extends FileInputFormat<LongWritable, Text> {

	@Override
	public RecordReader<LongWritable, Text> createRecordReader(
			InputSplit split, TaskAttemptContext context) throws IOException,
			InterruptedException {
		return new CustomLineRecordReader();
	}
}

MapReduce

… and then use this new CustomFileInputFormat in your MapReduce driver code when specifying the input format.

.../...
FileInputFormat.addInputPath(job, inputPath);
job.setInputFormatClass(CustomFileInputFormat.class);
.../...
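
For completeness, here is what a minimal driver could look like end to end. This is a sketch only: the class names CustomInputFormatDriver and MyMapper and the /input and /output paths are hypothetical placeholders, and Job.getInstance assumes Hadoop 2.x (on 1.x you would use new Job(conf) instead).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomInputFormatDriver {

	// Identity-style mapper, only here to make the sketch complete
	public static class MyMapper
			extends Mapper<LongWritable, Text, LongWritable, Text> {
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			context.write(key, value);
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "custom input format example");
		job.setJarByClass(CustomInputFormatDriver.class);
		job.setMapperClass(MyMapper.class);
		job.setNumReduceTasks(0);
		job.setOutputKeyClass(LongWritable.class);
		job.setOutputValueClass(Text.class);
		job.setInputFormatClass(CustomFileInputFormat.class);
		FileInputFormat.addInputPath(job, new Path("/input"));    // hypothetical path
		FileOutputFormat.setOutputPath(job, new Path("/output")); // hypothetical path
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}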

Congratulations: if you followed this article, you have just re-invented the wheel. We did not do anything more than re-implement LineRecordReader and FileInputFormat, the default implementations for text files. However, I hope you now understand a bit better how these two classes work, allowing you to create your own custom Reader and therefore handle specific file formats.

I hope you liked this article, that it was not too high-level, and therefore not a waste of time.
Should you have any questions, remarks or suggestions, feel free to comment. Feel free to share it as well!

Cheers !

75 thoughts on “Hadoop: RecordReader and FileInputFormat”

    • Thanks for the fast response. I tried adding an empty constructor and it didn't work; in the end the solution was to change the class to a static (nested) class, because otherwise it's impossible to have a real empty constructor in Java.

    • Hi,
      I really like your blogs and they have some very useful content for Hadoop developers.

      trying to build an inputformat which read 3 lines at a time
      i.e
      aaaaa
      bbbbb
      ccccc
      ddddd
      eeeee
      fffff

      it will send
      key value
      0 aaaaa bbbbb cccccc
      15 ddddd eeeee fffff

      on the last record of the first split it will merge some records from the second split.
      In this scenario, how should I make sure that the records which were read by the first split from the second split will not be read again by the second, i.e. no duplication?

      • Hi, just add the code below between 'while (pos < end)' and 'if (newSize == 0)'.
        Note: declare String Val = "";
        …………..
        ………
        while (pos < end) {

        /* murali added this 'for' loop
        * for getting 3 lines as one line of input to the mapper
        * if the input is like
        * aaaaa
        * bbbbb
        * ccccc
        * ddddd
        * eeeee
        * fffff
        * ggggg
        * hhhhh
        * iiiii
        *
        * the Mapper input (i.e. this RecordReader's output) would be like
        * 0 - aaaaabbbbbccccc
        * 18 - dddddeeeeefffff
        * 36 - ggggghhhhhiiiii and so on….
        * */

        for (int i = 0; i < 3; i++) {
            // Read one line and append its content to "value"
            newSize += in.readLine(value, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));

            Val = Val + value.toString();
            value = new Text(Val);
        }
        Val = ""; // till this 'for' loop is added by murali

        // No byte read, seems that we reached end of Split
        // Break and return false (no key / value)
        if (newSize == 0)
        ……..
        ……….

      • Hi, I also face this problem. Since I am new, could you send your code with the steps of execution and the software that needs to be installed?

  1. Nice post!

    I’m thinking of implementing something similar to it to implement a “linesToSkip” functionality (my motivation: in Hive, we cannot skip headers in text files). My idea:

    – extend org.apache.hadoop.mapreduce.lib.input.LineRecordReader, override initialize() method and after calling parents’ initialize(), read linesToSkip from Configuration and if FileSplit.getStart() == 0 (it’s the beginning of file), read “linesToSkip” dummy lines ;

    – extend org.apache.hadoop.mapreduce.lib.input.TextInputFormat in the same way as you did ;

    Does it seem right to you? It’s the first time I’m playing with TextInputFormat/LineRecordReader classes, so I would appreciate your ideas if you have time.

    Thanks !

    Luis

    • You may not necessarily want to read the data in the default way, line by line.
      Your data source is not necessarily flat files loaded on HDFS; it could be Excel sheets etc. Also, your data source may not be HDFS at all: you may want to read the data from a database instead of HDFS (read the HadoopDB paper for this). So, as long as you want to read the data in a way other than the default one, you need a custom RecordReader.

  2. Nice explanation.
    I am facing a problem in my program when I am using this CustomFileInputFormat

    job.setInputFormatClass(WholeFileRecordReader.class);
    this line gave me following error.

    Exception in thread “main” java.lang.Error: Unresolved compilation problem:
    The method setInputFormatClass(Class) in the type Job is not applicable for the arguments (Class)

    My program is just the Word Count program. I am modifying job.setInputFormatClass(TextInputFormat.class);
    to
    job.setInputFormatClass(WholeFileRecordReader.class);

    What’s the problem here?

  3. OK, now I will follow this tutorial to read the input data set from a PostgreSQL db. However, HadoopDB already does this, but I would like to do it on my own. 😀

  4. Hi,

    I’m getting exception

    java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.CombineFileSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit

    Any idea to resolve this

  5. Hi Antoine,
    This is great stuff 🙂
    I am trying to make a custom input format for JSON files.
    The challenge is that I want to be able to plug it into hive with a storage handler.
    Do you have any resources I can use?
    thanks! 🙂
    Meet

  6. Hi Antoine,
    Firstly I wanted to credit you that this is a great article!! Loved it!!
    I am trying to make a custom input format for reading JSON files from a cloud service for a school project..Would you happen to have anything that could potentially help me? An example of blog post?
    Have been struggling a lot and any help would be appreciated 🙂
    I can send you what I have. You seem to be an expert at this and you can probably point me in the right direction

    • Sorry for such a long delay in my answer.
      Did you manage to read your JSON documents? I guess your documents span many lines (otherwise you would not ask for any help). By any chance, are your JSON documents prettyprint-formatted (using tab characters)? If so, a simple regex could do the trick: your delimiter pattern could be any line starting with a bracket character ("$\\{"). If not, you will have to read char by char until your JSON document is complete. Alternatively, it might be much easier to process records as-is using SerDe libraries (not tested though): https://github.com/rcongiu/Hive-JSON-Serde

  7. Pingback: Spark / Hadoop: Processing GDELT data using Hadoop InputFormat and SparkSQL | Hadoopi

  8. Hello Antoine, really excellent explanation.
    Here I have one doubt, You have explained three scenarios with example by taking record size as 50MB, 150MB, 300MB.
    Here I need to understand the cases of record sizes of 64MB and 128MB.
    If the record size is 64MB or 128MB, then how many records will Reader1 read?
    Will Reader1 read the first record in Block2?

    Thank you
    Venkatesh

      • Hi, sorry for not making the question clear.
        What I mean is,
        For example consider the record size as 64MB, Block size as 128MB. Now Block1 contains 1st and 2nd records.
        And Block2 contains 3rd and 4th records and so on.
        Reader1 will read 1st and 2nd records in Block1, it is clear.
        And here question is will the Reader1 will read 3rd record which is in Block2 ?
        Here I'm confused because, as per my understanding, Reader2 will not know that the 3rd record is at the start of Block2, as Reader2 doesn't know that the previous character is a newline (how does Reader2 know that the 3rd record starts in Block2 without knowing that the previous character is a newline character?).

        Thanks & Regards
        Venkatesh

  9. Hi guys,
    I am new to Hadoop and asked myself the same question. But I did not understand: is this the default behavior of Hadoop, or do I need to implement this RecordReader myself?

    Thank you

    • The default implementation, TextInputFormat, reads record by record, where each record is delimited by the '\n' character. Use conf.set("textinputformat.record.delimiter", "…") should you need to override the default delimiter character. If you need any other logic, you need to implement your own, or use existing third-party ones (such as the XmlInputFormat provided with the Mahout distribution).

      • Hi Antoine, thank you for your answer.
        One more question please, about your example.
        What happens if the second split starts exactly at the beginning of a line? How will Hadoop decide that this line should not be skipped?

  10. Hi
    I am trying to write a custom json record reader
    My input is an array of json objects. about 5000 object for my test case
    example
    [
    {jsonobject on line 2}
    ,{jsonobject on line 3}
    ,….
    ]

    The first and the last line have [ and ] respectively.
    The default text input file format reads [, detects that it's the beginning of the array, but gives me an error saying end of file exception at line 1 column 3.

    I want to skip line 1 and the last line.
    I am very new to hadoop. I know i need to write a custom reader.
    Please help if you have any idea

  11. I have a data file in the following format, and I want to run a MapReduce job on it to find the average rating for each product ID

    ———————————–
    ProductID : XUV123
    ProductName : Geyser
    ProductRating : 3.0

    ProductID : VBG3465
    ProductName : Heater
    ProductRating : 2.0

    ProductID : BNM235
    ProductName : Mobile
    ProductRating : 4.0
    ————————————
    Kindly let me know how I can generate a tab-delimited text file from the above file in the following format

    XUV123||Geyser||3.0
    VBG3465||Heater||2.0
    BNM235||Mobile||4.0

    • 1. You can use a TextInputFormat with the option "textinputformat.record.delimiter=\n\n". This will split your records based on the empty line between them.
      2. In the map function, use a simple split("\n") to get the distinct lines, and a regex (or another split) to get the distinct values.
      3. Output TSV from the map. (A short sketch of this approach follows below.)

      Hope that helps!
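
      A minimal sketch of that recipe (my own illustration, not part of the original reply; it assumes the driver sets textinputformat.record.delimiter to "\n\n", and the class name AverageRatingMapper is hypothetical):

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class AverageRatingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String id = null, name = null, rating = null;
              // One input value = one product block (records are delimited by empty lines)
              for (String line : value.toString().split("\n")) {
                  String[] kv = line.split(":", 2);   // e.g. "ProductID : XUV123"
                  if (kv.length < 2) continue;
                  String k = kv[0].trim(), v = kv[1].trim();
                  if (k.equals("ProductID")) id = v;
                  else if (k.equals("ProductName")) name = v;
                  else if (k.equals("ProductRating")) rating = v;
              }
              if (id != null && name != null && rating != null) {
                  // Emits lines like "XUV123||Geyser||3.0"
                  context.write(NullWritable.get(), new Text(id + "||" + name + "||" + rating));
              }
          }
      }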

  12. Pingback: Hortonworks Certification preparation. | Hang around Big Data Technologies

  13. Hi Antoine,
    Really a nice post. Still one doubt: if the start of a split is at the EOL of a line, it will change start to start-1 and will put that record into the dummy Text. Agreed. But is it going to read that record from the previous block remotely and do nothing with it, just put it into the dummy? Isn't that an overhead?

  14. Hey Antoine
    Great article .I am trying to understand your code and improvise it for my situation.
    My input record is broken down into two lines..
    1,2,3,4
    5,6,7,8.
    I want to read this as one single line.How can I do this with your implementations?
    PS : The data resided within the same block.

  15. Thanks Antoine, nice explanation.
    But I still have one doubt. Let's say we have two splits S1, S2:
    S1 has an EOL at the end of it,
    S2 starts with some data (basically you can say it's the beginning of a new record).

    Now when Reader2 starts reading split S2 (start != 0), it will move one step back, just before the beginning of split S2. Now don't you think it will skip the first record from S2 by jumping to the next EOL?

  16. Hi Antoine,
    I need your help. I am working on one use case: I have a CSV file with a header. Please advise me: how can I remove the header, and which is the best input format for this type of file?

  17. Hi,
    I really admire your blogs and they have some very useful content for Hadoop developers.
    I have a question, though not exactly related to this post.
    I have got an MS Word file with content divided into paragraphs, and I need my MapReduce program to be able to open the file and read the content paragraph by paragraph. I can try to customize my record reader to read paragraph-wise (help from you on this would be great though), however my main concern is which InputFormat will suit me for processing Word files.
    Any pointer!!!

  18. Pingback: Hadoop: RecordReader and FileInputFormat | Me and my research

  19. Pingback: Cloudera Hadoop Developer Certification(CCDH) | ruchisaini

  20. Hi Antoine,
    Let me first tell you that this is an exceptional piece of writing that I found on the internet regarding implementing Custom Recorders. Kudos to You !!!
    However I have one question: in the post you mentioned that if a new line starts exactly at the beginning of the 2nd block, it will be neglected by the reader. So to do this, you reset the counter to --start (in the code). So my question is, won't this be an overhead of reading again almost all the lines which, say, span across two blocks?
    The code snippet being referred to is:-
    boolean skipFirstLine = false;
    if (start != 0) {
    skipFirstLine = true;
    // Set the file pointer at "start - 1" position.
    // This is to make sure we won't miss any line
    // It could happen if "start" is located on a EOL
    --start;
    fileIn.seek(start);
    }

  21. My question is:

    In your first example, how many InputSplits are there?

    3 InputSplits with an empty one (for mapper 3) or 2 InputSplits (only one mapper 1 and 2) ?

  22. Hi Antoine,
    Thanks for this exceptional code. This has helped a lot in improving my understanding of LineRecordReader.

    My requirement is in line with this code of yours, with a slight change: I want to specify a keyword for the start of the record and similarly for the end of the record, so basically anything that falls between those two words should be treated as one record.

    Please advise how I can achieve this.

    Regards ,
    Anurag

    • Hello Anurag

      Maybe you have found the answer by now. Your requirement looks like a fun problem to solve. I have a question: do you have any newline characters (CR & LF) in your input file? I mean, anywhere in the file, in between the record or after the record delimiter word?

      Regards
      Curious Cat

  23. Hey Antoine,
    It's an excellent blog, it helped me a lot to understand the concepts.
    I am working on a project and as a data set I have a directory containing lots of text files. What modification do I have to make if I want the key to be the file name and the value to be the content of the file?

  24. Good explanation, but I have a question. LineReader is the class that knows how to read a line. In this case it is easy to know where the line's end is. But if the file is not text, like a video for example, how can I know where the video's bitstream ends? Is there some way for the Hadoop framework to know it, or will I need to know the video's bitstream structure?

  25. Hi Antoine,
    This is great tutorial. Thanks a lot.
    I just started working with Hadoop, and I want to use video files as input. I am struggling with this. Can you help or guide me with this? Any help will be appreciated.
    Thanks

  26. Hi,

    Thanks for your tutorial. Can I ask what I should do if I use your code to process a compressed file, for example XX.xml.gz? I tried to use your code to handle it, but the result is not good. Thanks

  27. Hi Antoine,
    I am Aarthi. I want to know one thing. I am doing my final year project in Hadoop; the concept is a cache manager. I implement the cache and write the mapper output to an HDFS file. But I don't know how to modify the FileInputFormat class so that it does not run the mapper class. Because if we run the same file and the map task is the same, then there is no need to run the map function. So kindly help me.

  28. Thanks for the article. I have a situation as follows.
    I have multiple text files under directory A in HDFS. I would like to retrieve only the 10th line (record) from all these files in my Mapper. How can we do this using a custom input format? Do you have any clue on this? I appreciate your suggestions.

  29. I have a small doubt in example 3. In that case, the line size is 300mb. So the reader 2 will not do any thing? so in this example, only reader 1 and reader 3 will be there? which means only 2 mappers will run? If so, in the same article you mentioned that a reader can read max of 268MB (2,147,483,647) around. If so what about the other 32MB data, this will read by the reader 2 or reader 3?

  30. Hi Antoine,

    Need help!
    In my case each record in the log starts with a date format, and I want each of those to be treated as a new element rather than going line by line. I want to use some regex in the Hadoop configuration in Spark/Scala.

    I am repeating my question here again :
    How to read multiple line elements in Spark , where each record of log is starting with any format like yyyy-MM-dd and each record of log is multi-line?

    I have implemented below logic in scala so far for this :

    val hadoopConf = new Configuration(sc.hadoopConfiguration);
    //hadoopConf.set("textinputformat.record.delimiter", "2016-")
    hadoopConf.set("record.delimiter.regex", "^([0-9]{4}.*)")

    val accessLogs = sc.newAPIHadoopFile("/user/root/sample.log", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf).map(x => x._2.toString)

    I want to put a regex to recognize whether the line starts with a date format and then treat it as a new record, or else continue to add lines to the old record.

    But this is not working. If I pass the date manually then it works fine. The code above is where I want to put the regex.

    Here below is the sample format:

    2016-12-23 07:00:09,693 [jetty-51 – /app/service] INFO org.apache.cxf.interceptor.LoggingOutInterceptor S:METHOD_NAME=METHNAME : WebAppSessionId= : ChannelSessionId=web-xxx-xxx-xxx : ClientIp=xxxxxxx : – Outbound Message

    ————————-
    ID: 1978
    Address: https://sample.domain.com/SampleService.xxx/basic
    Encoding: UTF-8
    Content-Type: text/xml
    Headers: {Accept=[/], SOAPAction=[“WebDomain.Service/app”]}
    Payload:

    —————————–

    2016-12-26 08:00:01,514 [jetty-1195 – /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : – ActiveSpaceCacheHandler:getServiceResponseFromCache(); exception: java.lang.Exception: getServiceResponseData: com.tibco.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey:Request.US

    2016-12-26 08:00:01,624 [jetty-979 – /app/service/serviceName] ERROR com.testservices.cache.impl.ActiveSpaceCacheHandler S:METHOD_NAME=ServiceInquiryWithBands : WebAppSessionId= : ChannelSessionId=SERVICE : ClientIp=client-ip : – ActiveSpaceCacheHandler:setServiceResponseInCache(); exception: com.test.as.space.RuntimeASException: field key is not nullable and is missing in tuple for cachekey

    Thanks,
    Ashish Tyagi
