From charlesreid1

Chapter 6: File Processing

Sections:

6.1 File reading basics

6.2 Token based processing

6.3 Line based processing

6.4 Advanced file processing

6.5 Case study: zip code lookup

Chapter 3 focused on a scanner for user input. Chapter 6 focuses on a scanner for file reading.

Many intro programming classes see this as a complicated topic, and Java doesn't make it easy. It's awkward, but it's manageable.

We will also explore exceptions related to file processing.

(Python makes this a dream.)

with open('data.txt','r') as f:
    lines = f.readlines()

Done.

Section 6.1: File Reading Basics

6.1 Definitions

Definitions:

  • File
  • File extension
  • binary
  • ASCII
  • Checked exception
  • Throws clause

6.1 Material

Examples of the deluge of data available:

  • Landmark-project (earthquakes, pollution, baseball, history, weather, etc)
  • Gutenberg - see ciphertexts
  • ncbi.nlm.nih.gov - biological/genomic data
  • IMDB
  • Fedstats.gov
  • US census
  • World bank
  • CIA world factbook

Files and file objects:

  • Data stored on computer as files
  • Files have extensions
  • Files can be stored as text, or as binary
  • To deal with a file, use a File object
  • This provides various methods
  • Java API lookup/reference
  • Note: we aren't constructing a NEW FILE, we're constructing a new object to represent an existing file

Reading files with scanner:

  • Useful methods of File objects: (see list)
  • File object is like a pipe: doesn't care much about what kind of fluid flowing thru, or where it comes from
  • File object is the delivery system
  • You can then pass the File object into a scanner
  • Again, scanner is like nozzle at end of pipe - does not care much about File type or details of File object, just like nozzle doesn't care about type of fluid
  • Need to deal with potential problems; file not there
  • Checked exception - like "check" in chess
  • Must be dealt with (can't just say, ignore and keep going)
  • To deal with this exception, either catch it (try/catch) or declare it in a throws clause on the method header

Throws clause: diapers for your code

More on throws/catch clauses:

  • You're anticipating a particular kind of mess
  • Like an if statement, for exceptions
  • If we see this kind of exception, catch it this particular way
public static void main (String[] args) throws FileNotFoundException {
    ...
}
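
The throws clause above passes the exception up to the caller. The other option is to catch it where it happens. A minimal sketch, assuming a hypothetical helper named openOrNull (the class and method names are made up for illustration, not taken from the textbook):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class OpenFileSketch {
    // Hypothetical helper: returns a Scanner over the file,
    // or null if the file cannot be opened.
    static Scanner openOrNull(String fileName) {
        try {
            return new Scanner(new File(fileName));
        } catch (FileNotFoundException e) {
            System.out.println("Could not open " + fileName);
            return null;
        }
    }

    public static void main(String[] args) {
        // No throws clause needed here: the exception is caught inside openOrNull.
        Scanner input = openOrNull("hamlet.txt");
        if (input != null) {
            System.out.println("Opened the file.");
            input.close();
        }
    }
}
```

Either approach satisfies the compiler. Catching lets the program recover; a throws clause is shorter when there is nothing sensible to do about the error.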

Other exceptions:

  • If you reach the end of a file, then ask for more
  • NoSuchElementException
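
A short sketch of how this exception arises and how hasNext() avoids it. It scans a String rather than a file so it runs without any data file; the class and method names are made up for illustration:

```java
import java.util.NoSuchElementException;
import java.util.Scanner;

class EndOfInputSketch {
    // Counts tokens safely by checking hasNext() before every next().
    static int countTokens(Scanner input) {
        int count = 0;
        while (input.hasNext()) {
            input.next();
            count++;
        }
        return count;
    }

    // Returns true if asking for one more token throws NoSuchElementException.
    static boolean throwsPastEnd(Scanner input) {
        try {
            input.next();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("to be or not to be");
        System.out.println(countTokens(input));   // 6
        System.out.println(throwsPastEnd(input)); // true: all input was consumed
    }
}
```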

A word on the correct way:

Scanner input = new Scanner(new File("hamlet.txt"));

versus the incorrect way:

Scanner input = new Scanner("hamlet.txt");

(The latter scans the literal text "hamlet.txt" itself, not the contents of a file with that name.)

NOTE: This is overloading in action (Scanner can take multiple data types)
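
A quick way to see the difference is to ask the String-based scanner for its first token. A minimal sketch (class and method names are made up for illustration):

```java
import java.util.Scanner;

class StringVersusFileSketch {
    // Scanning a String reads the characters of the String itself.
    static String firstToken(String text) {
        Scanner input = new Scanner(text);
        return input.next();
    }

    public static void main(String[] args) {
        // Prints hamlet.txt: this scanner never touches the file system.
        System.out.println(firstToken("hamlet.txt"));
    }
}
```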

Section 6.2: Token-Based Processing

6.2 Definitions

Definitions:

  • Token-based processing
  • Input cursor
  • Consuming input
  • File path
  • Current directory

6.2 Material

Token - a single chunk of letters or character data

  • Usually WORDS separated by SPACES
  • But could also be NUMBERS separated by COMMAS
  • Or, other stuff...

Example: file with 5 numbers

  • Read in the first 5 numbers
  • Cumulative sum of first 5 numbers
  • don't forget the throws

Output:

  • Program outputs sum as 337.19999999 instead of 337.2

Utilize scanner functions:

  • Scanners have next() and nextDouble() and etc to read next values
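
A sketch of the cumulative-sum example. The numbers here are made up and scanned from a String so the code runs on its own; for a real file, swap in new Scanner(new File(...)) and add the throws clause. The odd trailing digits come from binary floating-point roundoff, and printf is one way to hide them when displaying the result:

```java
import java.util.Locale;
import java.util.Scanner;

class SumSketch {
    // Adds up the first n numbers in the input (assumes at least n are present).
    static double sumFirst(Scanner input, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += input.nextDouble();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up data standing in for the numbers file.
        // Locale.US makes nextDouble() expect a period as the decimal point.
        Scanner input = new Scanner("0.1 0.2 0.3 0.4 0.5").useLocale(Locale.US);
        double sum = sumFirst(input, 5);
        System.out.println(sum);                     // raw double, may show roundoff
        System.out.printf(Locale.US, "%.1f%n", sum); // rounded to one decimal place
    }
}
```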

Structure of files:

  • Computer sees a one-dimensional stream of characters: everything else is our own invention (e.g., a line break is just another character in the stream, so the computer doesn't see separate lines)
  • Scanner handles details of, e.g., what to do when it gets to a newline char or a number char

Exceptions from wrong data type:

  • InputMismatchException
  • Pay close attention to errors: not clear, but provide you with hints
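
A sketch of where InputMismatchException comes from and one way to avoid it: test with hasNextDouble() before calling nextDouble(). The input text and all names are made up for illustration:

```java
import java.util.InputMismatchException;
import java.util.Locale;
import java.util.Scanner;

class MismatchSketch {
    // Returns true if nextDouble() rejects the first token of the text.
    static boolean nextDoubleFails(String text) {
        Scanner input = new Scanner(text).useLocale(Locale.US);
        try {
            input.nextDouble();
            return false;
        } catch (InputMismatchException e) {
            return true;
        }
    }

    // Sums only the numeric tokens, skipping anything else by testing first.
    static double sumNumbers(String text) {
        Scanner input = new Scanner(text).useLocale(Locale.US);
        double sum = 0.0;
        while (input.hasNext()) {
            if (input.hasNextDouble()) {
                sum += input.nextDouble();
            } else {
                input.next(); // consume and discard the non-numeric token
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(nextDoubleFails("hello"));   // true
        System.out.println(sumNumbers("1.5 oops 2.5")); // 4.0
    }
}
```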

Moving through a file:

  • Computer sees 1D stream of text
  • Can't jump around - like a VCR tape
  • So, current location/position is important (input cursor)
  • Cursor moves down one char at a time
  • Scanner handles details:
    • nextFloat() knows what to look for
    • advances cursor to next word

Scanner object info:

  • if we repeatedly read from the same Scanner, it doesn't reset the cursor
  • one scanner --> one File, one position
  • processTokens(input,2) --> first 2 tokens
  • processTokens(input,3) --> processes tokens 3, 4, and 5 (not 1, 2, 3)

etc.
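
The cursor behavior can be sketched as follows. The body of processTokens is a guess at what such a helper might do (consume and return the next n tokens); only the name and call pattern come from the notes above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CursorSketch {
    // Assumed behavior: consume the next n tokens and return them.
    static List<String> processTokens(Scanner input, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            tokens.add(input.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("a b c d e");
        System.out.println(processTokens(input, 2)); // [a, b]
        // Same scanner, so the cursor picks up where the last call stopped.
        System.out.println(processTokens(input, 3)); // [c, d, e]
    }
}
```

To start over from the first token, construct a new Scanner on the same source.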

Paths and directories:

  • Organization of files: uses directory structure
  • Root directory: top level (C:\ or /)
  • If no path specified, look in current directory (where Java is being run from)
  • If full path is specified, look for the file
  • If relative path, specific location to look, starting from current directory
  • Slashes: can use C:\\Windows\\ etc or can use C:/Windows/etc
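
A small sketch of relative versus absolute paths. The file names are hypothetical and the files do not need to exist, since a File object only represents a path:

```java
import java.io.File;

class PathSketch {
    // Resolves a path against the current directory.
    static String resolve(String path) {
        return new File(path).getAbsolutePath();
    }

    public static void main(String[] args) {
        File relative = new File("data/numbers.txt");      // hypothetical relative path
        File escaped = new File("C:\\Windows\\notes.txt");  // backslashes escaped in source
        File forward = new File("C:/Windows/notes.txt");    // forward slashes need no escaping

        System.out.println(relative.isAbsolute());          // false
        System.out.println(System.getProperty("user.dir")); // the current directory
        System.out.println(resolve("data/numbers.txt"));    // current directory + relative path
        System.out.println(escaped.getPath());
        System.out.println(forward.getPath());
    }
}
```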

Example: 2 scanners

  • One scanner for user input
  • One scanner for file
  • Scanner deals with backslashes/escaping backslashes just fine (again - abstract away details, just take care of it)

Example: Complex input file

  • Multiple columns, 1st column name, remainder numeric
  • File processing will use while loops
  • Identify things you do want to generalize
  • What things do you know ahead of time - things you DON'T want to generalize
  • Things you may not know ahead of time (e.g., number of columns) - generalize
  • Example: we know number of columns... we don't know number of lines.
  • NOTE: This example does a poor job of explaining this distinction, WHAT to abstract and WHEN
  • Use while loop (while hasNextDouble()) to get column data
  • Better way to pose this problem:
  • Present a general scenario: (STRING) (SET OF AT LEAST 1 NUMBER)
  • Each line has some data, so find the totals for each line
  • THEN, you can raise the question: how many numbers?
  • Does each row have same number of numbers?
  • If so, line-based processing
  • Otherwise, token-based processing
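
A token-based sketch of the (STRING) (NUMBERS) scenario: read a name, then keep reading numbers until the next token is not a number. The data and names are made up, and the sketch assumes no name looks like a number:

```java
import java.util.Locale;
import java.util.Scanner;

class TotalsSketch {
    // Builds one "name total" line per record in the input.
    static String totals(Scanner input) {
        StringBuilder report = new StringBuilder();
        while (input.hasNext()) {            // unknown number of records
            String name = input.next();
            double sum = 0.0;
            while (input.hasNextDouble()) {  // unknown number of values per record
                sum += input.nextDouble();
            }
            report.append(name).append(" ").append(sum).append("\n");
        }
        return report.toString();
    }

    public static void main(String[] args) {
        // Made-up data; line breaks are just whitespace to a token-based scanner.
        Scanner input = new Scanner("Ann 1.5 2.5\nBob 4 1 1 2").useLocale(Locale.US);
        System.out.print(totals(input));
        // Ann 4.0
        // Bob 8.0
    }
}
```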

Section 6.3: Line Based Processing

6.3 Material

Line by line:

  • Rather than deal with tokens, may want to deal with lines
  • Scanner has nextLine() and hasNextLine() methods
  • uses toUpperCase() method to turn a file into uppercase
  • Choice of lines vs tokens also depends on whether whitespace is important (example: poem, vs CSV file)

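A line-based sketch of the uppercase example. Reading whole lines preserves the spacing inside each line, which token-based reading would throw away. The text and names are made up for illustration:

```java
import java.util.Locale;
import java.util.Scanner;

class UpperCaseSketch {
    // Returns the input with every line converted to uppercase.
    static String shout(Scanner input) {
        StringBuilder out = new StringBuilder();
        while (input.hasNextLine()) {
            String line = input.nextLine();
            out.append(line.toUpperCase(Locale.ROOT)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("roses are   red\n  violets are blue");
        System.out.print(shout(input)); // internal and leading spaces survive
    }
}
```
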
String scanners, line/token combos:

  • Can combine line and token parsing
  • Example: modify employees file so now it has an employee ID number out in front
  • Need to deal more gracefully with this change
  • Pseudocode
for each line of file:
  split line into tokens
  for each token in line:
    token 1 = xxx, token 2 = yyy, etc

Generalizing:

  • Here again, we ask: what can we generalize, and what do we assume we always know?
  • We can generalize the column layout: the thing that changes is the number of employees or who the employees are
  • One monster scanner for the whole file, line by line
  • Lots of mini scanners, 1 scanner per line, to turn the line string into tokens for processing
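
A sketch of the line/token combination for the employee file with an ID in front. With purely token-based reading, the ID at the start of the next line would be swallowed as one more number; giving each line its own scanner stops the number loop at the end of the line. The data, output format, and names are made up for illustration:

```java
import java.util.Locale;
import java.util.Scanner;

class EmployeeSketch {
    // One scanner walks the file line by line; a second scanner tokenizes each line.
    static String report(Scanner file) {
        StringBuilder out = new StringBuilder();
        while (file.hasNextLine()) {
            String line = file.nextLine();
            Scanner tokens = new Scanner(line).useLocale(Locale.US);
            if (!tokens.hasNextInt()) {
                continue; // skip blank lines or lines that do not start with an ID
            }
            int id = tokens.nextInt();
            String name = tokens.next(); // assumes a name follows the ID
            double sum = 0.0;
            while (tokens.hasNextDouble()) {
                sum += tokens.nextDouble();
            }
            out.append(name).append(" (ID#").append(id).append(") ")
               .append(sum).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Scanner file = new Scanner("101 Ann 1.5 2.5\n\n102 Bob 4 1 1 2");
        System.out.print(report(file));
        // Ann (ID#101) 4.0
        // Bob (ID#102) 8.0
    }
}
```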

Section 6.4: Advanced File Processing

6.4 Definitions

Definitions:

  • Boilerplate code

6.4 Material

Output files with printstream:

  • Can read from files, can also write files
  • Just like with scanner, we create a File object first
  • Then we create a PrintStream object
  • println() prints line to file

Example: Hello File

  • Modify hello world to write to file

Example: remove whitespace

  • Read in tokens, print them out with correct whitespace

Generalization: syntax "output.println()" and "System.out.println()" look similar because they are similar

We can tie the PrintStream object to the output console, or to a file, and it all works the same
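
A sketch of the Hello File idea: one method that takes a PrintStream works the same for the console and for a file. The file name hello.txt and the method name are made up for illustration:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintStream;

class HelloFileSketch {
    // Works with any PrintStream: the console or a file.
    static void greet(PrintStream output) {
        output.println("Hello, file!");
    }

    public static void main(String[] args) throws FileNotFoundException {
        greet(System.out); // System.out is itself a PrintStream

        // Creates hello.txt, or overwrites it if it already exists.
        PrintStream output = new PrintStream(new File("hello.txt"));
        greet(output);
        output.close();
    }
}
```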

Error handling/ensuring files readable:

  • Particularly when taking input, important to ensure we can operate on files before actually operating on files
  • If user inputs invalid filename, could crash program - or could just ask again
  • Half-fencepost design, ask for input, then check and ask for input again if invalid
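
A sketch of the ask, check, ask-again pattern, using File's canRead() method to test the name before opening it. The prompt text and names are made up for illustration:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class PromptSketch {
    // Keeps asking until the user names a file that can be read.
    static File promptForFile(Scanner console) {
        System.out.print("Input file name? ");
        File f = new File(console.nextLine());
        while (!f.canRead()) {
            System.out.println("File not found. Try again.");
            System.out.print("Input file name? ");
            f = new File(console.nextLine());
        }
        return f;
    }

    public static void main(String[] args) throws FileNotFoundException {
        Scanner console = new Scanner(System.in); // one scanner for the user
        File f = promptForFile(console);
        Scanner input = new Scanner(f);           // a second scanner for the file
        if (input.hasNextLine()) {
            System.out.println("First line: " + input.nextLine());
        }
    }
}
```

The throws clause on main is still required by the compiler, even though the check makes the exception unlikely.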

Section 6.5: Zip Code Lookup Case Study

6.5 Material

(Dating algorithms??? Really? Social justice.)

Introduce the problem, with background

  • File contains data of form:
  • 3 lines per city
ZIP
CITY, STATE
LAT LONG

Program should do the following:

  • Introduce program
  • Ask for user input
  • Find coordinates for target zip code
  • Display nearby zip codes

Break up the problem: start with last 2 steps (hardest)

How to find coordinates, for a given zip code?

  • Step 1: find it (loop through file, 3 lines at a time)
  • Step 2: return it (return the line with lat/long on it)
  • All code into self-contained method
  • Also deal with exceptions (zip not found)
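
A sketch of the find step, reading three lines per record. It returns null when the zip code is missing and leaves the reporting to the caller, which is one option among several. The sample data is made up, and the sketch assumes every record is complete:

```java
import java.util.Scanner;

class ZipFindSketch {
    // Returns the "LAT LONG" line for the target zip code, or null if not found.
    static String find(Scanner input, String target) {
        while (input.hasNextLine()) {
            String zip = input.nextLine();
            String city = input.nextLine();        // read to stay aligned; unused here
            String coordinates = input.nextLine();
            if (zip.equals(target)) {
                return coordinates;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up records in the ZIP / CITY, STATE / LAT LONG layout.
        String data = "00001\nAlphaville, AA\n10.0 20.0\n"
                    + "00002\nBetatown, BB\n30.0 40.0\n";
        System.out.println(find(new Scanner(data), "00002")); // 30.0 40.0
        System.out.println(find(new Scanner(data), "99999")); // null
    }
}
```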

How to find neighbor coordinates?

  • Step 1: find it (same approach: loop through, 3 lines at a time)
  • Step 2: print it

Define new method, show matches, using found lat1/long1

  • For each city, determine lat/long
  • Compute distance from target
  • If within the threshold distance, print whatever info we need to print

Final program structure:

  • main() method, asks for input, gets file scanners
  • giveIntro() method
  • find() method to find target zip
  • showMatches() method to find matches nearby
  • distance() method to calculate the distance between lat1/long1 and lat2/long2
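
A sketch of a distance() method using the haversine formula, which may not be the formula the textbook uses. It assumes coordinates in degrees and a mean Earth radius of 6371 km, so the result is in kilometers:

```java
class DistanceSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean radius

    // Great-circle distance in km between two points given in degrees.
    static double distance(double lat1, double long1, double lat2, double long2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(long2 - long1);

        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                 + Math.cos(phi1) * Math.cos(phi2)
                 * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        // Clamp to 1.0 so roundoff cannot push asin out of its domain.
        double c = 2 * Math.asin(Math.min(1.0, Math.sqrt(a)));
        return EARTH_RADIUS_KM * c;
    }

    public static void main(String[] args) {
        // A quarter of the way around the equator.
        System.out.println(distance(0.0, 0.0, 0.0, 90.0));
    }
}
```

A showMatches() loop would call this once per record and print the records whose distance falls under the chosen threshold.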

Chapter 6 Summary

Deliverables

File reading:

  • Purpose of scanners
  • Tokens vs lines, when to use
  • Syntax required, file object
  • Exceptions

Moving through files: tokenization

  • One scanner = 1 file and 1 cursor
  • Paths/directories
  • Complex input files and strategies
  • nextDouble() etc

Moving through files: line-based processing

  • Lines vs. tokens: when
  • Line/token combo: parse line-by-line with one scanner, parse each line with another
  • Pseudocode

Advanced file processing

  • file writing
  • general principle: System.out is one kind of device, files are another

Case study:

  • Breaking up complexity
  • Don't get overwhelmed! Start simple, with tasks you know how to do
  • Especially at beginning, hardest part is knowing what is possible
  • Java API, while overwhelming, can help with that!


Chapter 6 Homework

HW Questions

(Recommended) Self-check problems: #3, #8, #11, #12, #17, #18

(Required) Exercises: #8, #12, #15

(Required) Projects: #2

HW Details

Self-check:

  • 3, 8 - correct scanner syntax
  • 11 - finding mistakes
  • 12 - reading tokens from input file with scanner
  • 17 - take a line of text and put a box around it
  • 18 - object used to write to output files, and methods available

Exercises:

  • 8 - double space
  • 12 - html tag strip
  • 15 - read file of heads/tails and compute statistics

Projects:

  • 2 - file diff utility

Chapter 6 Code

Lecture Code

PublicSchools - tokenize CSV input file using a scanner

  • using csv data about public school location/information from Seattle Open Data: https://data.seattle.gov/
  • Token scanner only
  • CSV as first example
  • Extract particular field from each row to print it out
  • Could use a Scanner... but that's pretty inflexible. Kind of like, first-thing-that-you-grab-for.
  • Better: ask what we really want to do... we want to tokenize strings... so see if Java standard library provides a class for that
  • Turns out, it does: StringTokenizer [1]
     StringTokenizer st = new StringTokenizer("this is a test");
     while (st.hasMoreTokens()) {
         System.out.println(st.nextToken());
     }
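
A sketch of pulling one field out of a CSV row by passing a comma as the delimiter to StringTokenizer. The row contents and names are made up, not taken from the Seattle data. One caveat: StringTokenizer never returns empty tokens, so an empty field shifts the fields after it, and quoted fields containing commas are not handled:

```java
import java.util.StringTokenizer;

class CsvFieldSketch {
    // Returns the token at the given 0-based index, or null if the row is too short.
    static String field(String row, int index) {
        StringTokenizer st = new StringTokenizer(row, ",");
        for (int i = 0; st.hasMoreTokens(); i++) {
            String token = st.nextToken();
            if (i == index) {
                return token.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(field("Adams, Elementary, 98107", 2)); // 98107
        System.out.println(field("a,,c", 1)); // c (the empty field was skipped)
    }
}
```

When empty fields matter, String's split(",", -1) keeps them as empty strings.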

ZipCode - zip code search and finding

  • Reges and Stepp
  • Read property from file
  • Read other properties from file
  • Compare to original property
  • Conditionally print out

Worksheet Code

Public School zipcode

  • Utilizing City of Seattle data about public schools
  • Utilizing code in textbook - zip code case study
  • Given a public school, or an integer index, what are nearby schools

Chapter 6 Goodies

Puzzle 6

Affine cipher ax+b, gcd modular arithmetic

Puzzles/Crypto Level 1/Puzzle 6

Profiles

Claude Shannon

  • Information entropy, signals, ciphers
