CSC 142/Chapter 6
From charlesreid1
Chapter 6: File Processing
Sections:
6.1 File reading basics
6.2 Token based processing
6.3 Line based processing
6.4 Advanced file processing
6.5 Case study: zip code lookup
Chapter 3 focused on a scanner for user input. Chapter 6 focuses on a scanner for file reading.
Many intro programming classes see this as a complicated topic, and Java doesn't make it easy. It's awkward, but it's manageable.
We will also explore how exceptions relate to file processing.
(Python makes this a dream.)
with open('data.txt','r') as f: lines = f.readlines()
Done.
Section 6.1: File Reading Basics
6.1 Definitions
Definitions:
- File
- File extension
- binary
- ASCII
- Checked exception
- Throws clause
6.1 Material
Examples of the deluge of data available:
- Landmark-project (earthquakes, pollution, baseball, history, weather, etc)
- Gutenberg - see ciphertexts
- ncbi.nlm.nih.gov - biological/genomic data
- IMDB
- Fedstats.gov
- US census
- World bank
- CIA world factbook
Files and file objects:
- Data stored on computer as files
- Files have extensions
- Files can be stored as text, or as binary
- To deal with a file, use a File object
- This provides various methods
- Java API lookup/reference
- Note: we aren't constructing a NEW FILE, we're constructing a new object to represent an existing file
Reading files with scanner:
- Useful methods of File objects: (see list)
- File object is like a pipe: doesn't care much about what kind of fluid flowing thru, or where it comes from
- File object is the delivery system
- You can then pass the File object into a scanner
- Again, scanner is like nozzle at end of pipe - does not care much about File type or details of File object, just like nozzle doesn't care about type of fluid
- Need to deal with potential problems; file not there
- Checked exception - like "check" in chess
- Must be dealt with (can't just say, ignore and keep going)
- To deal with this exception, either declare it with a throws clause on the header of the method that contains the risky code, or handle it with a try/catch block around that code
Throws clause: diapers for your code
More on throws/catch clauses:
- You're anticipating a particular kind of mess
- Like an if statement, for exceptions
- If we see this kind of exception, catch it this particular way
public static void main (String[] args) throws FileNotFoundException { ... }
Other exceptions:
- If you reach the end of a file, then ask for more
- NoSuchElementException
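The two exceptions above can be sketched as follows. This is a minimal illustration, not code from the textbook: the class name `OpenFileSketch` and the helpers `tryOpen` and `readsPastEnd` are made up for this example, and it uses try/catch rather than a throws clause so that both exceptions can be observed without crashing.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.NoSuchElementException;
import java.util.Scanner;

class OpenFileSketch {
    // Hypothetical helper: returns a Scanner on the named file,
    // or null if the file cannot be opened.
    static Scanner tryOpen(String name) {
        try {
            return new Scanner(new File(name));
        } catch (FileNotFoundException e) {
            // "If we see this kind of exception, catch it this particular way"
            System.out.println("Could not open " + name);
            return null;
        }
    }

    // Hypothetical helper: consumes every token, then asks for one more.
    // Returns true if that extra request raised NoSuchElementException.
    static boolean readsPastEnd(Scanner input) {
        while (input.hasNext()) {
            input.next();
        }
        try {
            input.next(); // nothing left to read
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Scanner input = tryOpen("hamlet.txt");
        if (input != null) {
            System.out.println("Ran off the end: " + readsPastEnd(input));
        }
    }
}
```

In practice, checking `hasNext()` before each `next()` call avoids the NoSuchElementException entirely.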
A word on the correct way:
Scanner input = new Scanner(new File("hamlet.txt"));
versus the incorrect way:
Scanner input = new Scanner("hamlet.txt");
(The latter scans the literal text "hamlet.txt" as its input, not the contents of a file with that name.)
NOTE: This is overloading in action (the Scanner constructor accepts several parameter types, including File and String)
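A short sketch of the difference between the two constructors, assuming a file named hamlet.txt may or may not exist in the current directory. The class name `ScannerSourceSketch` and the helper `firstToken` are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class ScannerSourceSketch {
    // Hypothetical helper: returns the first token of whatever the
    // Scanner was built on, or "" if there is no token at all.
    static String firstToken(Scanner input) {
        if (input.hasNext()) {
            return input.next();
        }
        return "";
    }

    public static void main(String[] args) {
        // Scanner(String): the text itself is the input,
        // so the first token is the file name, not the file's contents.
        System.out.println(firstToken(new Scanner("hamlet.txt")));

        // Scanner(File): the contents of the file are the input.
        try {
            System.out.println(firstToken(new Scanner(new File("hamlet.txt"))));
        } catch (FileNotFoundException e) {
            System.out.println("hamlet.txt is not in the current directory");
        }
    }
}
```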
Section 6.2: Token-Based Processing
6.2 Definitions
Definitions:
- Token-based processing
- Input cursor
- Consuming input
- File path
- Current directory
6.2 Material
Token - a single chunk of letters or character data
- Usually WORDS separated by SPACES
- But could also be NUMBERS separated by COMMAS
- Or, other stuff...
Example: file with 5 numbers
- Read in the first 5 numbers
- Cumulative sum of first 5 numbers
- don't forget the throws
Output:
- Program outputs sum as 337.19999999 instead of 337.2
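A sketch of the cumulative sum, under some assumptions: the five values below are made up so that they add to 337.2, and a String stands in for the data file so the example runs without one. To read a real file, build the Scanner from a File object and add `throws FileNotFoundException` to main.

```java
import java.util.Scanner;

class SumFiveSketch {
    // Hypothetical helper: cumulative sum of the next `count` numbers.
    static double sumFirst(Scanner input, int count) {
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += input.nextDouble();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up data; a String stands in for the contents of the file.
        Scanner input = new Scanner("308.2 14.9 7.4 2.8 3.9");
        double sum = sumFirst(input, 5);

        // Plain println shows every digit of the double, including any
        // round-off error accumulated by binary floating point.
        System.out.println("Sum = " + sum);

        // printf rounds the displayed value to one decimal place.
        System.out.printf("Sum = %.1f%n", sum);
    }
}
```

The stray digits come from the way doubles are stored in binary, not from the Scanner, so formatting the output is the usual remedy.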
Utilize scanner functions:
- Scanners have next(), nextDouble(), etc. to read the next value
Structure of files:
- Computer sees a one-dimensional stream of characters: everything else is our own invention (e.g., a line break is just another character in the stream, so the computer doesn't really see "lines")
- Scanner handles details of, e.g., what to do when it gets to a newline char or a number char
Exceptions from wrong data type:
- InputMismatchException
- Pay close attention to errors: not clear, but provide you with hints
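One way to avoid an InputMismatchException is to test the next token before reading it. The sketch below is an illustration with hypothetical names (`MismatchSketch`, `sumNumbers`); it first triggers the exception on purpose, then shows the guarded version.

```java
import java.util.InputMismatchException;
import java.util.Scanner;

class MismatchSketch {
    // Hypothetical helper: sums every numeric token and skips the rest.
    static double sumNumbers(Scanner input) {
        double sum = 0.0;
        while (input.hasNext()) {
            if (input.hasNextDouble()) {
                sum += input.nextDouble();
            } else {
                input.next(); // consume the non-numeric token and move on
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        try {
            new Scanner("hello").nextDouble(); // "hello" is not a number
        } catch (InputMismatchException e) {
            System.out.println("InputMismatchException: token was not a double");
        }
        System.out.println(sumNumbers(new Scanner("1.5 hello 2.5"))); // prints 4.0
    }
}
```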
Moving through a file:
- Computer sees 1D stream of text
- Can't jump around - like a VCR tape
- So, current location/position is important (input cursor)
- Cursor moves down one char at a time
- Scanner handles details:
- nextFloat() knows what to look for
- advances cursor to next word
Scanner object info:
- if we repeatedly call methods on the same Scanner (or pass it to a method more than once), it doesn't reset the cursor
- one scanner --> one File, one position
- processTokens(input,2) --> first 2 tokens
- processTokens(input,3) --> processes tokens 3, 4, and 5 (not 1, 2, 3)
etc.
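The cursor behavior can be sketched like this. The signature of `processTokens` here is a guess for illustration (the notes only show how it is called), and a String stands in for the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CursorSketch {
    // Assumed signature: consumes the next n tokens and returns them.
    // The Scanner keeps its position between calls.
    static List<String> processTokens(Scanner input, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < n && input.hasNext(); i++) {
            tokens.add(input.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("one two three four five");
        System.out.println(processTokens(input, 2)); // [one, two]
        System.out.println(processTokens(input, 3)); // [three, four, five]
    }
}
```

To start over from the first token, construct a new Scanner on the same source.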
Paths and directories:
- Organization of files: uses directory structure
- Root directory: top level (C:\ or /)
- If no path specified, look in current directory (where Java is being run from)
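- If a full (absolute) path is specified, look for the file at exactly that location
- If a relative path is specified, look in that location, starting from the current directory
- Slashes: in a Java string literal, write escaped backslashes (C:\\Windows\\ etc) or forward slashes (C:/Windows/etc)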
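A small sketch of how a File object resolves paths. The file names are placeholders and `describe` is a hypothetical helper; the exact output depends on the operating system and on where Java is run from.

```java
import java.io.File;

class PathSketch {
    // Hypothetical helper: reports whether a path is absolute or relative
    // and what it resolves to from the current directory.
    static String describe(String path) {
        File f = new File(path);
        String kind;
        if (f.isAbsolute()) {
            kind = "absolute";
        } else {
            kind = "relative";
        }
        return kind + " path, resolves to " + f.getAbsolutePath();
    }

    public static void main(String[] args) {
        // The current directory is where Java is being run from.
        System.out.println("Current directory: " + System.getProperty("user.dir"));

        System.out.println(describe("data.txt"));                // no directory given
        System.out.println(describe("C:\\Windows\\notes.txt"));  // escaped backslashes
        System.out.println(describe("C:/Windows/notes.txt"));    // forward slashes
    }
}
```

Constructing a File never touches the disk, so this runs even when none of the files exist.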
Example: 2 scanners
- One scanner for user input
- One scanner for file
- A path typed by the user at the console needs no escaping: Scanner reads the backslashes as ordinary characters (again - abstract away details, just take care of it)
Example: Complex input file
- Multiple columns, 1st column name, remainder numeric
- File processing will use while loops
- Identify things you do want to generalize
- What things do you know ahead of time - things you DON'T want to generalize
- Things you may not know ahead of time (e.g., number of columns) - generalize
- Example: we know number of columns... we don't know number of lines.
- NOTE: This example does a poor job of explaining this distinction, WHAT to abstract and WHEN
- Use while loop (while hasNextDouble()) to get column data
- Better way to pose this problem:
- Present a general scenario: (STRING) (SET OF AT LEAST 1 NUMBER)
- Each line has some data, so find the totals for each line
- THEN, you can raise the question: how many numbers?
- Does each row have same number of numbers?
- If so, line-based processing
- Otherwise, token-based processing
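A token-based sketch of the "name followed by some numbers" scenario. The names and values are invented, the helper `nextTotal` is hypothetical, and a String stands in for the input file. The number of values per name is not assumed ahead of time.

```java
import java.util.Scanner;

class HoursSketch {
    // Hypothetical helper: reads one name, then every number that
    // follows it, and returns "name total".
    static String nextTotal(Scanner input) {
        String name = input.next();
        double total = 0.0;
        while (input.hasNextDouble()) { // stops at the next name
            total += input.nextDouble();
        }
        return name + " " + total;
    }

    public static void main(String[] args) {
        // Made-up data: rows have different numbers of values.
        Scanner input = new Scanner("Ann 1.5 2.5\nBob 4 1 1");
        while (input.hasNext()) {       // unknown number of records
            System.out.println(nextTotal(input));
        }
        // prints:
        // Ann 4.0
        // Bob 6.0
    }
}
```

Note that this approach breaks if a name could itself be read as a number, since `hasNextDouble()` is the only thing separating one record from the next.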
Section 6.3: Line Based Processing
6.3 Material
Line by line:
- Rather than deal with tokens, may want to deal with lines
- Scanner has nextLine() and hasNextLine() methods
- uses toUpperCase() method to turn a file into uppercase
- Choice of lines vs tokens also depends on whether whitespace is important (example: poem, vs CSV file)
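A line-based sketch of the uppercase example. The helper name is hypothetical and a String stands in for the file; the point is that spacing inside each line survives, which token-based reading would not preserve.

```java
import java.util.Scanner;

class UpperCaseSketch {
    // Hypothetical helper: returns the input with every line uppercased.
    // Whitespace inside each line is kept exactly as it was.
    static String toUpperCaseLines(Scanner input) {
        StringBuilder out = new StringBuilder();
        while (input.hasNextLine()) {
            out.append(input.nextLine().toUpperCase()).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("roses are  red\n  violets are blue");
        System.out.print(toUpperCaseLines(input));
    }
}
```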
String scanners, line/token combos:
- Can combine line and token parsing
- Example: modify employees file so now it has an employee ID number out in front
- Need to deal more gracefully with this change
- Pseudocode
for each line of file:
    split line into tokens
    for each token in line:
        token 1 = xxx, token 2 = yyy, etc
Generalizing:
- Here again, we ask: what can we generalize, and what do we assume we always know?
- We can generalize the column layout: the thing that changes is the number of employees or who the employees are
- One monster scanner for the whole file, line by line
- Lots of mini scanners, 1 scanner per line, to turn the line string into tokens for processing
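The line/token combination can be sketched as below. The record layout (ID, name, then hours) follows the description above, but the sample data, the output wording, and the method names are assumptions made for this example.

```java
import java.util.Scanner;

class EmployeeSketch {
    // Mini scanner: turns one line of the form "<id> <name> <hours>..."
    // into a summary string.
    static String summarize(String line) {
        Scanner tokens = new Scanner(line);
        int id = tokens.nextInt();
        String name = tokens.next();
        double total = 0.0;
        while (tokens.hasNextDouble()) {
            total += tokens.nextDouble();
        }
        return name + " (ID#" + id + ") worked " + total + " hours";
    }

    // Monster scanner: walks the whole input one line at a time.
    static void processFile(Scanner input) {
        while (input.hasNextLine()) {
            System.out.println(summarize(input.nextLine()));
        }
    }

    public static void main(String[] args) {
        // Made-up data; a String stands in for the employee file.
        processFile(new Scanner("101 Erica 7.5 8.5\n783 Lee 8 8 8"));
    }
}
```

Because each line gets its own Scanner, a row with extra or missing hours cannot spill into the next employee's record.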
Section 6.4: Advanced File Processing
6.4 Definitions
Definitions:
- Boilerplate code
6.4 Material
Output files with PrintStream:
- Can read from files, can also write files
- Just like with scanner, we create a File object first
- Then we create a PrintStream object
- println() prints line to file
Example: Hello File
- Modify hello world to write to file
Example: remove whitespace
- Read in tokens, print them out with correct whitespace
Generalization: syntax "output.println()" and "System.out.println()" look similar because they are similar
We can tie the PrintStream object to the output console, or to a file, and it all works the same
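A sketch of the "Hello File" idea: one method that prints to any PrintStream, called once with the console and once with a file. The output file name hello.txt and the method name `sayHello` are placeholders.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintStream;

class HelloFileSketch {
    // Works the same for the console or a file: both are PrintStreams.
    static void sayHello(PrintStream out) {
        out.println("Hello, world!");
    }

    public static void main(String[] args) throws FileNotFoundException {
        sayHello(System.out); // System.out is itself a PrintStream

        // Placeholder file name; an existing file with this name is overwritten.
        PrintStream output = new PrintStream(new File("hello.txt"));
        sayHello(output);
        output.close(); // flush the text and release the file
    }
}
```

Opening a file for output can also throw FileNotFoundException (for example, if the directory does not exist), which is why main still needs the throws clause.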
Error handling/ensuring files readable:
- Particularly when taking input, important to ensure we can operate on files before actually operating on files
- If user inputs invalid filename, could crash program - or could just ask again
- Half-fencepost design, ask for input, then check and ask for input again if invalid
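A sketch of the re-prompting pattern, with hypothetical names and messages. The prompt is printed once before the loop and again only after a bad answer. As an extra safeguard not mentioned in the notes, the method returns null if the input runs out before a readable file is named.

```java
import java.io.File;
import java.util.Scanner;

class PromptSketch {
    // Keeps asking until the user names a readable file.
    // Returns null if the input ends first (an added safeguard).
    static File promptForFile(Scanner console) {
        System.out.print("Input file name? ");
        while (console.hasNextLine()) {
            File f = new File(console.nextLine());
            if (f.canRead()) {
                return f;
            }
            System.out.println("File not found. Try again.");
            System.out.print("Input file name? ");
        }
        return null;
    }

    public static void main(String[] args) {
        Scanner console = new Scanner(System.in);
        File f = promptForFile(console);
        if (f != null) {
            System.out.println("Reading " + f.getName());
        }
    }
}
```

Checking `canRead()` before building the file Scanner means the FileNotFoundException should never actually occur, although the compiler still requires it to be declared or caught.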
Section 6.5: Zip Code Lookup Case Study
6.5 Material
(Dating algorithms??? Really? Social justice.)
Introduce the problem, with background
- File contains data of form:
- 3 lines per city
ZIP
CITY, STATE
LAT LONG
Program should do the following:
- Introduce program
- Ask for user input
- Find coordinates for target zip code
- Display nearby zip codes
Break up the problem: start with last 2 steps (hardest)
How to find coordinates, for a given zip code?
- Step 1: find it (loop through file, 3 lines at a time)
- Step 2: return it (return the line with lat/long on it)
- All code into self-contained method
- Also deal with exceptions (zip not found)
How to find neighbor coordinates?
- Step 1: find it (same approach: loop through, 3 lines at a time)
- Step 2: print it
Define new method, show matches, using found lat1/long1
- For each city, determine lat/long
- Compute distance from target
- If distance is within the threshold, print whatever info we need to print
Final program structure:
- main() method, asks for input, gets file scanners
- giveIntro() method
- find() method to find target zip
- showMatches() method to find matches nearby
- distance() method to calculate the distance between lat1/long1 and lat2/long2
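The structure above might be sketched as follows. This is not the textbook's code: the records are invented, a String stands in for the data file, `find` signals "zip not found" by returning null, and the distance uses the haversine formula with an assumed Earth radius of 3958.8 miles, which may differ from the formula used in the case study.

```java
import java.util.Scanner;

class ZipLookupSketch {
    static final double EARTH_RADIUS_MILES = 3958.8; // assumed mean radius

    // Reads 3-line records (zip / city, state / lat long) and returns the
    // "lat long" line for the target zip, or null if it is not found.
    static String find(String target, Scanner data) {
        while (data.hasNextLine()) {
            String zip = data.nextLine();
            data.nextLine();                 // city line, not needed here
            String coords = data.nextLine();
            if (zip.equals(target)) {
                return coords;
            }
        }
        return null;
    }

    // Great-circle distance in miles (haversine formula, swapped in here).
    static double distance(double lat1, double long1, double lat2, double long2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLong = Math.toRadians(long2 - long1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLong / 2) * Math.sin(dLong / 2);
        return 2 * EARTH_RADIUS_MILES * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // Prints every record within `miles` of the target; returns the count.
    static int showMatches(double lat1, double long1, Scanner data, double miles) {
        int count = 0;
        while (data.hasNextLine()) {
            String zip = data.nextLine();
            String city = data.nextLine();
            Scanner coords = new Scanner(data.nextLine());
            double lat2 = coords.nextDouble();
            double long2 = coords.nextDouble();
            if (distance(lat1, long1, lat2, long2) <= miles) {
                System.out.println(zip + " " + city);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up records for illustration only.
        String records = "00001\nAlpha, AA\n10.0 20.0\n"
                       + "00002\nBeta, BB\n10.1 20.1\n"
                       + "00003\nGamma, CC\n40.0 -70.0\n";
        String line = find("00001", new Scanner(records));
        if (line == null) {
            System.out.println("zip not found");
        } else {
            Scanner coords = new Scanner(line);
            // A second Scanner is needed because the first cannot rewind.
            showMatches(coords.nextDouble(), coords.nextDouble(),
                        new Scanner(records), 25.0);
        }
    }
}
```

The target zip itself appears in the matches because its distance is zero; filtering it out is left as a design choice.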
Chapter 6 Summary
Deliverables
File reading:
- Purpose of Scanners
- Tokens vs lines, when to use
- Syntax required, file object
- Exceptions
Moving through files: tokenization
- One scanner = 1 file and 1 cursor
- Paths/directories
- Complex input files and strategies
- nextDouble() etc
Moving through files: line-based processing
- Lines vs. tokens: when
- Line/token combo: parse line-by-line with one scanner, parse each line with another
- Pseudocode
Advanced file processing
- file writing
- general principle: System.out is one kind of device, files are another
Case study:
- Breaking up complexity
- Don't get overwhelmed! Start simple, with tasks you know how to do
- Especially at beginning, hardest part is knowing what is possible
- Java API, while overwhelming, can help with that!
Chapter 6 Homework
HW Questions
(Recommended) Self-check problems: #3, #8, #11, #12, #17, #18
(Required) Exercises: #8, #12, #15
(Required) Projects: #2
HW Details
Self-check:
- 3, 8 - correct scanner syntax
- 11 - finding mistakes
- 12 - reading tokens from input file with scanner
- 17 - take a line of text and put a box around it
- 18 - object used to write to output files, and methods available
Exercises:
- 8 - double space
- 12 - html tag strip
- 15 - read file of heads/tails and compute statistics
Projects:
- 2 - file diff utility
Chapter 6 Code
Lecture Code
PublicSchools - tokenize CSV input file using a scanner
- using csv data about public school location/information from Seattle Open Data: https://data.seattle.gov/
- Token scanner only
- CSV as first example
- Extract particular field from each row to print it out
- Could use a Scanner... but that's pretty inflexible. Kind of like, first-thing-that-you-grab-for.
- Better: ask what we really want to do... we want to tokenize strings... so see if Java standard library provides a class for that
- Turns out, it does: StringTokenizer [1]
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
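For CSV rows, StringTokenizer also accepts a delimiter string as a second argument. The sketch below extracts one field from a row; the row contents and the helper `field` are invented, and this is not the lecture's PublicSchools code.

```java
import java.util.StringTokenizer;

class CsvFieldSketch {
    // Hypothetical helper: returns the field at the given 0-based index,
    // or null if the row has too few fields.
    static String field(String row, int index) {
        StringTokenizer st = new StringTokenizer(row, ","); // split on commas
        String result = null;
        for (int i = 0; i <= index && st.hasMoreTokens(); i++) {
            String token = st.nextToken();
            if (i == index) {
                result = token.trim(); // drop spaces around the field
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String row = "Example Elementary, 123 Main St, 98100"; // made-up row
        System.out.println(field(row, 0)); // Example Elementary
        System.out.println(field(row, 2)); // 98100
    }
}
```

Two caveats: StringTokenizer treats consecutive delimiters as one, so empty fields are silently skipped, and it knows nothing about quoted fields that contain commas. Real CSV data with either feature needs a different approach.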
ZipCode - zip code search and finding
- Reges and Stepp
- Read property from file
- Read other properties from file
- Compare to original property
- Conditionally print out
Worksheet Code
Public School zipcode
- Utilizing City of Seattle data about public schools
- Utilizing code in textbook - zip code case study
- Given a public school, or an integer index, what are nearby schools
Chapter 6 Goodies
Puzzle 6
Affine cipher ax+b, gcd modular arithmetic
Puzzles/Crypto Level 1/Puzzle 6
Profiles
Claude Shannon
- Information entropy, signals, ciphers
Flags
CSC 142 - Intro to Programming I Computer Science 142 - Intro to Programming I, South Seattle College.