FIT2014
Assignment 1

Introduction to the Assignment
In Lab 0, you met the stream editor sed, which detects and replaces certain types of patterns in
text, processing one line at a time. These patterns are actually specified by regular expressions. You
will use sed again in Problem 1 of this Assignment, to help construct regular expressions.
You will also learn about awk, which is a simple programming language that is widely used in
Unix/Linux systems and which also uses regular expressions. In Problem 1, you will construct an
awk program to identify a class of Chinese names.
Finally, Problem 2 is about applying induction to a problem of counting strings.
Introduction to awk
In an awk program, each line has the form
/pattern/ { action }
where the pattern is a regular expression (or certain other special patterns) and the action is an
instruction that specifies what to do with any line that contains a match for the pattern. The action
(and the { ... } around it) can be omitted, in which case any line that matches the pattern is printed.
Once you have written your program, it does not need to be compiled. It can be executed directly, by using the awk command in Linux:
$ awk -f programName inputFileName
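As a small illustration, the following sketch writes a two-line awk program to a file and runs it with awk -f. The file names (demo.awk, input.txt) and the sample text are made up for this example and are not part of the assignment files.

```shell
# Hypothetical scratch directory and files, used only for this illustration.
workdir=$(mktemp -d)

# Sample input: three lines of text.
printf 'apple pie\nbanana split\ncherry tart\n' > "$workdir/input.txt"

# A two-line awk program.
# Line 1 has a pattern but no action, so each matching line is printed as-is.
# Line 2 has an action that prints a custom message for each matching line.
cat > "$workdir/demo.awk" <<'EOF'
/an/
/tart/ { print "found a tart: " $0 }
EOF

# Run the program on the input file; no compilation step is needed.
demo_output=$(awk -f "$workdir/demo.awk" "$workdir/input.txt")
printf '%s\n' "$demo_output"

rm -rf "$workdir"
```

Only "banana split" contains the string "an", so it is printed unchanged, and only "cherry tart" triggers the second line's action.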
Your program is then executed on an input file in the following way.
// Initially, we’re at the start of the input file, and haven’t read any of it yet.
If the program has a line with the special pattern BEGIN, then
    do the action specified for this pattern.
Main loop, going through the input file:
{
    inputLine := next line of input file
    Go to the start of the program.
    Inner loop, going through the program:
    {
        programLine := next line of program (but ignore any BEGIN and END lines)
        if inputLine contains a string that matches the pattern in programLine, then
            if there is an action specified in the programLine, then
            {
                do this action
            }
            else
                just print inputLine // it goes to standard output
    }
}
If the program has a line with the special pattern END, then
    do the action specified for this pattern.
Any output is sent to standard output.
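The execution order described above can be seen in a short sketch. The input lines and patterns here are invented for illustration; the program is passed inline rather than with -f purely to keep the example compact.

```shell
# Three invented input lines are piped into a four-line awk program.
trace=$(printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "start" }                  # runs once, before any input is read
/t/   { count++ }                        # runs for each line containing "t"
/o/                                      # no action: matching lines are printed
END   { print "lines with t: " count }   # runs once, after all input is read
')
printf '%s\n' "$trace"
```

Each input line is tested against every non-BEGIN, non-END program line in order, so "two" both increments the counter and gets printed, while "three" is counted but not printed.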
You should read about the basics of awk, including the way it represents regular expressions and
the main instruction types used in its actions. Any of the following sources should be a reasonable
place to start:
• A. V. Aho, B. W. Kernighan and P. J. Weinberger, The AWK Programming Language,
Addison-Wesley, New York, 1988.
(The first few sections of Chapter 1 should have most of what you need, but be aware also of
the regular expression specification on p28.)

Introduction to Problem 1
The Master said, ‘What is necessary is to rectify names . . . . If names are not rectified, then words are not appropriate. If words are not appropriate, then deeds are not
accomplished.’
– Confucius (孔夫子), The Analects (transl. R. Dawson), Oxford University Press,
1993, §13.3.
Most organisations today deal with people from many different cultures and language groups,
and they must often record and process people’s names in systems that work mainly with English
language text. In such contexts, it is helpful to be able to recognise names from different language
groups. Example applications include: determining how to pronounce students’ names when reading
them out from a list at graduation ceremonies; determining how to greet a person with whom you
have an appointment; determining how to enter the various parts of a person’s name into a database;
determining how automatically-generated emails, sent to many different people listed in some file,
should address each recipient; determining the most likely native language of a person in situations
where their name is known but they cannot be spoken to directly at the time (e.g., in some emergency
situations). Recognising the language group that a name belongs to is an important first step in all
these situations.
In this problem you will write some code in sed and awk to try to recognise Chinese language
names in a long file of Asian names. More specifically, suppose you are given an input file in which
each line starts with a person’s name in some language, with each name transcribed somehow into
English text. Your task is to detect which of these names come from Mandarin Chinese transcribed
using Hànyǔ Pīnyīn (which is the most standard way of representing Mandarin Chinese using strings
of English letters).
In the input file, all text from the start of each line until the first colon (:) on the line (but not
including the colon itself) is taken to be a person’s name. In most cases, each line ends with a string
of non-blank letters specifying which language the name is believed to come from. An example input
file is provided, as inputFileOfNames. If you browse through the file, you will notice that it contains
names from a variety of Asian languages: Mandarin, Cantonese, Hokkien, Teochew, Hakka, Korean,
Japanese, Thai, Vietnamese, Malay and Indonesian. They have been represented in English text
using a variety of transcription schemes, and with all extra marks on letters (accents, tone marks,
other diacritical marks, etc.) removed.1 In many cases, the line about a person also contains some
other information about them, but our name recognition task will ignore that information.2
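To make the line format concrete, here is a hedged sketch of how awk could pull out the name (the text before the first colon) and the language label (the last word on the line). The three sample lines are invented and only imitate the described layout; the real contents of inputFileOfNames may differ.

```shell
# Invented sample lines in the layout described above: name, colon, optional
# extra information, then the language label as the last word.
fields=$(printf '%s\n' \
  'Wang Fang: teacher Mandarin' \
  'Kim Minjun: Korean' \
  'Nguyen Van An: engineer, born 10:30 Vietnamese' | awk '
{
    colon = index($0, ":")             # position of the first colon (0 if none)
    name  = substr($0, 1, colon - 1)   # everything before that colon
    lang  = $NF                        # last whitespace-separated word
    print name " | " lang
}')
printf '%s\n' "$fields"
```

The third line contains a second colon in the extra information; because index() returns the position of the first colon only, the name is still extracted correctly.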
Further information about working with names from different cultures can be found in:
• Fiona Swee-Lin Price, Success with Asian Names, Allen & Unwin, Crows Nest, NSW, 2007.
1 These marks carry important information about meaning and pronunciation. But they are often removed when
names are represented using other alphabets.
2 The file was compiled from a number of sources, mainly Wikipedia lists of names of type
https://en.wikipedia.org/wiki/List_of_CultureName_people, where CultureName is one of the cultures listed
above; also https://www.goratings.org/en/. The lists obtained from Wikipedia are rather imperfect, with people’s names often not written in a form that clearly shows the claimed cultural background.

3 There are exceptions, though they are uncommon: some family names have two or even more syllables, and some
given names have more than two syllables. But in this assignment we consider only one-syllable family names and
one- or two-syllable given names.
4 This terminology is different to the more usual linguistic terminology of initials and finals. This is deliberate,
since they represent slightly different sets of strings here to the standard initials and finals. This change was just a
small simplification for the purposes of the assignment.
5 The only standaloneSuffix that is not also a suffix is er.
6 Some reasons for the imperfect matching include: not all the combinations of prefixes and suffixes listed above
give valid syllables in Mandarin; some people write Mandarin Hànyǔ Pīnyīn names with a hyphen or a space between
the two syllables of the given name; and some family or given names may have more syllables than we have allowed
for.
We describe the steps in detail now, and then list the assignment submission requirements for
Problem 1.
We start with the file MandarinHanyuPinyin-NameStructure0, which just contains one line
consisting of the text <NAME>. Think of this as a special symbol, eventually to be transformed to a
regular expression for Hànyǔ Pīnyīn names.
We first transform <NAME> to a regular expression representing the syllable structure of Hànyǔ
Pīnyīn names. From our description of the syllable structure above, we can represent this by the
regular expression <SYLLABLE> <SYLLABLE><SYLLABLE>? where <SYLLABLE> is a special symbol to
represent any Hànyǔ Pīnyīn syllable, and putting “?” after a symbol means it can occur just once
or not at all at that position.7 Each occurrence of <SYLLABLE> will eventually be transformed to
a regular expression to represent such syllables. The first step of this transformation can be done
using the following sed command:
$ sed "s/<NAME>/<SYLLABLE> <SYLLABLE><SYLLABLE>?/" MandarinHanyuPinyin-NameStructure0 > MandarinHanyuPinyin-NameStructure1
Alternatively, you can put the sed instruction s/<NAME>/<SYLLABLE> <SYLLABLE><SYLLABLE>?/ (without any
surrounding quotation marks) into the file decomposeNameIntoSyllables (provided with this assignment) and do the following:
$ sed -f decomposeNameIntoSyllables MandarinHanyuPinyin-NameStructure0 > MandarinHanyuPinyin-NameStructure1
Try these, using the given files, and check that you get the required result in the file
MandarinHanyuPinyin-NameStructure1. This use of sed is a template for what you will do next.
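The same substitution can also be tried without any files, which may help when checking what a sed instruction does. This is only a sketch: it feeds the one-line text <NAME> to sed through a pipe instead of reading MandarinHanyuPinyin-NameStructure0.

```shell
# Feed the single symbol <NAME> to sed and apply the substitution from the text.
# Single quotes stop the shell from interpreting any characters in the instruction.
step1=$(printf '<NAME>\n' | sed 's/<NAME>/<SYLLABLE> <SYLLABLE><SYLLABLE>?/')
printf '%s\n' "$step1"
```

The "?" on the right-hand side of the substitution is copied literally into the output; it only acquires its regular expression meaning later, when the finished expression is used as a pattern.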
Now you need to replace each <SYLLABLE> by a regular expression representing possible syllable
structures, using <PREFIX>, <SUFFIX> and <STANDALONESUFFIX>.
Write the sed instruction to do this, in the file decomposeSyllablesIntoParts, and do the
transformation from your Linux command line using
$ sed -f decomposeSyllablesIntoParts MandarinHanyuPinyin-NameStructure1 > MandarinHanyuPinyin-NameStructure2
Now you need to replace each occurrence of each part, <PREFIX>, <SUFFIX> and <STANDALONESUFFIX>,
by regular expressions representing possible letter strings, according to the sets of strings given on
the previous page.
Write the sed instructions to do this, in the file decomposePartsIntoLetters. You will need
three lines: one line giving the sed instruction for each part type, <PREFIX>, <SUFFIX> and <STANDALONESUFFIX>.
Then do the transformation from your Linux command line using
$ sed -f decomposePartsIntoLetters MandarinHanyuPinyin-NameStructure2 > MandarinHanyuPinyin-NameStructure3
This should put, into the file MandarinHanyuPinyin-NameStructure3, a regular expression that
is intended to match Hànyǔ Pīnyīn names.
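A multi-line sed script of this kind can be sketched with toy symbols. Everything below is hypothetical: <COLOUR>, <SHAPE>, the file name toyScript and the replacement strings are placeholders and deliberately not the assignment's parts or letter strings. The point it illustrates is that a symbol occurring several times on one line is only replaced everywhere if the substitution carries the g flag.

```shell
workdir=$(mktemp -d)

# Toy script file: one substitution per line, each with the g (global) flag.
cat > "$workdir/toyScript" <<'EOF'
s/<COLOUR>/(red|blue)/g
s/<SHAPE>/(dot|bar)/g
EOF

# With g, every occurrence of each symbol on the line is replaced.
toy=$(printf '<COLOUR><SHAPE> <COLOUR>\n' | sed -f "$workdir/toyScript")
printf '%s\n' "$toy"

# Without g, only the first occurrence on the line is replaced.
first_only=$(printf '<COLOUR> <COLOUR>\n' | sed 's/<COLOUR>/(red|blue)/')
printf '%s\n' "$first_only"

rm -rf "$workdir"
```

The parentheses and the | character in the replacement text are copied literally by sed; they become grouping and alternation only when the result is later used as a regular expression.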
Matching names in the input file
We will now use this regular expression to construct an awk program to apply it to names in the
input file, and to do some simple computations to determine how well the regular expression matches
Hànyǔ Pīnyīn names.
For each name in the input file, checking it against your regular expression gives one of four
possible outcomes:
7 The operator “?” is a standard feature of regular expression syntax in sed and awk. Question to consider (but not
part of the assignment): does using it enlarge the class of languages that can be represented by regular expressions?
• Correct Match: a name that belongs to Mandarin Hànyǔ Pīnyīn (according to the last word
on its line in inputFileOfNames) is also matched by your regular expression.
• False Positive: a name that does not belong to Mandarin Hànyǔ Pīnyīn (according to the
last word on its line in inputFileOfNames), but is matched by your regular expression.
• False Negative: a name that belongs to Mandarin Hànyǔ Pīnyīn (according to the last word
on its line in inputFileOfNames), but is not matched by your regular expression.
• Correct Non-match: a name that does not belong to Mandarin Hànyǔ Pīnyīn (according to
the last word on its line in inputFileOfNames) and is not matched by your regular expression.
Your awk program should compute the number of names, in the input file, of each of these four
types, and report those numbers at the end.
You can write the awk program manually, cutting-and-pasting the regular expression as needed
(and making any necessary modifications to it) from MandarinHanyuPinyin-NameStructure3.
The heart of your awk program will be a statement that does the following.
• The pattern matches a Mandarin Hànyǔ Pīnyīn name whenever it occurs in the specified
position at the start of a line.
– Recall the general structure of an awk statement. By default, matches in awk are case-sensitive.
To do a case-insensitive match, you need to match the pattern against a version
of the current line in which all alphabetic characters have been converted to lower case.
This can be done using a statement of this form:
tolower($0) ~ /pattern/ { action }
Here, $0 is a built-in awk variable that always contains the current line of the input file,
and tolower() is a function that converts all alphabetic characters to lower case. You
can read “ ~ ” as “matches”.
• The action increments appropriate variables in order to update the number of correct matches,
false positives, false negatives, or correct non-matches, as appropriate.
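The difference between a plain pattern and the tolower($0) form can be seen in a short sketch. The four names and the pattern /^zhang/ are invented for illustration; the pattern is a stand-in, not the assignment's regular expression.

```shell
# Four invented lines; three differ only in letter case.
names='Zhang Wei
zhang wei
ZHANG WEI
Sato Yuki'

# Plain pattern: case-sensitive, so only the all-lower-case line matches.
sensitive=$(printf '%s\n' "$names" | awk '/^zhang/ { n++ } END { print n + 0 }')

# Pattern applied to a lower-cased copy of the line: all three spellings match.
insensitive=$(printf '%s\n' "$names" | awk 'tolower($0) ~ /^zhang/ { n++ } END { print n + 0 }')

printf 'case-sensitive: %s, case-insensitive: %s\n' "$sensitive" "$insensitive"
```

Writing n + 0 in the END action forces a numeric result, so a counter that was never incremented prints as 0 rather than as an empty string.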
You may also need another line or two, including an END line at the end, in order to print the total
numbers of matches of each type.
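One possible arrangement of the counting logic is sketched below, under several stated assumptions. The pattern /^(li|wang) [a-z]+:/ is a toy placeholder and not the regular expression the assignment asks you to build; the five input lines are invented; the label "Mandarin" as the last word is assumed for the example; and the variable names are hypothetical. The sketch also uses a single action with if/else rather than one pattern per outcome, which is just one way to organise it.

```shell
summary=$(printf '%s\n' \
  'Li Na: Mandarin' \
  'Wang Lei: singer Mandarin' \
  'Ouyang Zhen: Mandarin' \
  'Li Hoa: Vietnamese' \
  'Park Jisoo: Korean' | awk '
# Placeholder pattern, NOT the assignment regular expression:
# it only accepts names of the form "li X:" or "wang X:".
{
    matched  = (tolower($0) ~ /^(li|wang) [a-z]+:/)   # 1 if the name matches
    mandarin = ($NF == "Mandarin")                    # 1 if the label says Mandarin
    if      (matched && mandarin)   correctMatch++
    else if (matched && !mandarin)  falsePositive++
    else if (!matched && mandarin)  falseNegative++
    else                            correctNonMatch++
}
END {
    # %d prints 0 for any counter that was never incremented.
    printf "CM=%d FP=%d FN=%d CN=%d\n",
           correctMatch, falsePositive, falseNegative, correctNonMatch
}')
printf '%s\n' "$summary"
```

With this toy data, "Ouyang Zhen" is a false negative because the placeholder pattern is too narrow, and "Li Hoa" is a false positive because it fits the pattern despite its non-Mandarin label.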
