CSC 209 assignment four questions and answers

Example code:

socket examples presented and discussed in class
examples in chapter 10 of the Haviland et al textbook
some notes on calling select()
I've written an explanation of the network newline convention
the linked-list tutorial (and solutions in the directory /u/csc209h/summer/pub/lab/soln/08)
compiled solution and client and server in /u/csc209h/summer/pub/a4
chatsvr protocol description in /u/csc209h/summer/pub/a4/PROTOCOL
/u/csc209h/summer/pub/a4/lookup.c for the code to look up a host name to get an IP address

Furthermore, you probably want to begin your program with /u/csc209h/summer/pub/a4/chatbridge.c.starter so as to parse the command-line arguments correctly.

Note: Do not copy any code which you don't understand into your program, whether it's from the samples above or stuff you find on the net or in a book. If you do this, you end up with bits in your program which you don't understand and which are strange or silly. Start with chatbridge.c.starter; make sure you understand the code there fully; and then only write in things you understand.

I really mean the advice in the previous paragraph. If you have code in your program for no reason other than that it was in some other file, you will not get your program working. Really!

Above all, Keep It Simple!

And, I would like to bring back to your attention the "various segfault problems" entry in the assignment two and three Q&A files, which gives some general advice.

Q: How does the code in lookup.c work?

A: Ok, this is an exception to my advice never to copy in code which you don't understand. The code is a bit tricky, and I can explain it best in person, but it's standard and you can copy it.

The short version is that gethostbyname() returns the lookup information in a format which is extremely general, and able to represent host names on multiple different kinds of networks, and able to represent multiple IP addresses (or other address information) per host, and so on. So extracting this information specifically for internet hosts is a bit involved, and the code there is the way to do it, and you can copy this into chatbridge.
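To give a feel for what "a bit involved" means, here is a minimal sketch of pulling one IPv4 address out of the gethostbyname() result. This is not the code in lookup.c; the function name lookup_ipv4() and its return convention are made up for illustration, and it only looks at the first address in the list.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not the course's lookup.c): find the first IPv4
 * address of an internet host.
 * Returns 0 on success, -1 if the lookup fails or yields no IPv4 address.
 */
int lookup_ipv4(const char *hostname, struct in_addr *result)
{
    struct hostent *hp = gethostbyname(hostname);

    if (hp == NULL)
        return -1;                      /* no such host */
    if (hp->h_addrtype != AF_INET || hp->h_addr_list[0] == NULL)
        return -1;                      /* not an internet (IPv4) address */
    if (hp->h_length != (int)sizeof *result)
        return -1;                      /* unexpected address size */

    /* h_addr_list[0] points at raw address bytes; copy them out. */
    memcpy(result, hp->h_addr_list[0], sizeof *result);
    return 0;
}
```

The checks on h_addrtype and h_length are there because the result format is general enough to describe other kinds of networks; the address bytes come back in network byte order, ready to go into a struct sockaddr_in.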

Q: Can I modify [...] in the starter code?

A: Yes; you can modify anything in chatbridge.c.starter, but please take great care not to break the command-line argument processing or you'll fail all the automated tests.

Just about the easiest way to write a fixed message to a socket is like this:

        static char msg[] = "Hello, world\r\n";
        write(fd, msg, sizeof msg - 1);

(That might look like an "array reference" declaration as in Java, but what it actually means in C is that the compiler determines the array size from the initializer.)

"sizeof msg" includes the terminating zero byte, which we do not want to write. Similarly, if you make up a string to write to a socket, the number of bytes to write is strlen(buf), not strlen(buf)+1.

(Note that sizeof is only of use here because there is a relation between the size of the data object and the length of the string. You can't use sizeof if you declare an array and then fill it up (it will give you the size of the array, not of the portion of it which is a zero-terminated string); and you can't use sizeof on a string passed as a parameter to a function (what you pass is a pointer to the zeroth element of the string, and sizeof that is the storage requirements of the pointer, not of the array it points to the zeroth element of). If it's a string, you can always use strlen() (including in the above example — strlen() would have worked there too).)
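As a small sketch of the strlen() approach, which works for any zero-terminated string including one passed as a parameter: the helper name write_string() is made up for illustration and is not part of the starter code.

```c
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a zero-terminated string to fd, leaving out
 * the terminating zero byte. Returns whatever write() returns.
 */
ssize_t write_string(int fd, const char *s)
{
    return write(fd, s, strlen(s));     /* strlen(s), not strlen(s)+1 */
}
```

Calling write_string(fd, "Hello, world\r\n") asks write() for 14 bytes, the same count as "sizeof msg - 1" in the example above.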

Q: Hey, you're using sizeof to find the number of elements in the array! You kept saying that there was no way to do this in C!

A: sizeof tells you the size of an object or type, in bytes. It's only useful in very specific circumstances.

If you have a function with a header like this:

        int f(int *x)
        {
            ...

then "sizeof x" will be 8 on the teach.cs machines, because that's the number of bytes in a pointer-to-int on these machines.

But maybe you think you will cleverly write instead

        int f(int x[])
        {
            ...

Well, "sizeof x" will still be 8, because as I've said in class, the 'x' there is really a pointer, due to a special conversion rule which applies only inside formal parameter lists. This is why I strongly recommend against using the above syntax, because it looks like it's giving you something it isn't. It looks like an array, but it is actually a pointer.

So "sizeof" doesn't always tell you the size of something in high-level terms. Furthermore, with respect to assignment four, note that it hardly ever tells you the number of bytes you need to transmit over the network.

If you have

        char buf[300];

then "sizeof buf" is 300 (since "sizeof(char)" is always 1, i.e. the units of sizeof are chars).

This value might be useful for some purposes, e.g. in indicating the amount of buffer space available to a read() or fgets(). But it is not useful in a write(), because you don't want to transmit 300 bytes. And if you pass this to a function, it decays into a pointer to its zeroth element, and the sizeof of that is something like 8; this is probably not of any use to you.

If you want to transmit "hi\r\n", for example, that's 4 bytes.

So this is also wrong:

        char msg[] = "hi\r\n";
        write(fd, msg, sizeof msg);     /* WRONG WRONG WRONG */

because sizeof msg is 5, including the terminating zero byte, but you want to transmit 4 bytes.
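The numbers in this answer can be checked with a couple of tiny functions. This is only an illustrative sketch; the names size_seen_by_callee() and bytes_to_send() are made up.

```c
#include <stddef.h>
#include <string.h>

/* Inside a function, a string parameter is only a pointer. */
size_t size_seen_by_callee(const char *s)
{
    return sizeof s;        /* size of a pointer (e.g. 8), not of the array */
}

/* Number of bytes to transmit for a zero-terminated string. */
size_t bytes_to_send(const char *s)
{
    return strlen(s);       /* excludes the terminating zero byte */
}
```

With "char msg[] = "hi\r\n";" and "char buf[300] = "hi\r\n";" in the caller, sizeof msg is 5 and sizeof buf is 300, but bytes_to_send() gives 4 for both, and size_seen_by_callee(buf) gives sizeof(char *) no matter how big the array is.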

So you have to figure out how many bytes you want to transmit. This is not always trivial.

read() returns a byte count. You need to store that. read() does not produce a zero-terminated string. You can assign the '\0' to a member of the array in a separate C statement if you want it there.
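A minimal sketch of that last point, assuming you do want a zero-terminated string after a single read(). The helper name read_as_string() is hypothetical, and size must be at least 1.

```c
#include <unistd.h>

/*
 * Hypothetical helper: read up to size-1 bytes from fd into buf and add
 * the terminating zero byte ourselves. Returns the read() result, so the
 * caller still has the byte count (or 0 for EOF, negative for error).
 */
ssize_t read_as_string(int fd, char *buf, size_t size)
{
    ssize_t n = read(fd, buf, size - 1);    /* leave room for the '\0' */

    if (n >= 0)
        buf[n] = '\0';                      /* read() does not do this */
    return n;
}
```

Note that the sizeof of the caller's array is useful here as the "size" argument, because it says how much buffer space is available, not how much data there is.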

Q: What is the first argument to select()?

A: The max of all of the file descriptor numbers in all of the fd sets, plus one.

Q: What do I put in the fd set passed as the second argument to select()?

A: You are interested in data from any of the connected servers. So you want all of these file descriptors in your fd_set value. And remember that the first argument to select() is the max of these numbers plus one.

Q: What value do I use as the timeout for select()? Does it matter?

A: Yes, it matters. Specifically, you must not have a timeout. You want to block until something happens. There is no time limit. Thus the last argument to select() is simply NULL.
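Putting the three select() answers above together, a call might be set up roughly like this. It is a sketch only: the function name wait_for_input() and the idea of passing the descriptors in an array are assumptions for illustration, not something the assignment requires.

```c
#include <stddef.h>
#include <sys/select.h>

/*
 * Hypothetical helper: block until at least one of the n descriptors in
 * fds[] has something to read. On return, *readable holds the ready ones.
 * Returns what select() returns (-1 on error).
 */
int wait_for_input(const int *fds, int n, fd_set *readable)
{
    int maxfd = -1;
    int i;

    FD_ZERO(readable);                  /* start from the empty set */
    for (i = 0; i < n; i++) {
        FD_SET(fds[i], readable);       /* add each server's socket */
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    /* first argument: highest fd plus one; last argument: no timeout */
    return select(maxfd + 1, readable, NULL, NULL, NULL);
}
```

After it returns, the caller checks each descriptor with FD_ISSET(fd, &readable) to see which servers have data.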

Q: How do we test how our program behaves on partial reads? When I type a line, 'nc' doesn't send it until I press return.

A: I've written a program to assist with this. Please try running your program against /u/csc209h/summer/pub/a4/trickyserver . (You could try running the supplied chatclient against trickyserver first, to understand what trickyserver does.)

Q: What does your memnewline() do?

A: If we were using the unix newline convention instead of the network newline convention, a number of places in the program would say something like "memchr(buf, '\n', bytes_in_buf)" to find the end of the line, both to terminate the string there so that we can use easy string-processing functions thereafter, and to know where to find the next line in the buffer (which we move down to the beginning of buf with memmove(), as supplied).

But these bits of the program need to search not for a \n, nor a \r, but either, whichever comes first. It seemed simplest to write a new function to do this search.
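For illustration, a function of that kind could look something like the following. This is a guess at the general shape, not the supplied memnewline(); the name find_newline() and its exact parameters are assumptions.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a memnewline()-style search: return a pointer
 * to the first '\r' or '\n' in the first n bytes of buf, or NULL if
 * neither occurs.
 */
char *find_newline(char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (buf[i] == '\r' || buf[i] == '\n')
            return buf + i;
    return NULL;
}
```

Like memchr(), it takes a byte count rather than relying on a terminating zero byte, because data straight from read() is not a string.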

Q: So what does "!!memnewline(...)" mean?

A: You can think of "!!" as "boolean normalize". If we were writing an 'if', we could say

        if (memnewline(...))

But if we want to assign it to an int variable, that's not going to work, because it's a pointer.
"!memnewline(...)" is an integer, but it's backwards, because of the 'not'.
So, "!!memnewline(...)" is something which counts as the same as a boolean, but is an integer so can be stored in lines_pending.
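The same idea in isolation, as a tiny sketch (the function name as_flag() is made up):

```c
#include <stddef.h>

/* Hypothetical helper: turn "pointer or NULL" into the int 1 or 0. */
int as_flag(const void *p)
{
    return !!p;     /* !p is 1 for NULL, else 0; the second ! flips it back */
}
```

So as_flag(NULL) is 0 and as_flag() of any non-null pointer is exactly 1, never some other nonzero value.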

Q: Is it ok to use O_NONBLOCK instead of select()?

A: No. Going through all the idle sockets in a loop and checking for activity is called "busy-waiting" — you really have no work to do, but you're spending a lot of time doing it. This takes away processing power from other processes on the same machine.

Have you ever had a program on your home computer which when it's running, everything else is sluggish, even if that program is just sitting idle? So you see that the computer is sluggish and you go "oh, I must have fooglop running" and you switch to it and quit it and all's well.

The problem with that program was that it was busy-waiting instead of doing something like select(). It's a big problem when a program does this.

Review of the FD_* macros:
I think that the FD_* macros cause some of their confusion simply because of the terminology embodied in their names. They use terminology reminiscent of digital logic (e.g. CSC 258), based on the fact that you are turning on and off bits in a word.

So, the function to set the value to the empty set is FD_ZERO, because it sets the value of that int (or long or whatever it is) to the integer zero, which has all bits zero, which means that no elements are in the set.

The function FD_SET "sets" the bit, meaning that it makes it one. The function FD_CLR "clears" the bit, meaning that it makes it zero.

Similarly, "FD_ISSET" is not checking whether or not something is a mathematical set; it's checking whether a bit is set, which means that the given item is an element of that fd_set value.

You may prefer to use more modern abstract-data-type terminology:

fd_set: type of a set of file descriptors (a "set" abstract data type is like a list, but is unordered)
FD_ZERO: make the fd_set value the empty set (ignore the word "ZERO")
FD_SET: add the fd value into the fd_set value (ignore the word "SET")
FD_CLR: remove the fd value from the fd_set value
FD_ISSET: check whether the fd value is in the fd_set (ignore the word "SET" — it's "IS_MEMBER", and the subject of this predicate is the fd)

I think I may be able to answer some (most?) remaining confusion about the FD_* macros by presenting a sample implementation. This would be put into a system include file, <sys/select.h> on current systems, or some other file it #includes. The following implementation assumes a maximum fd of 31, and that ints are at least 32 bits (so as to hold a bit for each of fds 0, 1, 2, ..., 31); the real FD_* macros are more complicated than this.

#define FD_SETSIZE 32
typedef unsigned int fd_set;    /* one bit per fd; unsigned so that bit 31 is usable */
#define FD_ZERO(p) (*(p) = 0)
#define FD_SET(fd,p) (*(p) |= 1u << (fd))
#define FD_CLR(fd,p) (*(p) &= ~(1u << (fd)))
#define FD_ISSET(fd,p) ((*(p) & (1u << (fd))) != 0)

On the other hand, the above might not clarify matters if you are a stranger to this "bit-twiddling". I will explain it in person to anyone who is interested, but you don't have to understand it for this course. Basically, the terminology used in the FD_* macros is excessively low-level, and if you are familiar with some of these low-level matters, the above might help bring it together, I hope. If you aren't, that's ok; consider the semantics to be as listed previously (prior to this sample implementation).
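If you prefer to think only in the "set of file descriptors" terms, here is a small sketch which uses the real macros from <sys/select.h> purely as set operations. The function count_members() is made up for illustration.

```c
#include <sys/select.h>

/*
 * Hypothetical helper: count how many of the descriptors 0..maxfd are
 * members of *set.
 */
int count_members(fd_set *set, int maxfd)
{
    int fd, count = 0;

    for (fd = 0; fd <= maxfd; fd++)
        if (FD_ISSET(fd, set))      /* "is fd a member of the set?" */
            count++;
    return count;
}
```

For example, after FD_ZERO(&s) the count is 0; after FD_SET(3, &s) and FD_SET(7, &s) it is 2; and after FD_CLR(3, &s) it is 1, with only fd 7 still a member.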

When the other side of a socket connection disconnects, that socket fd is considered by select() to have read-oriented activity, so select() will tell you that there's activity on that fd, and thus you'll do a read, and you'll get zero bytes, and that is the signal that the connection has been dropped. That is to say, check if the read() return value is zero. Negative means error, zero means EOF, and positive indicates the number of data bytes read. (If there are no bytes to be read, read() blocks until there are!)

When the server sends EOF, that means it has exited or for some reason has hung up on you (the latter won't happen if you are following the protocol). Just print a simple message to stdout (such as "Server shut down\n") and exit.
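The three cases for the read() return value might be handled along these lines. This is a sketch under assumptions: the helper name read_from_server() is made up, and it leaves the actual exit to the caller.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: do one read() on a server socket after select()
 * has reported activity. Returns the number of bytes read (> 0), 0 on
 * EOF (the server went away), or -1 on error.
 */
ssize_t read_from_server(int fd, char *buf, size_t size)
{
    ssize_t n = read(fd, buf, size);

    if (n < 0)
        perror("read");                 /* negative: error */
    else if (n == 0)
        printf("Server shut down\n");   /* zero: EOF; caller should exit */
    return n;                           /* positive: n data bytes in buf */
}
```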

Q: What's this "localhost" thing about?

A: We discussed in lecture how 127.0.0.1 is a special IP address which "loops back" to the machine you're already on.

But we normally use host names, not IP addresses, when running programs. For example, you type "www.teach.cs.toronto.edu" into your web browser instead of 128.100.31.101.

The name "localhost" translates to the IP address 127.0.0.1. Using either one on the command line will yield identical results when you run the client, or nc.

Note that you must not trust the other side of the connection to send appropriately-sized messages. See how the client does not trust the server. Most importantly, no matter what the other side sends, there must be no buffer overrun errors. If someone connects to the server and types an extremely long line and it gets split in two, fine; if it crashes, not fine. If your program exceeds array bounds, then there is almost certainly a possible input which makes it crash.

Buffer overrun errors are not only a concern for your communication over the socket, but are a general issue. For example, the following is wrong, for any size of s:

        if (fgets(s, sizeof s, fp) == NULL)
            ...
        strcat(s, ", and the same goes for you!");

The fgets() might produce a string which is of the maximal size to be able to fit into s; so the strcat (potentially) exceeds the array bounds. C programs are only correct if you can prove that array bounds are not exceeded.
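One way to make the bounds argument provable is to check the available space before appending. The following is a sketch, not required code; the name append_if_fits() is made up, and it assumes dst already holds a zero-terminated string inside an array of dstsize bytes.

```c
#include <string.h>

/*
 * Hypothetical helper: append suffix to the string in dst only if the
 * result, including its terminating zero byte, fits in dstsize bytes.
 * Returns 0 on success, -1 if there is not enough room.
 */
int append_if_fits(char *dst, size_t dstsize, const char *suffix)
{
    size_t used = strlen(dst);
    size_t extra = strlen(suffix);

    if (used + extra + 1 > dstsize)
        return -1;                          /* would exceed array bounds */
    memcpy(dst + used, suffix, extra + 1);  /* copies the '\0' too */
    return 0;
}
```

With "char s[10]" holding "abc", appending "defghi" succeeds (9 characters plus the zero byte is exactly 10), and a further append of "x" is refused rather than overrunning s.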

Q: How does chatclient's read_line_from_server() work?

A: When we read some data from the server, we might not get a whole line. Suppose we do a read and we get three bytes. We just store that, and read_line_from_server() returns NULL indicating that we don't have a line. This value 3 is stored in "bytes_in_buf" — we have three bytes in the buf variable.

The next time select() tells us that there is data to read from the server, we read with an argument of buf+3, so that this data will be read into buf after the data we have so far. Suppose we get twenty more bytes. Now bytes_in_buf will be 23.

If there is a newline in the buffer now, then we have a line and can return a non-NULL value. But we might have more data than that, e.g. part of a subsequent line! The "nextbuf" pointer is used to remember where the data _after_ that line is, and in the next call to read_line_from_server(), we will move that data down to the beginning of buf.

One more wrinkle: Suppose we have not just part of another line in the buffer, but another whole line? So we return the first line, and the next time read_line_from_server() is called, we'll return the next line. But suppose select() doesn't report any read activity for ten years? That line is in memory ready to be returned by read_line_from_server(), but there's nothing new to read so we never get called.

This is the reason for the lines_pending variable. It is a flag which tells the caller to call read_line_from_server() without doing a select.
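The buffering scheme described in this answer can be sketched as follows. This is not the actual chatclient code: bytes_in_buf, nextbuf and lines_pending are the names used above, but struct linebuf, next_line(), find_eol(), LINEBUF_SIZE and the "status" out-parameter are all made up for illustration. As a simplification, newline bytes directly following a line are skipped, so a "\r\n" split across two reads can produce an empty line which the caller should ignore.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define LINEBUF_SIZE 64         /* hypothetical size */

struct linebuf {
    char buf[LINEBUF_SIZE + 1]; /* +1 so there is always room for a '\0' */
    int bytes_in_buf;           /* bytes of buf that hold data */
    char *nextbuf;              /* data after the line last returned, or NULL */
    int lines_pending;          /* nonzero: a whole line is already buffered */
};

/* First '\r' or '\n' in the first n bytes of s, or NULL. */
static char *find_eol(char *s, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (s[i] == '\r' || s[i] == '\n')
            return s + i;
    return NULL;
}

/*
 * Return the next whole line (zero-terminated, newline removed), or NULL
 * if we don't have one yet. *status gets the read() result, or 1 if no
 * read() was needed.
 */
char *next_line(struct linebuf *lb, int fd, int *status)
{
    char *eol, *end;
    ssize_t n;

    *status = 1;

    /* Move data left over from the previous call down to the start. */
    if (lb->nextbuf != NULL) {
        lb->bytes_in_buf -= (int)(lb->nextbuf - lb->buf);
        memmove(lb->buf, lb->nextbuf, (size_t)lb->bytes_in_buf);
        lb->nextbuf = NULL;
    }

    /* Read more only if a whole line is not already sitting in buf. */
    eol = find_eol(lb->buf, lb->bytes_in_buf);
    if (eol == NULL && lb->bytes_in_buf < LINEBUF_SIZE) {
        n = read(fd, lb->buf + lb->bytes_in_buf,
                 (size_t)(LINEBUF_SIZE - lb->bytes_in_buf));
        *status = (int)n;
        if (n <= 0)
            return NULL;                /* error or EOF: caller decides */
        lb->bytes_in_buf += (int)n;
        eol = find_eol(lb->buf, lb->bytes_in_buf);
    }

    end = lb->buf + lb->bytes_in_buf;
    if (eol == NULL) {
        if (lb->bytes_in_buf < LINEBUF_SIZE) {
            lb->lines_pending = 0;
            return NULL;                /* partial line: wait for more */
        }
        eol = end;                      /* buf is full: split the line here */
    }
    *eol = '\0';                        /* terminate the line */

    /* Remember where the following data starts, skipping newline bytes. */
    lb->nextbuf = (eol < end) ? eol + 1 : end;
    while (lb->nextbuf < end && (*lb->nextbuf == '\r' || *lb->nextbuf == '\n'))
        lb->nextbuf++;
    lb->lines_pending = !!find_eol(lb->nextbuf, (int)(end - lb->nextbuf));
    return lb->buf;
}
```

The caller would start with all fields zero (nextbuf NULL), call select() only while lines_pending is zero, and otherwise call next_line() again straight away. An over-long line is simply split when the buffer fills, so nothing is ever written past the end of buf.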

Q: What do I need in the "struct server" representing one server I'm connected to?

A: Everything the client needs to keep track of its connection to the server, plus also the pointer which is the head of the linked list of users seen on that server.

So this includes the file descriptor of the socket, a buffer for a partially-read line and all the data associated with that, and the lines_pending variable.

Q: So what do I do with all of those static variables in read_line_from_server() in chatclient.c?

A: Put them in the per-server struct. We have a buffer potentially in-progress for each server we're connected to. Similarly, the server has to do something like this for each client.
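As a rough picture of what such a struct could contain, here is one possible layout. Everything about it is an assumption for illustration (the field types, MAXLINE, struct user, and server_init()); your struct need not look like this.

```c
#include <stddef.h>

#define MAXLINE 256             /* hypothetical buffer size */

struct user {                   /* one handle seen on a server */
    char *handle;
    struct user *next;
};

struct server {
    int fd;                     /* socket connected to this server */
    char buf[MAXLINE];          /* partially-read line data */
    int bytes_in_buf;           /* how much of buf is in use */
    char *nextbuf;              /* data after the last returned line */
    int lines_pending;          /* a whole line is waiting in buf */
    struct user *users;         /* head of list of users seen here */
};

/* Hypothetical helper: set up a struct server for a new connection. */
void server_init(struct server *s, int fd)
{
    s->fd = fd;
    s->bytes_in_buf = 0;
    s->nextbuf = NULL;
    s->lines_pending = 0;
    s->users = NULL;
}
```

The point is that each connected server gets its own copy of the buffer state, so a partial line from one server is never mixed with data from another.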

Q: What should I replace [...something in a socket example or some code from the net...] by in my code?

A: No. You should not do this. You should not start with some other program and replace things. You'll never get your program working that way.

Write only things which you want your program to do, for a good reason. Of course you can copy bits of code from elsewhere, but only because you know what they do and you have a good reason for wanting your program to do them.

(And copied code other than from me or the textbook needs to be cited appropriately, of course.)

Q: How do you run gdb if your program needs command-line arguments?

A: The command-line arguments appear as arguments to the "run" command in gdb.

        $ gdb a.out
        (gdb) run localhost 1234

And note that you can indeed use gdb quite normally on your a4 programs. In some of assignment three, gdb was sometimes not very effective because of the forking. Here we still have multiple processes (client and server), but they're manually invoked separately and there is no subsequent forking; gdb works fine.

Q: How do I connect to a chatsvr using nc?

A: nc hostname portnumber
e.g. if you are running chatsvr on a computer called "glop" and it is listening on port number 2345, type "nc glop 2345". Or you can use "localhost" if applicable, e.g. "nc localhost 2345".

Q: How can I "play chatsvr" so that my chatbridge can connect to me?

A: nc -l portnumber

Q: Do we have to check all system calls for error return values?

A: Yes, pretty much. Note that if you don't check error return values, you won't get good error messages and this will impede your debugging of your program. So put in the error checks from the start, and call perror() properly from the start too for the same reason.

However, you are not required to check the write()s for errors for this assignment.

Q: Is the chatbridge program also an infinite loop?

A: Yes, pretty much. It exits in the case of various kinds of errors, and also upon EOF from a server.

Q: I keep getting the error "bind: Address already in use" when I try to use /u/csc209h/summer/pub/a4/chatsvr.

A: The port number you are trying to use is being used by someone else! Pick another one by using the '-p' option.

Q: Now it's saying "bind: Permission denied".

A: Your port number must be at least 1024. Port numbers below 1024 are reserved for system services, as discussed in lecture (please feel free to ask about this in office hours or by e-mail).

Q: Once I get data from one file descriptor, I can't seem to receive data from any of the other file descriptors any more.

A: Sometimes this is the result of passing an old fd_set value back to select(). When select() returns, it has modified its second parameter (the read fd_set value); this parameter is an "in/out" parameter. The modified value is not suitable for passing to select() again.

I suggest you put the FD_ZERO and list of FD_SETs in your main loop, just before the select() call. Just set up the fd_set value every time.

The Haviland et al textbook discusses a different strategy, in which we copy the fd_set value from a master list every time. This saves the setting-up overhead (or, to be precise, it turns it into a single variable assignment, albeit a struct assignment). This is good for a program with a large number of connected sockets, in which the FD_ZERO and list of FD_SETs would be significant. But...

Don't do it. Keep It Simple. The textbook's strategy introduces a situation of what some call "parallel data structures": your other variables which are keeping track of the connection sockets contain data which is equivalent in semantics to this master fd_set value. The problem with parallel data structures is that it's easy for them to get out of sync, and then you have a bug. If they remain perfectly synchronized you still have additional program coding requirements resulting from having to maintain the parallel data structures. Parallel data structures should be avoided except when there is a compelling reason for them (which usually in this case would be an efficiency concern). There are only two file descriptors in the set in this program, so don't do it.

Q: I got an unexpected message from my server saying that someone else had connected (or sent something).

A: Someone else did connect (or send something). It could have been anyone on the planet with internet access. Most likely it was another CSC 209 student.

Q: Can I modify chatsvr.h to add declarations? Or can I submit a "chatbridge.h"?

A: No. Any declarations you are wanting to put in a chatsvr.h file should simply go towards the top of your chatbridge.c.
.h files in C are for coordination between files; you are writing only one .c file. The C pre-processor just copies the file in when you #include it; #include doesn't provide anything you can't just do in a single C file.

Q: When the handout says "everything which anyone types gets relayed to everyone", it really means everyone else, right?

A: No, messages get relayed back to the sender too. You can argue about whether or not this is a good thing (personally, I like the transparency of it), but it does make the server slightly simpler. In the case of assignment four, it also makes any bug in your code for not relaying messages from "bridge" show up more quickly. (Which is good: the bug is bad, but evidence of the bug is good because it helps you fix the bug.)

This is different to the chatbridge behaviour; chatbridge never relays a message from a server back to that same server; it relays to all other chat servers on which it has seen someone with that handle.
