cs527

Thursday, November 12, 2009

Branch-and-Bound, Graphical Models, Structured Grids

Branch and bound pattern involves dividing the problem into smaller sub-problems, using the problem's definition to set constraints that must be met by sub-problems and eliminating the sub problems that do not satisfy the constraints. This is all done with the four steps: branching, evaluation, bounding and pruning. The theoretical aspects of the pattern are very well explained, but more code examples would have been helpful for understanding.

I did not understand exactly how the Graphical Models pattern solves the problem statement. In many problems we use measurements that can be performed to infer things and make conclusions. This is usually done with some kind of probability distribution as used in this pattern. But the solution is not clear to me. It is unfortunate that they did not include any examples to illustrate the solution.

I think the Structural Grids pattern is basically about using geometric decomposition to decompose the problem into smaller chunks and have processing elements operate on separate chunks in parallel. It looks like this is applicable to image processing, but I did not fully understand the example presented.

Shared Queue, Speculation, Digital Circuits

Shared Queue:

What experience have you had with shared queues, and what factors did you use to decide whether to use single vs. multiple queues?
Shared queues are widely used in the system I develop at work. One of the share queues is an “I/O completion status queue” shared by multiple processors that process completed I/O operation. We use a single queue, because back in the days there were only 2 processors. Now with the increase number of processors sharing the queue we are getting scalability issues because of excess contention.

An enqueue or dequeue seems like a very small operation. Why is there so much concern over contention to the queue? What factors cause this?
They are basic operations on a sequential program, when the queue is accessed concurrently appropriate synchronization must be implemented

Seeing as how the authors lay out quite a bit of code to support the variations of these patterns, they seem to mitigate their own concern as to the value of "simple protocols". Why would someone opt for the less efficient implementation, given that the concerns are already outlined, and sample code available?
It depends on the requirements of the application. If performance is not a major concern, I would go with a simple implementation even if the more efficient, but complicated code already exists. In my work environment we have performance numbers that we need to meet and we have to meet them even if it means writing an ugly code as long as it performs good.

Speculation:

Speculation seems like a pretty straightforward pattern, though potentially a quite powerful one. What factors would you consider to decide if speculation is an appropriate/beneficial pattern to apply?
I agree this seems like a very powerful pattern. It uses the idea of hardware speculation implemented in almost all modern processors. The factor to consider here is the cost of wrong speculations and the probability of making such wrong speculations. If the cost of wrong speculation is minimal compared to the gain of good speculation, then this pattern can produce great results.

Seeing as how you may not have data about the statistical distribution of inputs to expect before you build the program, and therefore perhaps not know the average cost of rollback, etc., is it prudent to implement the speculation pattern at the outset, or wait to evolve the application to include speculative optimizations down the road? What if the performance increase that speculation has the potential to generate is necessary to meet the minimum requirements of the specification? Is speculation still a viable pattern to apply, or would effort be better spent on other parallel optimizations with more certain outcome?
Speculation is a very powerful technique and I think that, for applications with predictable inputs, it is worth investigating if speculation can help meet the requirements.

Circuits:

Was this pattern what you expected from the name?

No, but after realizing that it was about bitwise operations, the name make more sense because digital circuits deal with bitwise operations.
I am a low level firmware programmer and I deal with bits operations on a regular basis. The concepts explained in the patterns are widely used by low level programmers: Large inputs are divided into smaller blocks and the bitwise operations are applied to the blocks independently to perform the calculation.

Tuesday, November 10, 2009

PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

This is one of the most interesting papers I have read so far in this class. The problem of reproducing bugs is one that I face on a regular basis at work. The code that I write runs on a card (“small computer”) embedded in a big server and is responsible for driving I/O operations between the server and storage over a fiber channel network. We drive on average 100.000 I/O operations per second. Concurrency and timings bugs are common and reproducing them is painful and not always possible. We have been working for years to improve our ability to reproduce problems, but externals timings condition (On the server, network and storage) makes it even more challenging. The system that we currently have mainly consists of logging huge amount of data so that the error can reproduce. However, this represents a significant overhead and is only used under development and test. It is impossible to add so many traces in production code because customers cannot tolerate the performance impact.

The paper presents the idea of recording only partial information (sketches) during production and using an intelligent replayer during debugging. The sketches are used as a guideline to the replayer to reproduce the bug. The tool uses a different idea than most of the bug reproduction tools. The main goal here is to reproduce the bug by finding a combination that leads to the bug and not to reproduce the exact execution path and states that led to the bug. This is a great approach because different executions path and states can lead to the same bugs and this approach increases the changes of reproducing the bug with fewer replays.

Many reproductions are usually necessary for the programmer to find the root cause of complex bugs. So, one of the major advantages of this tool is that it guaranties reliably reproduction of the bug after the first successful reproduction.

Loop Parallelism, Task Queue, Graph Partitioning

Loop Parallelism
1. The author says the goal of incremental parallelization is to "evolve" a sequential program into a parallel program. If you have experience in refactoring code from sequential to parallel, how did your approach differ from the steps presented in the paper?

We currently have a project at work where we are transforming a sequential firmware to a parallel program. The approach we are taking is the same as described on this paper. We are incrementally applying transformations to selected area of the code to achieve parallelism. This is approach is the only option we have because people are not opened to a complete rewrite of the system. The advantage of incremental parallelization is that it is easy to debug and test the program because the code should still work correctly after each change. We are running our regular regression test after each refactoring and that give us confidence after each step.

2. Many of us have used Eclipse or another IDE, with their respective plugins, for execution profiling. What has been your experience with these tools? What is your preferred method of locating the code in question (automatic or manual)?

In this class we have read about many tools in eclipse that can be helpful because the work in automated. I prefer to use tested tools if they are available. Manual work is painful, but sometimes that is the only option available.

3. The author states, "The application of many transformations to many loops may be necessary before efficient performance is achieved." What is your opinion on this statement?

That is expected because of the Amdahl law as mentioned by the author. Enough code needs to be optimized to achieve the desired speedup. So it is good to target the areas with greater parallelization potential first.

Task Queue Implementation Pattern
1. How would you compare the Task Queue pattern to other similar distribution patterns like divide-and-conquer and fork/join pattern?

I think this pattern has the same idea as divide-and-conquer and fork/join pattern. They all involve dividing the work into independent subtasks and executing them in parallel. I think the task queue can be used to schedule the subtasks created by divide-and-conquer and fork/join.

2. It was nice to see a pattern with some actual code examples. How important were the examples in understanding the concepts presented in this paper?

The idea of the pattern is straightforward: First divide the work into independent units, then decide between eager and lazy scheduling, and finally decide on centralized or decentralize task queue. The author explains the tradeoffs between simplicity of implementation (centralized task queue) and scalability (decentralized task queue) the code examples presented were helpful as an illustration of the idea presented.

Graph Partitioning Implementation Strategy Pattern
1. This was probably one of the more math-oriented patterns I have seen and/or read. How often have you experienced a situation where the graph partition strategy would have been useful? If never, can you think of any future situations where this pattern might be come in handy and could be applied?

I thought I was reading a chapter of the “hard” algorithm's class I am taking this semester. I have a very good understanding of various problems in graph theory and I think this pattern did a great job at explaining the theory. I don’t have any experience with real world applications that uses the Graph partitioning strategy. More example of applications of this pattern would have been helpful.

Thursday, November 5, 2009

Pipeline, Geometric Decomposition, Data Parallelism

The three patterns describe how to achieve parallelism by dividing the data or dividing the operations that are performed on the data. I think all parallel patterns can be classified into the two categories: data parallel patterns and task parallel patterns.

The pipeline pattern is probably one of the most used patterns. Pipeline can be used in any system where a sequence of identical operations is processed in stages. I have used this pattern in a network adapter’s code to significantly reduce latency and increase throughput of sending and receiving packet over a fiber channel network.

The geometric decomposition pattern involves decomposing the data structure into chunks and decomposing the update operations on the data into tasks with each task managing a set of chunks. The biggest challenge in this pattern is to ensure that non local data required to perform an update operation is available before the update operation. So data exchange mechanism is critical for this pattern as it affects the potential parallelism.

The Data parallelism pattern advocates treating the problem as a single stream of operations applied to each element of a data structure. This is a similar idea as in the geometric decomposition, but here there are no details about the implementation of the pattern. I am not sure if the solution can be useful without a combine stage. The idea of applying instructions to each element of the data structure is good, but the results need to be combined to produce the final result.

Tuesday, November 3, 2009

Armstrong thesis chapter 5

The concept that I like the most in this chapter is the notion of having supervision hierarchies. The tree representation is a great way of distributing error recovery responsibilities to nodes, so that each node can monitor the errors in its child nodes. The error recovery mechanism described in this chapter is very robust and I think this approach will allow building very robust software systems. The difficulty that I see in real world application is that it is not always clear to the developers what an error is. The error recovery mechanism here requires the programmers to know exactly what an error is so that an error recovery procedure is initiated whenever that error occurs. This is not an easy task because most of time it only after a crash that the programmer becomes aware of some type of errors.

The author gives the notion of well-behaved functions(WBF) as a way of determining what an error is. WBF are functions that should not generate an exception unless an error is encountered. So this allows the programmer to interpret any exception generated by a WBF as an error. Amount the rules to follow when writing WBF, I found rule 2 interesting. The author advocates raising an exception when the specification is not clear about what to do. I think this is a better approach to determine and address specification errors instead of the programmer making assumptions about the design.

Task Parallelism, Recursive Splitting, Discrete Event

Task Parallelism

1.- When you first saw this pattern title, did you feel like you knew the subject beforehand? After reading the pattern, did you confirm your impression or did it surprise you? how?

The title gives a pretty good idea on what to expect. I knew the pattern will describe a solution that consists of dividing a problem into a series of task and solving then concurrently. I did not directly think about defining tasks with data distribution as a primary concern.

2.- In what category falls this pattern?

I am not sure how to classify this pattern. In order to produce the tasks, this pattern can use both the structural pattern and/or computational patterns that describe the problem.

3.- What platform do you have more experience programming parallel applications on? And what Task Management Mechanism? (Jobs over Networks, OS Processes, SW Task Queue, HW Task Queue)

I am working on a prototype to convert a sequential I/O firmware to a parallel application on an embedded I/O adapter consisting of 2 powerPC cores. I have not done much yet that I can share. I am still in the process of getting embedded Linux to boot on the platform before I start to change the current application.

5.- Do you think that this patterns requires you to learn more about hardware platforms in order to make a correct implementation?

The pattern requires the user to have an understanding of both the application’s concurrency characteristics and the hardware platform characteristics in order to arrive at the solution. I think this is good idea because many systems do not perform at their full potential because the programmers only focus on understanding one side. I worked in a system where knowing the cache replacement algorithm of the CPU allowed us to make changes in the application that led to a urge performance improvement. In parallel programming I believe that a good solution is not possible without a good understanding the parallel hardware characteristics.

Recursive Splitting

This pattern describes a solution to problems that can be recursively decomposed in a series of independent tasks. The recursive splitting problems can be solved in parallel hardware by having nodes work concurrently on different recursively generated task. The major issue to be addressed while applying this pattern is to come up with a choice of splitting that make efficient use of the hardware resources to produce the solution. The author mentions the tradeoff between getting greater efficiency by selecting larger base cases, versus getting greater opportunity for concurrency by selecting small base cases. The author provides good explanations on how to implement the solution and states the cases where the solution is not applicable. The premise for the pattern to work is to recursively generate more than one task per tasks. So problem like selection sort where only one task is recursively generated are not applicable.

Discrete Event:

This pattern is similar to the Event-based except that here, more consideration is given to ordering constraint and a solution is given that provides a simple way of dealing with the constraints. I agree with the author that often the best approach to deal with deadlocks is to use timeouts. I work on a real time system where we use timeouts to make the system predictable. It is easier to implement and works very good in our environment.